Safaa Dabagh

Lesson 4: Choosing the Right Chi-Square Test

Decision Making, Conditions, and Common Mistakes

Learning Objectives

By the end of this lesson, you will be able to:

Decision Flowchart: Which Chi-Square Test?

START: What type of data do you have?

  • Quantitative data? Use a t-test or ANOVA, NOT chi-square!
  • Categorical data? Ask two questions: How many categorical variables are you analyzing (ONE or TWO)? How many samples/populations?

Then match the combination to a test:

  • ONE variable + ONE sample → Goodness of Fit (testing if data fits an expected distribution)
  • TWO variables + ONE sample → Independence (testing if variables are associated)
  • ONE variable + MULTIPLE samples → Homogeneity (comparing distributions across groups)

Quick Comparison Table

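Feature | Goodness of Fit | Independence | Homogeneity
Variables | 1 categorical | 2 categorical | 1 categorical
Samples | 1 sample | 1 sample | 2+ samples
Research Question | Does data fit expected distribution? | Are the variables related? | Do groups have same distribution?
Expected Frequency | E = n × p | E = (RT × CT) / GT | E = (RT × CT) / GT
Degrees of Freedom | df = k - 1 | df = (r-1)(c-1) | df = (r-1)(c-1)
Example | Is a die fair? | Are gender and party preference related? | Do three cities have same political views?

Notation: n = sample size, p = hypothesized proportion for a category, k = number of categories, RT = row total, CT = column total, GT = grand total, r = number of rows, c = number of columns.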
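
The two-way formulas in the comparison table, E = (RT × CT) / GT and df = (r-1)(c-1), can be sketched in a few lines of Python. This is a minimal illustration, not part of the lesson: the function names and the 2×3 table of counts are hypothetical.

```python
def expected_counts(table):
    """Expected counts for a two-way table: E = (row total * column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]


def degrees_of_freedom(table):
    """df = (r - 1)(c - 1), used for both independence and homogeneity."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Hypothetical observed counts (2 rows x 3 columns), for illustration only.
observed_table = [
    [30, 50, 20],
    [20, 50, 30],
]

expected_table = expected_counts(observed_table)
table_df = degrees_of_freedom(observed_table)

print(expected_table)  # [[25.0, 50.0, 25.0], [25.0, 50.0, 25.0]]
print(table_df)        # 2
```

The same two functions serve independence and homogeneity, since the tests differ in study design rather than in arithmetic.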

Practice Identifying the Test

Scenario 1

A genetics researcher observes the flower colors in 200 pea plants. According to Mendelian genetics, the colors should appear in a 3:1 ratio (red to white). The researcher wants to test if the observed data matches this theoretical ratio.

Test: Goodness of Fit

Reasoning:

  • ONE variable: flower color (categorical)
  • ONE sample: 200 pea plants
  • Question: Does the observed distribution match the expected 3:1 ratio?

Setup: df = k - 1 = 2 - 1 = 1 (two categories: red and white)
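
As a rough sketch of the setup above, the expected counts follow from E = n × p, where each p comes from the 3:1 ratio. The helper name `gof_expected` is hypothetical; the scenario gives no observed counts, so none are invented here.

```python
from fractions import Fraction


def gof_expected(n, ratio):
    """Expected counts E = n * p, with each p taken from a ratio such as 3:1."""
    total = sum(ratio)
    return [n * Fraction(part, total) for part in ratio]


# 200 pea plants, expected 3:1 ratio of red to white.
pea_expected = gof_expected(200, [3, 1])
pea_df = len(pea_expected) - 1  # df = k - 1

print([int(e) for e in pea_expected])  # [150, 50]
print(pea_df)                          # 1
```

Both expected counts (150 red, 50 white) are well above 5, so the expected frequency condition holds for this scenario.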

Scenario 2

A marketing team surveys 500 random consumers and records both their age group (18-34, 35-54, 55+) and their preferred shopping method (online, in-store, both). They want to know if age and shopping preference are related.

Test: Independence

Reasoning:

  • TWO variables: age group AND shopping preference (both categorical)
  • ONE sample: 500 consumers classified by both variables
  • Question: Are age and shopping preference associated?

Setup: df = (r-1)(c-1) = (3-1)(3-1) = 4

Scenario 3

A school district wants to compare student satisfaction across three different schools. They randomly sample 100 students from School A, 100 from School B, and 100 from School C. Each student rates satisfaction as Low, Medium, or High. The district wants to know if satisfaction levels are the same across all three schools.

Test: Homogeneity

Reasoning:

  • ONE variable: satisfaction level (categorical)
  • MULTIPLE samples: three separate samples from three schools
  • Question: Do the three schools have the same distribution of satisfaction?

Setup: df = (r-1)(c-1) = (3-1)(3-1) = 4 (3 schools × 3 satisfaction levels)

Scenario 4

A university randomly assigns 150 students to three different study techniques (50 per technique). After the exam, students are classified as passing or failing. Researchers want to know if the three techniques have different pass/fail distributions.

Test: Homogeneity

Reasoning:

  • ONE variable: pass/fail (categorical)
  • MULTIPLE samples: students assigned to three different techniques (predetermined group sizes)
  • Question: Do the three techniques produce the same distribution of outcomes?

Setup: df = (r-1)(c-1) = (3-1)(2-1) = 2

Note: Random assignment to groups is a key indicator of homogeneity!

Scenario 5

A casino manager rolls a die 600 times to verify it's fair. All six outcomes should be equally likely if the die is fair.

Test: Goodness of Fit

Reasoning:

  • ONE variable: die outcome (categorical: 1, 2, 3, 4, 5, or 6)
  • ONE sample: 600 rolls
  • Question: Do the observed frequencies fit the expected equal distribution?

Setup: df = k - 1 = 6 - 1 = 5, Expected for each outcome = 600/6 = 100
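
To make the setup concrete, here is a minimal sketch of the goodness-of-fit statistic, χ² = Σ (O - E)² / E, for the die. The observed counts are hypothetical (the scenario does not report any), and the function name is an assumption for illustration.

```python
def chi_square_statistic(observed, expected):
    """Chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


rolls = 600
faces = 6
expected_die = [rolls / faces] * faces  # 100 expected per face

# Hypothetical observed counts for faces 1-6 (they sum to 600).
observed_die = [90, 110, 95, 105, 100, 100]

die_stat = chi_square_statistic(observed_die, expected_die)
die_df = faces - 1

print(die_stat)  # 2.5
print(die_df)    # 5
```

With df = 5, the critical value at the 0.05 level is about 11.07, so a statistic of 2.5 from these made-up counts would give no evidence against a fair die.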

Conditions for ALL Chi-Square Tests

Required Conditions Checklist

Before conducting ANY chi-square test, verify these conditions:

1. Random Sampling (or Random Assignment)

Data must come from a random sample or random assignment to groups. This ensures observations represent the population and aren't biased.

What if violated? Results cannot be generalized to the population.

2. Independence of Observations

Each observation must be independent - one observation shouldn't influence another.

What if violated? Test results become unreliable; the true p-value may differ from the calculated p-value.

Common violations: Repeated measures on same subjects, clustered sampling without adjustment, paired data.

3. Expected Frequencies ≥ 5 (CRUCIAL!)

ALL expected cell counts must be at least 5. This is the most commonly checked condition.

Why? The chi-square distribution is only a good approximation when expected counts are sufficiently large.

What if violated? Options:

  • Combine categories (if logically reasonable)
  • Collect more data
  • For 2×2 tables, use Fisher's exact test instead

4. Large Enough Sample Size

General guideline: Total sample size should be at least 5 times the number of cells.

Example: For a 3×4 table (12 cells), you should have n ≥ 60.
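
The guideline is simple arithmetic, sketched below. The function name and the `per_cell` parameter are hypothetical, and this check is a rule of thumb only; it does not replace checking the expected count in each cell.

```python
def meets_sample_size_guideline(n, rows, cols, per_cell=5):
    """Rule of thumb: total sample size should be at least per_cell * number of cells."""
    return n >= per_cell * rows * cols


minimum_n = 5 * 3 * 4  # 3x4 table has 12 cells

print(minimum_n)                                # 60
print(meets_sample_size_guideline(60, 3, 4))    # True
print(meets_sample_size_guideline(59, 3, 4))    # False
```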

5. Categorical Data (Counts/Frequencies)

Use actual counts, NOT proportions or percentages.

Common mistake: Entering percentages instead of raw counts.

Common Mistakes to Avoid

Mistake #1: Using Proportions Instead of Counts

Wrong: Entering 0.45 (45%) into the chi-square formula

Right: Entering 45 (the actual count of observations)

Why it matters: Chi-square tests require raw frequencies. Convert percentages back to counts first!
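
A minimal sketch of the conversion, assuming the sample size is known. The sample size of 100 is a hypothetical value chosen so that 45% corresponds to a count of 45, as in the example above; the function name is also an assumption.

```python
def percentages_to_counts(percentages, n):
    """Convert percentages back to raw counts for a sample of size n."""
    counts = [round(p * n / 100) for p in percentages]
    if sum(counts) != n:
        # Rounding can leave the counts off by one or two; recover the raw data instead.
        raise ValueError("Converted counts do not sum to n; check the original data.")
    return counts


# Hypothetical sample of n = 100 with 45% in one category and 55% in the other.
raw_counts = percentages_to_counts([45, 55], 100)

print(raw_counts)  # [45, 55]
```

Whenever the original counts are available, use them directly; back-converting from rounded percentages is a fallback.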

Mistake #2: Not Checking Expected Frequency Condition

Problem: Calculating χ² without verifying all E ≥ 5

Fix: ALWAYS calculate expected frequencies first and check the condition before proceeding

Example: If you have a cell with E = 3.2, you need to either combine categories or collect more data
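
One way to make this check routine is to flag every cell whose expected count falls below 5 before computing χ². The sketch below is illustrative: the function name is hypothetical, and the expected counts are made up except for the E = 3.2 cell taken from the example above.

```python
def cells_below_minimum(expected, minimum=5):
    """Return (index, expected count) for every cell with E below the minimum."""
    return [(i, e) for i, e in enumerate(expected) if e < minimum]


# Hypothetical expected counts; the third cell is the E = 3.2 case.
expected_cells = [12.4, 8.9, 3.2, 15.5]

problem_cells = cells_below_minimum(expected_cells)

print(problem_cells)  # [(2, 3.2)]
```

An empty list means the condition is met; any flagged cell means combining categories or collecting more data before running the test.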

Mistake #3: Confusing Independence and Homogeneity

Problem: These tests use identical calculations but answer different questions

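Fix: Look at the study design:

  • ONE sample, with each individual classified by two variables → Independence
  • MULTIPLE samples (or randomly assigned groups) compared on one variable → Homogeneity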

Mistake #4: Using Left-Tailed or Two-Tailed Tests

Problem: Trying to run a left-tailed or two-tailed test with chi-square

Fix: Chi-square tests are ALWAYS right-tailed. Deviations in either direction (observed counts above or below expected) make χ² larger, so only large χ² values count as evidence against the null hypothesis.

Why? We square the differences (O - E)², so negative deviations become positive.

Mistake #5: Using Chi-Square for Quantitative Data

Problem: Trying to use chi-square on means or measurements

Fix: Chi-square is for categorical data only!

Mistake #6: Claiming Causation from Association

Problem: Concluding that one variable causes another just because they're associated

Fix: Association ≠ Causation! Chi-square tests only detect relationships, not cause-and-effect

Example: Finding that ice cream sales and drowning deaths are associated doesn't mean ice cream causes drowning (both are related to warm weather - a confounding variable)

Chi-Square vs. Other Tests

Data Type | Question Type | Appropriate Test
Categorical (counts) | Does distribution match expected? | Chi-square goodness of fit
Categorical (counts) | Are two variables related? | Chi-square independence
Categorical (counts) | Same distribution across groups? | Chi-square homogeneity
Quantitative (means) | Compare two group means | Two-sample t-test
Quantitative (means) | Compare 3+ group means | ANOVA
Quantitative (two variables) | Is there a linear relationship? | Linear regression / Correlation
Categorical (proportions) | Compare two proportions | Two-proportion z-test

Practice Decision-Making

Quick Decision Strategy

Ask yourself these questions in order:

  1. Is the data categorical (counts/frequencies)?
    • No → Not chi-square (use t-test, ANOVA, or regression)
    • Yes → Continue to #2
  2. How many categorical variables?
    • One → Continue to #3
    • Two → Likely Independence (if one sample) or Homogeneity (if multiple samples); confirm with #3
  3. How many samples?
    • One sample, one variable → Goodness of Fit
    • One sample, two variables → Independence
    • Multiple samples, one variable → Homogeneity
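
The questions above can be sketched as a small function. This is only an illustration of the decision order: the function name, its parameters, and the fallback message are hypothetical, and real study designs may need more judgment than three inputs capture.

```python
def choose_chi_square_test(is_categorical, n_variables, n_samples):
    """Walk through the three questions in order and name the matching test."""
    # Question 1: is the data categorical (counts/frequencies)?
    if not is_categorical:
        return "Not chi-square (use t-test, ANOVA, or regression)"
    # Multiple samples compared on a categorical outcome.
    if n_samples >= 2:
        return "Homogeneity"
    # One sample: the number of variables decides.
    if n_variables == 1:
        return "Goodness of Fit"
    if n_variables == 2:
        return "Independence"
    return "Unclear: re-examine the study design"


print(choose_chi_square_test(True, 1, 1))   # Goodness of Fit  (Scenario 1: pea plants)
print(choose_chi_square_test(True, 2, 1))   # Independence     (Scenario 2: age vs. shopping)
print(choose_chi_square_test(True, 1, 3))   # Homogeneity      (Scenario 3: three schools)
```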

Key Takeaways

Remember These Points
