Safaa Dabagh

Lesson 2: Chi-Square Test of Independence

Testing Whether Two Categorical Variables Are Associated

Learning Objectives

By the end of this lesson, you will be able to:

What is Independence?

In statistics, two categorical variables are independent if knowing the category of one variable doesn't give us any information about the other variable.

Independent Variables

No association or relationship

Example: Hair color and favorite ice cream flavor are likely independent - knowing someone's hair color doesn't help predict their ice cream preference.

Dependent (Associated) Variables

There IS an association

Example: Education level and income are likely dependent - higher education tends to be associated with higher income.
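One way to see what independence looks like in counts: if two variables are independent, the distribution of one variable is the same within every category of the other. The sketch below uses hypothetical counts invented for illustration (they are not from this lesson), and `row_proportions` is a helper name chosen here.

```python
# Hypothetical counts, invented for illustration only (not from the lesson).
# Rows: hair color; columns: favorite flavor (vanilla, chocolate).
table = {
    "Brown hair": [30, 60],
    "Blond hair": [10, 20],
}


def row_proportions(counts):
    """Return each count divided by its row total."""
    total = sum(counts)
    return [count / total for count in counts]


# Under perfect independence, every row shows the same proportions.
for label, counts in table.items():
    print(label, [round(p, 3) for p in row_proportions(counts)])
```

Both rows give the same flavor proportions, so in this made-up table knowing hair color tells us nothing about flavor. Real samples rarely match this exactly, which is why a formal test is needed.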

When to Use Test of Independence

Research question pattern: "Is there a relationship between [variable 1] and [variable 2]?"

Contingency Tables (Two-Way Tables)

A contingency table (also called a two-way table) displays the frequency distribution of two categorical variables:

                 Variable 2 →
Variable 1 ↓     Category A    Category B    Row Total
Category 1       Count         Count         Total
Category 2       Count         Count         Total
Column Total     Total         Total         Grand Total
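A contingency table can be built by tallying raw observations. The following is a minimal sketch, assuming each subject is recorded as a (variable 1, variable 2) pair; the observations are hypothetical and the variable names are chosen here for illustration.

```python
from collections import Counter

# Hypothetical raw observations: one (variable 1, variable 2) pair per subject.
observations = [
    ("Category 1", "Category A"),
    ("Category 1", "Category B"),
    ("Category 2", "Category A"),
    ("Category 1", "Category A"),
    ("Category 2", "Category B"),
    ("Category 2", "Category B"),
]

# Count how many subjects fall in each (row, column) cell.
cell_counts = Counter(observations)
row_labels = sorted({row for row, _ in observations})
col_labels = sorted({col for _, col in observations})

# Observed counts as a list of rows, in label order.
observed = [[cell_counts[(r, c)] for c in col_labels] for r in row_labels]

# Marginal totals, matching the "Row Total" and "Column Total" entries.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

print(observed, row_totals, col_totals, grand_total)
```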

Step-by-Step Procedure

Step 1: State the Hypotheses

H₀ (Null Hypothesis): The two variables are independent (no association)

Hₐ (Alternative Hypothesis): The two variables are dependent (there is an association)

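Step 2: Check Conditions

  • The data come from a random sample or a randomized experiment
  • Every expected frequency is at least 5 (confirm this once the expected frequencies are calculated in Step 3)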

Step 3: Calculate Expected Frequencies

Expected Frequency Formula for Each Cell:

E = (Row Total × Column Total) / Grand Total

Calculate this for EVERY cell in the table
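The expected-frequency formula can be applied to every cell at once. Below is a minimal sketch, assuming the observed counts are stored as a list of rows without the totals; `expected_counts` is a name chosen here and the 2×2 counts are hypothetical.

```python
def expected_counts(observed):
    """Apply E = (row total * column total) / grand total to every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical observed counts (not from the lesson).
observed = [
    [10, 20],
    [30, 40],
]
expected = expected_counts(observed)
print(expected)
```

Note that the expected counts keep the same row and column totals as the observed counts; only the way the totals are spread across the cells changes.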

Step 4: Calculate the Test Statistic

Chi-Square Test Statistic:

χ² = Σ[(O - E)² / E]

Sum over ALL cells in the contingency table, where O is the observed count and E is the expected count for that cell
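The sum over cells translates directly into code. This is a sketch under the assumption that observed and expected counts are stored as lists of rows with matching shapes; `chi_square_statistic` is a name chosen here and the numbers are hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over every cell of the table."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical observed counts and their expected counts under independence.
observed = [[10, 20], [30, 40]]
expected = [[12.0, 18.0], [28.0, 42.0]]

statistic = chi_square_statistic(observed, expected)
print(round(statistic, 3))
```

Each cell contributes a non-negative amount, so the statistic grows as the observed counts move further from what independence would predict.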

Step 5: Find Degrees of Freedom

Degrees of Freedom:

df = (r - 1)(c - 1)

Where: r = number of rows, c = number of columns
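A common slip is counting the total row or total column when finding r and c. The short sketch below assumes the table holds only the category cells (no totals); `degrees_of_freedom` is a name chosen here.

```python
def degrees_of_freedom(observed):
    """Return (r - 1)(c - 1) for a table of category cells without totals."""
    n_rows = len(observed)
    n_cols = len(observed[0])
    return (n_rows - 1) * (n_cols - 1)


# A table with 2 rows and 3 columns of categories (placeholder counts).
two_by_three = [[1, 2, 3], [4, 5, 6]]
print(degrees_of_freedom(two_by_three))
```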

Step 6: Find the p-value or Critical Value

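Use a chi-square distribution table or technology to find the p-value (the area to the right of χ² under the chi-square curve with the given df) or the critical value for the chosen α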
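For readers using technology rather than a table, the right-tail area can be computed with the Python standard library alone. The sketch below relies on the standard closed-form expressions for whole-number degrees of freedom; `chi_square_p_value` is a name chosen here, not a library function. In practice a statistics package would usually be used instead (for example, SciPy provides `scipy.stats.chi2.sf`).

```python
import math


def chi_square_p_value(statistic, df):
    """Right-tail area P(X >= statistic) for a chi-square with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if statistic <= 0:
        return 1.0
    half = statistic / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^i / i! for i = 0 .. df/2 - 1.
        term = 1.0
        total = 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: two-sided normal tail plus a finite correction sum.
    term = math.sqrt(statistic)
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= statistic / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return math.erfc(math.sqrt(half)) + 2.0 * density * total


# Sanity check against familiar table entries: these critical values
# should give right-tail areas close to 0.05 and 0.01.
print(chi_square_p_value(5.991, 2))
print(chi_square_p_value(6.635, 1))
```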

Step 7: Make a Conclusion

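Reject H₀ if the p-value is less than α (equivalently, if χ² exceeds the critical value); otherwise, fail to reject H₀. Then state your conclusion, in context, about whether the variables are independent or associated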

Complete Example: Gender and Political Party

Example 1: Gender and Political Affiliation

Problem: A political scientist surveys 500 randomly selected voters and records their gender and political party preference. The data are shown below:

                Democrat    Republican    Independent    Row Total
Male            120         90            40             250
Female          100         110           40             250
Column Total    220         200           80             500

Question: At the α = 0.05 significance level, is there evidence that gender and political party preference are associated?

Solution:

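Step 1: State the hypotheses

  • H₀: Gender and political party preference are independent
  • Hₐ: Gender and political party preference are associated

Step 2: Check conditions

  • The 500 voters were randomly selected
  • Every expected frequency must be at least 5 (verified in Step 3)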

Step 3: Calculate expected frequencies

For each cell: E = (Row Total × Column Total) / Grand Total

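Expected frequency calculations:

  • Male & Democrat: (250 × 220) / 500 = 110
  • Male & Republican: (250 × 200) / 500 = 100
  • Male & Independent: (250 × 80) / 500 = 40
  • Female row: both row totals are 250, so the expected counts match the Male row

Observed (O) and expected (E) counts:

          Democrat          Republican        Independent
Male      O=120, E=110      O=90, E=100       O=40, E=40
Female    O=100, E=110      O=110, E=100      O=40, E=40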

All expected frequencies are ≥ 5!

Step 4: Calculate the chi-square test statistic

χ² = Σ[(O - E)² / E] for all 6 cells:

Cell                    O      E      (O - E)²/E
Male & Democrat         120    110    0.909
Male & Republican       90     100    1.000
Male & Independent      40     40     0.000
Female & Democrat       100    110    0.909
Female & Republican     110    100    1.000
Female & Independent    40     40     0.000
Total χ²                              3.818

χ² = 3.818

Step 5: Find degrees of freedom

df = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 1 × 2 = 2

Step 6: Find p-value or critical value

Using a chi-square table with df = 2 and α = 0.05:

Critical value = 5.991

Since our χ² = 3.818 < 5.991, we fail to reject H₀

(Alternatively, p-value ≈ 0.148 > 0.05)

Step 7: Conclusion

At the 0.05 significance level, there is insufficient evidence to conclude that gender and political party preference are associated. The data are consistent with the two variables being independent.
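The arithmetic in Example 1 can be double-checked with a few lines of Python. This is a verification sketch, not part of the original solution; it uses the counts from the table above and the fact that, when df = 2, the right-tail area of the chi-square distribution is exactly exp(-χ²/2).

```python
import math

# Observed counts from Example 1 (columns: Democrat, Republican, Independent).
observed = [
    [120, 90, 40],   # Male
    [100, 110, 40],  # Female
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Step 3: E = (row total * column total) / grand total for each cell.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Step 4: sum (O - E)^2 / E over all six cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Step 5: df = (r - 1)(c - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Step 6: closed-form right-tail area, valid only because df == 2 here.
p_value = math.exp(-chi_square / 2)

print(round(chi_square, 3), df, round(p_value, 3))  # 3.818 2 0.148
```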

Check Your Understanding

Question: In the political party example, what does it mean practically that we failed to reject the null hypothesis?

Answer: We don't have enough evidence to say gender and political party are related

Explanation: Failing to reject H₀ means we don't have sufficient evidence of an association. It does NOT prove independence; it only means our data are consistent with independence. The distinction is important: we never "prove" the null hypothesis, we simply lack evidence against it.

Another Example: 2×2 Table

Example 2: Treatment and Outcome

Problem: A medical researcher randomly assigns 200 patients with migraines to either a new drug treatment or a placebo. After one month, they record whether each patient experienced significant improvement:

                Improved    Not Improved    Row Total
Drug            70          30              100
Placebo         50          50              100
Column Total    120         80              200

Question: At α = 0.01, is there evidence that treatment type and improvement are associated?

Complete Solution:

Step 1: Hypotheses

  • H₀: Treatment type and improvement are independent
  • Hₐ: Treatment type and improvement are associated

Step 2: Check conditions

  • Patients were randomly assigned to the drug or the placebo
  • Every expected frequency must be at least 5 (verified in Step 3)

Step 3: Expected frequencies

Cell                      Calculation          Expected
Drug & Improved           (100 × 120) / 200    60
Drug & Not Improved       (100 × 80) / 200     40
Placebo & Improved        (100 × 120) / 200    60
Placebo & Not Improved    (100 × 80) / 200     40

All expected frequencies ≥ 5

Step 4: Calculate χ²

Cell                      O     E     (O - E)²/E
Drug & Improved           70    60    1.667
Drug & Not Improved       30    40    2.500
Placebo & Improved        50    60    1.667
Placebo & Not Improved    50    40    2.500
Total χ²                              8.333

χ² = 8.333 (the rounded entries above sum to 8.334; the unrounded value is 25/3 ≈ 8.333)

Step 5: Degrees of freedom

df = (2 - 1)(2 - 1) = 1 × 1 = 1

Step 6: Critical value

With df = 1 and α = 0.01, critical value = 6.635

Since 8.333 > 6.635, we REJECT H₀

(p-value ≈ 0.0039 < 0.01)

Step 7: Conclusion

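At the 0.01 significance level, there is sufficient evidence to conclude that treatment type and improvement are associated. In the sample, 70% of patients on the drug improved compared with 50% of patients on the placebo, so the drug appears to be more effective than the placebo.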
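Example 2 can be checked the same way. This is a verification sketch rather than part of the original solution; it uses the counts from the table above and the fact that, when df = 1, the right-tail area of the chi-square distribution equals erfc(√(χ²/2)).

```python
import math

# Observed counts from Example 2 (columns: Improved, Not Improved).
observed = [
    [70, 30],  # Drug
    [50, 50],  # Placebo
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected counts under independence.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic summed over all four cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed-form right-tail area, valid only because df == 1 here.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(round(chi_square, 3), df, round(p_value, 4))  # 8.333 1 0.0039
```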

Check Your Understanding

Question: What would be the degrees of freedom for a contingency table with 4 rows and 5 columns?

Answer: 12

Explanation: df = (r - 1)(c - 1) = (4 - 1)(5 - 1) = 3 × 4 = 12. We subtract 1 from both the number of rows and columns, then multiply.

Interpreting Results

What Does the Test Tell Us?

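If we REJECT H₀ (χ² is large, p-value is small):

  • There is sufficient evidence that the two variables are associated

If we FAIL TO REJECT H₀ (χ² is small, p-value is large):

  • There is insufficient evidence of an association; this does not prove that the variables are independent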

Important Note: Association ≠ Causation! Even if variables are associated, this doesn't prove one causes the other.

Key Takeaways

Remember These Points
