Lesson 2: Chi-Square Test of Independence

Testing Whether Two Categorical Variables Are Associated

Learning Objectives

By the end of this lesson, you will be able to:

Understand when to use the chi-square test of independence
Create and interpret contingency tables (two-way tables)
Calculate expected frequencies for independence tests
Determine degrees of freedom for contingency tables
Perform a complete chi-square test of independence
Interpret results about the relationship between two categorical variables

What is Independence?

In statistics, two categorical variables are independent if knowing the category of one variable doesn't give us any information about the other variable.

Independent Variables

No association or relationship

Example: Hair color and favorite ice cream flavor are likely independent - knowing someone's hair color doesn't help predict their ice cream preference.

Dependent (Associated) Variables

There IS an association

Example: Education level and income are likely dependent - higher education tends to be associated with higher income.

When to Use Test of Independence

Two categorical variables
One sample classified by both variables
Purpose: Test if the two variables are independent or associated

Research question pattern: "Is there a relationship between [variable 1] and [variable 2]?"

Contingency Tables (Two-Way Tables)

A contingency table (also called a two-way table) displays the frequency distribution of two categorical variables:

	Variable 2 →
Variable 1 ↓	Category A	Category B	Row Total
Category 1	Count	Count	Total
Category 2	Count	Count	Total
Column Total	Total	Total	Grand Total
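As a rough sketch of how such a table is tallied from raw data, the Python snippet below counts how many individuals fall into each combination of categories and then adds the row, column, and grand totals. The function name build_contingency_table and the six sample records are made up for illustration.

```python
from collections import Counter

def build_contingency_table(pairs):
    """Tally (variable 1, variable 2) pairs into a two-way table with totals."""
    pairs = list(pairs)
    counts = Counter(pairs)
    row_labels = sorted({r for r, _ in pairs})
    col_labels = sorted({c for _, c in pairs})
    # One list per row category, one entry per column category.
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return row_labels, col_labels, table, row_totals, col_totals, grand_total

# Hypothetical records: each tuple is one individual classified by both variables.
records = [("Category 1", "A"), ("Category 1", "B"), ("Category 2", "A"),
           ("Category 1", "A"), ("Category 2", "B"), ("Category 2", "B")]

rows, cols, table, row_totals, col_totals, grand_total = build_contingency_table(records)
print(rows, cols)
print(table)
print(row_totals, col_totals, grand_total)
```

Each individual is counted exactly once, so the row totals and the column totals both add up to the grand total.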

Step-by-Step Procedure

Step 1: State the Hypotheses

H₀ (Null Hypothesis): The two variables are independent (no association)

Hₐ (Alternative Hypothesis): The two variables are dependent (there is an association)

Step 2: Check Conditions

Random sampling
Independent observations
All expected frequencies ≥ 5

Step 3: Calculate Expected Frequencies

Expected Frequency Formula for Each Cell:

E = (Row Total × Column Total) / Grand Total

Calculate this for EVERY cell in the table
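The formula above can be sketched in a few lines of Python. The function name expected_frequencies and the 2×2 counts are invented for illustration; the table is assumed to be a list of rows with no totals included.

```python
def expected_frequencies(observed):
    """Return E = (row total x column total) / grand total for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Made-up observed counts (rows sum to 40 and 60, columns to 50 and 50).
observed = [[30, 10],
            [20, 40]]

expected = expected_frequencies(observed)
for row in expected:
    print(row)
```

Note that the expected counts keep the same row and column totals as the observed counts; only the way the totals are spread across the cells changes.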

Step 4: Calculate the Test Statistic

Chi-Square Test Statistic:

χ² = Σ[(O - E)² / E]

Sum over ALL cells in the contingency table
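A minimal Python sketch of this sum, assuming the observed and expected counts are stored as lists of rows of the same shape. The function name chi_square_statistic and the numbers are illustrative only.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over every cell of the table."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

# Made-up observed counts and their expected counts under independence.
observed = [[30, 10],
            [20, 40]]
expected = [[20.0, 20.0],
            [30.0, 30.0]]

chi_sq = chi_square_statistic(observed, expected)
print(round(chi_sq, 3))
```

Every cell contributes a non-negative amount, so the statistic grows as the observed counts move away from the expected counts in either direction.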

Step 5: Find Degrees of Freedom

Degrees of Freedom:

df = (r - 1)(c - 1)

Where: r = number of rows, c = number of columns
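One way to see where (r - 1)(c - 1) comes from: once the row and column totals are fixed, only that many cells can be chosen freely, and the rest are forced. The sketch below illustrates this; complete_table is a hypothetical helper and the totals are made up.

```python
def degrees_of_freedom(n_rows, n_cols):
    """df = (r - 1)(c - 1) for an r x c contingency table."""
    if n_rows < 2 or n_cols < 2:
        raise ValueError("need at least 2 rows and 2 columns")
    return (n_rows - 1) * (n_cols - 1)

def complete_table(free_cells, row_totals, col_totals):
    """Rebuild a full table from its top-left (r-1) x (c-1) block and the totals."""
    r, c = len(row_totals), len(col_totals)
    table = [[0] * c for _ in range(r)]
    for i in range(r - 1):
        table[i][:c - 1] = list(free_cells[i])
        # The last cell in each row is forced by the row total.
        table[i][c - 1] = row_totals[i] - sum(free_cells[i])
    for j in range(c):
        # The whole last row is forced by the column totals.
        table[r - 1][j] = col_totals[j] - sum(table[i][j] for i in range(r - 1))
    return table

# Made-up 2 x 3 example: df = 2, so two freely chosen cells fix the other four.
df = degrees_of_freedom(2, 3)
full = complete_table([[25, 15]], row_totals=[50, 50], col_totals=[40, 35, 25])
print(df)
print(full)
```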

Step 6: Find the p-value or Critical Value

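Use a chi-square distribution table or technology. The test is right-tailed: the p-value is the area to the right of the observed χ² under the chi-square distribution with the df from Step 5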
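If "technology" means Python, the right-tail area can be sketched with the standard library alone. The function name chi2_right_tail is an invention for this example; it assumes a whole-number df and builds the answer from two known starting cases (df = 1 and df = 2) plus a standard recurrence for the incomplete gamma function. Statistical packages provide the same quantity directly, for example scipy.stats.chi2.sf in SciPy.

```python
import math

def chi2_right_tail(x, df):
    """P(chi-square with df degrees of freedom >= x), for integer df >= 1."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)               # exact tail area for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))    # exact tail area for df = 1
    # Step df up by 2 at a time: Q(a + 1, h) = Q(a, h) + h^a e^(-h) / Gamma(a + 1)
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1)
        a += 1.0
    return q

# Sanity check against familiar table values: these should be close to alpha.
print(round(chi2_right_tail(5.991, 2), 3))   # critical value for df = 2, alpha = 0.05
print(round(chi2_right_tail(6.635, 1), 3))   # critical value for df = 1, alpha = 0.01
```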

Step 7: Make a Conclusion

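Reject H₀ if the p-value is less than α (equivalently, if χ² exceeds the critical value); otherwise fail to reject H₀. Then state your conclusion about whether the variables are independent or associated, in the context of the problem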

Complete Example: Gender and Political Party

Example 1: Gender and Political Affiliation

Problem: A political scientist surveys 500 randomly selected voters and records their gender and political party preference. The data are shown below:

	Democrat	Republican	Independent	Row Total
Male	120	90	40	250
Female	100	110	40	250
Column Total	220	200	80	500

Question: At the α = 0.05 significance level, is there evidence that gender and political party preference are associated?

Solution:

Step 1: State the hypotheses

H₀: Gender and political party preference are independent
Hₐ: Gender and political party preference are dependent (associated)

Step 2: Check conditions

Random sample of 500 voters
Observations are independent
We'll verify that all expected frequencies are ≥ 5 once they are calculated in Step 3

Step 3: Calculate expected frequencies

For each cell: E = (Row Total × Column Total) / Grand Total

Expected frequency calculations:

Male & Democrat: E = (250 × 220) / 500 = 110
Male & Republican: E = (250 × 200) / 500 = 100
Male & Independent: E = (250 × 80) / 500 = 40
Female & Democrat: E = (250 × 220) / 500 = 110
Female & Republican: E = (250 × 200) / 500 = 100
Female & Independent: E = (250 × 80) / 500 = 40

	Democrat	Republican	Independent
Male	O=120, E=110	O=90, E=100	O=40, E=40
Female	O=100, E=110	O=110, E=100	O=40, E=40

All expected frequencies are ≥ 5!

Step 4: Calculate the chi-square test statistic

χ² = Σ[(O - E)² / E] for all 6 cells:

Cell	O	E	(O - E)²/E
Male & Democrat	120	110	0.909
Male & Republican	90	100	1.000
Male & Independent	40	40	0.000
Female & Democrat	100	110	0.909
Female & Republican	110	100	1.000
Female & Independent	40	40	0.000
Total χ²			3.818

χ² = 3.818

Step 5: Find degrees of freedom

df = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 1 × 2 = 2

Step 6: Find p-value or critical value

Using a chi-square table with df = 2 and α = 0.05:

Critical value = 5.991

Since our χ² = 3.818 < 5.991, we fail to reject H₀

(Alternatively, p-value ≈ 0.148 > 0.05)

Step 7: Conclusion

At the 0.05 significance level, there is insufficient evidence to conclude that gender and political party preference are associated. The data are consistent with the two variables being independent.
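The whole calculation for this example can be double-checked with a short Python sketch. The variable names are our own, and the p-value line uses a closed form, exp(-χ²/2), that is valid only because df = 2 here; other df values need a table or statistical software.

```python
import math

observed = [[120, 90, 40],    # Male:   Democrat, Republican, Independent
            [100, 110, 40]]   # Female: Democrat, Republican, Independent
alpha = 0.05

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Step 3: expected frequencies, and the "all E >= 5" condition from Step 2.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
conditions_ok = all(e >= 5 for row in expected for e in row)

# Step 4: test statistic summed over all 6 cells.
chi_sq = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))

# Step 5: degrees of freedom.
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Step 6: right-tail p-value (closed form for df = 2 only).
p_value = math.exp(-chi_sq / 2)
reject = p_value < alpha

print(f"chi-square = {chi_sq:.3f}, df = {df}, p-value = {p_value:.3f}")
print("Reject H0" if reject else "Fail to reject H0")
```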

Check Your Understanding

Question: In the political party example, what does it mean practically that we failed to reject the null hypothesis?

Answer: We don't have enough evidence to say gender and political party preference are related

Explanation: Failing to reject H₀ means we don't have sufficient evidence of an association. It does NOT prove independence - it just means our data is consistent with independence. The distinction is important: we never "prove" the null hypothesis, we simply lack evidence against it.

Another Example: 2×2 Table

Example 2: Treatment and Outcome

Problem: A medical researcher randomly assigns 200 patients with migraines to either a new drug treatment or a placebo. After one month, they record whether each patient experienced significant improvement:

	Improved	Not Improved	Row Total
Drug	70	30	100
Placebo	50	50	100
Column Total	120	80	200

Question: At α = 0.01, is there evidence that treatment type and improvement are associated?

Complete Solution:

Step 1: Hypotheses

H₀: Treatment type and improvement are independent
Hₐ: Treatment type and improvement are associated

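Step 2: Conditions

Patients were randomly assigned to the drug or the placebo
Observations are independent
All expected frequencies must be ≥ 5 (verified in Step 3)

Step 3: Expected frequencies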

Cell	Calculation	Expected
Drug & Improved	(100 × 120) / 200	60
Drug & Not Improved	(100 × 80) / 200	40
Placebo & Improved	(100 × 120) / 200	60
Placebo & Not Improved	(100 × 80) / 200	40

All expected frequencies ≥ 5

Step 4: Calculate χ²

Cell	O	E	(O-E)²/E
Drug & Improved	70	60	1.667
Drug & Not Improved	30	40	2.500
Placebo & Improved	50	60	1.667
Placebo & Not Improved	50	40	2.500
Total			8.333

χ² = 8.333 (computed from the unrounded terms; adding the rounded values in the table gives 8.334)

Step 5: Degrees of freedom

df = (2 - 1)(2 - 1) = 1 × 1 = 1

Step 6: Critical value

With df = 1 and α = 0.01, critical value = 6.635

Since 8.333 > 6.635, we REJECT H₀

(p-value ≈ 0.0039 < 0.01)

Step 7: Conclusion

At the 0.01 significance level, there is sufficient evidence to conclude that treatment type and improvement are associated. The drug appears to be more effective than the placebo: 70% of the drug group improved compared with 50% of the placebo group.
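As a cross-check on this example, the sketch below uses a shortcut that is not part of the lesson's procedure: for a 2×2 table with cells a, b, c, d, the sum over the four cells simplifies algebraically to n(ad - bc)² / [(a + b)(c + d)(a + c)(b + d)]. The variable names are our own, and the p-value line uses a closed form that holds only for df = 1.

```python
import math

a, b = 70, 30   # Drug:    improved, not improved
c, d = 50, 50   # Placebo: improved, not improved
n = a + b + c + d

# 2 x 2 shortcut, algebraically equal to summing (O - E)^2 / E over the 4 cells.
chi_sq = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Right-tail p-value for df = 1 only.
p_value = math.erfc(math.sqrt(chi_sq / 2))

# Improvement rate in each group, to show the direction of the association.
drug_rate = a / (a + b)
placebo_rate = c / (c + d)

print(f"chi-square = {chi_sq:.3f}, p-value = {p_value:.4f}")
print(f"improved: drug {drug_rate:.0%}, placebo {placebo_rate:.0%}")
```

The test itself only says that an association exists; comparing the two improvement rates is what shows which group did better.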

Check Your Understanding

Question: What would be the degrees of freedom for a contingency table with 4 rows and 5 columns?

Answer: 12

Explanation: df = (r - 1)(c - 1) = (4 - 1)(5 - 1) = 3 × 4 = 12. We subtract 1 from both the number of rows and columns, then multiply.

Interpreting Results

What Does the Test Tell Us?

If we REJECT H₀ (χ² is large, p-value is small):

There is statistically significant evidence that the two variables are associated (dependent)
Knowing one variable gives information about the other
The observed pattern differs significantly from what we'd expect if variables were independent

If we FAIL TO REJECT H₀ (χ² is small, p-value is large):

Insufficient evidence of association
The data is consistent with independence
We cannot conclude the variables are related

Important Note: Association ≠ Causation! Even if variables are associated, this doesn't prove one causes the other.

Key Takeaways

Remember These Points

Purpose: Test if two categorical variables are independent or associated
Study design: One sample, two variables
Expected frequency: E = (Row Total × Column Total) / Grand Total
Test statistic: χ² = Σ[(O - E)² / E] summed over ALL cells
Degrees of freedom: df = (r - 1)(c - 1)
Condition: All expected frequencies must be ≥ 5
Always right-tailed: Large χ² indicates association
Interpretation: Reject H₀ = evidence that the variables are associated; fail to reject H₀ = insufficient evidence of association (this does not prove independence)
