Lesson 2: Chi-Square Test of Independence
Testing Whether Two Categorical Variables Are Associated
Learning Objectives
By the end of this lesson, you will be able to:
- Understand when to use the chi-square test of independence
- Create and interpret contingency tables (two-way tables)
- Calculate expected frequencies for independence tests
- Determine degrees of freedom for contingency tables
- Perform a complete chi-square test of independence
- Interpret results about the relationship between two categorical variables
What is Independence?
In statistics, two categorical variables are independent if knowing the category of one variable doesn't give us any information about the other variable.
Independent Variables
No association or relationship
Example: Hair color and favorite ice cream flavor are likely independent - knowing someone's hair color doesn't help predict their ice cream preference.
Dependent (Associated) Variables
There IS an association
Example: Education level and income are likely dependent - higher education tends to be associated with higher income.
When to Use Test of Independence
- Two categorical variables
- One sample classified by both variables
- Purpose: Test if the two variables are independent or associated
Research question pattern: "Is there a relationship between [variable 1] and [variable 2]?"
Contingency Tables (Two-Way Tables)
A contingency table (also called a two-way table) displays the frequency distribution of two categorical variables:
| Variable 1 ↓ / Variable 2 → | Category A | Category B | Row Total |
|---|---|---|---|
| Category 1 | Count | Count | Total |
| Category 2 | Count | Count | Total |
| Column Total | Total | Total | Grand Total |
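As a rough illustration of how such a table comes about, the sketch below tallies raw observations into a two-way table using only Python's standard library. The function name `build_contingency_table` and the hair color and ice cream records are made up for this illustration.

```python
from collections import Counter

def build_contingency_table(records):
    """Tally (variable 1, variable 2) pairs into a two-way table of counts."""
    counts = Counter(records)
    row_labels = sorted({r for r, _ in records})
    col_labels = sorted({c for _, c in records})
    # A Counter returns 0 for pairs that never occur, so every cell is filled.
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table

# Hypothetical raw data: one (hair color, flavor) pair per person surveyed.
records = [
    ("brown", "vanilla"), ("brown", "chocolate"), ("blond", "vanilla"),
    ("brown", "vanilla"), ("blond", "chocolate"), ("blond", "chocolate"),
]

row_labels, col_labels, table = build_contingency_table(records)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

print("Columns:", col_labels)
for label, row, total in zip(row_labels, table, row_totals):
    print(label, row, "row total:", total)
print("Column totals:", col_totals, "grand total:", grand_total)
```

Each person contributes to exactly one cell, which is why the row totals and the column totals both add up to the grand total.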
Step-by-Step Procedure
Step 1: State the Hypotheses
H₀ (Null Hypothesis): The two variables are independent (no association)
Hₐ (Alternative Hypothesis): The two variables are dependent (there is an association)
Step 2: Check Conditions
- Random sampling
- Independent observations
- All expected frequencies ≥ 5
Step 3: Calculate Expected Frequencies
Expected Frequency Formula for Each Cell:
E = (Row Total × Column Total) / Grand Total
Calculate this for EVERY cell in the table
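The expected count for a cell is its row total times its column total, divided by the grand total. A minimal sketch of that calculation follows; the function name `expected_frequencies` and the counts are invented for illustration.

```python
def expected_frequencies(observed):
    """Expected counts under independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # E = (row total * column total) / grand total, for every cell
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

observed = [[30, 20], [10, 40]]  # made-up counts, totals excluded
expected = expected_frequencies(observed)

for row in expected:
    print(row)
```

Expected counts need not be whole numbers, but each row and column of the expected table has the same total as the observed table.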
Step 4: Calculate the Test Statistic
Chi-Square Test Statistic:
χ² = Σ[(O - E)² / E]
Where: O = observed frequency, E = expected frequency
Sum over ALL cells in the contingency table
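The statistic adds up (O - E)² / E over every cell. The sketch below shows that sum for a small table; the name `chi_square_statistic` and the counts are assumptions made for illustration.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over all cells of two equally shaped tables."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

observed = [[30, 20], [10, 40]]          # made-up observed counts
expected = [[20.0, 30.0], [20.0, 30.0]]  # expected counts for that table

stat = chi_square_statistic(observed, expected)
print(round(stat, 3))
```

Dividing by E scales each squared difference, so a gap of 10 counts matters more in a cell where only 20 were expected than in one where 30 were expected.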
Step 5: Find Degrees of Freedom
Degrees of Freedom:
df = (r - 1)(c - 1)
Where: r = number of rows, c = number of columns (not counting the total row and total column)
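Degrees of freedom depend only on the shape of the table: df = (r - 1)(c - 1). A small sketch, assuming the table is stored as a list of rows without the totals (the function name is hypothetical):

```python
def degrees_of_freedom(table):
    """df = (rows - 1) * (columns - 1); totals must not be part of the table."""
    n_rows = len(table)
    n_cols = len(table[0])
    return (n_rows - 1) * (n_cols - 1)

two_by_three = [[0] * 3 for _ in range(2)]   # placeholder 2 x 3 table
four_by_five = [[0] * 5 for _ in range(4)]   # placeholder 4 x 5 table

print(degrees_of_freedom(two_by_three))
print(degrees_of_freedom(four_by_five))
```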
Step 6: Find the p-value or Critical Value
Use a chi-square distribution table or technology. The test is always right-tailed: the p-value is the area to the right of χ².
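If no table is at hand, the right-tail probability can be computed directly. The sketch below is one way to do it with the standard library, using a recurrence for the chi-square upper tail that holds for whole-number degrees of freedom; `chi2_sf` is a hypothetical helper name, and a statistics package (for example `scipy.stats.chi2.sf`) would normally be used instead.

```python
import math

def chi2_sf(x, df):
    """Right-tail probability P(chi-square with df degrees of freedom > x)."""
    if df < 1 or df != int(df):
        raise ValueError("df must be a positive whole number")
    if x <= 0:
        return 1.0
    y = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # df = 2: tail is exp(-x/2)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # df = 1: tail is erfc(sqrt(x/2))
    # Each pass raises the degrees of freedom by 2 until df is reached.
    while a < df / 2:
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return q

# Table critical values should give back (approximately) their alpha levels.
print(round(chi2_sf(5.991, 2), 3))
print(round(chi2_sf(6.635, 1), 3))
```

Comparing the p-value to α and comparing χ² to the critical value always lead to the same decision.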
Step 7: Make a Conclusion
State your conclusion about whether the variables are independent or associated
Complete Example: Gender and Political Party
Example 1: Gender and Political Affiliation
Problem: A political scientist surveys 500 randomly selected voters and records their gender and political party preference. The data is shown below:
| Gender | Democrat | Republican | Independent | Row Total |
|---|---|---|---|---|
| Male | 120 | 90 | 40 | 250 |
| Female | 100 | 110 | 40 | 250 |
| Column Total | 220 | 200 | 80 | 500 |
Question: At the α = 0.05 significance level, is there evidence that gender and political party preference are associated?
Solution:
Step 1: State the hypotheses
- H₀: Gender and political party preference are independent
- Hₐ: Gender and political party preference are dependent (associated)
Step 2: Check conditions
- Random sample of 500 voters
- Observations are independent
- We'll check if all expected frequencies ≥ 5
Step 3: Calculate expected frequencies
For each cell: E = (Row Total × Column Total) / Grand Total
Expected frequency calculations:
- Male & Democrat: E = (250 × 220) / 500 = 110
- Male & Republican: E = (250 × 200) / 500 = 100
- Male & Independent: E = (250 × 80) / 500 = 40
- Female & Democrat: E = (250 × 220) / 500 = 110
- Female & Republican: E = (250 × 200) / 500 = 100
- Female & Independent: E = (250 × 80) / 500 = 40
| Gender | Democrat | Republican | Independent |
|---|---|---|---|
| Male | O=120, E=110 | O=90, E=100 | O=40, E=40 |
| Female | O=100, E=110 | O=110, E=100 | O=40, E=40 |
All expected frequencies are ≥ 5 (the smallest is 40), so the condition is met.
Step 4: Calculate the chi-square test statistic
χ² = Σ[(O - E)² / E] for all 6 cells:
| Cell | O | E | (O - E)²/E |
|---|---|---|---|
| Male & Democrat | 120 | 110 | 0.909 |
| Male & Republican | 90 | 100 | 1.000 |
| Male & Independent | 40 | 40 | 0.000 |
| Female & Democrat | 100 | 110 | 0.909 |
| Female & Republican | 110 | 100 | 1.000 |
| Female & Independent | 40 | 40 | 0.000 |
| Total χ² | | | 3.818 |
χ² = 3.818
Step 5: Find degrees of freedom
df = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 1 × 2 = 2
Step 6: Find p-value or critical value
Using a chi-square table with df = 2 and α = 0.05:
Critical value = 5.991
Since our χ² = 3.818 < 5.991, we fail to reject H₀
(Alternatively, p-value ≈ 0.148 > 0.05)
Step 7: Conclusion
At the 0.05 significance level, there is insufficient evidence to conclude that gender and political party preference are associated. The data are consistent with the two variables being independent.
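The arithmetic in Example 1 can be checked with a short script. This is a sketch under the assumption that the table is entered as a list of rows without totals; the function name `chi_square_independence` is invented here, and the p-value line relies on the fact that for df = 2 the right-tail probability is exactly exp(-χ²/2).

```python
import math

def chi_square_independence(observed):
    """Return (chi-square statistic, df, expected counts) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df, expected

# Rows: Male, Female. Columns: Democrat, Republican, Independent.
observed = [[120, 90, 40], [100, 110, 40]]
stat, df, expected = chi_square_independence(observed)

# Condition check: every expected count should be at least 5.
condition_met = all(e >= 5 for row in expected for e in row)

p_value = math.exp(-stat / 2)   # valid only because df = 2 here
critical_value = 5.991          # table value for df = 2, alpha = 0.05

print("chi-square:", round(stat, 3), "df:", df)
print("p-value:", round(p_value, 3))
print("reject H0:", stat > critical_value)
```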
Check Your Understanding
Question: In the political party example, what does it mean practically that we failed to reject the null hypothesis?
Answer: We don't have enough evidence to say gender and political party are related.
Explanation: Failing to reject H₀ means we don't have sufficient evidence of an association. It does NOT prove independence - it just means our data is consistent with independence. The distinction is important: we never "prove" the null hypothesis, we simply lack evidence against it.
Another Example: 2×2 Table
Example 2: Treatment and Outcome
Problem: A medical researcher randomly assigns 200 patients with migraines to either a new drug treatment or a placebo. After one month, they record whether each patient experienced significant improvement:
| Treatment | Improved | Not Improved | Row Total |
|---|---|---|---|
| Drug | 70 | 30 | 100 |
| Placebo | 50 | 50 | 100 |
| Column Total | 120 | 80 | 200 |
Question: At α = 0.01, is there evidence that treatment type and improvement are associated?
Complete Solution:
Step 1: Hypotheses
- H₀: Treatment type and improvement are independent
- Hₐ: Treatment type and improvement are associated
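Step 2: Check conditions
- Patients were randomly assigned to the drug or the placebo
- Observations are independent
- We'll check that all expected frequencies are ≥ 5
Step 3: Expected frequencies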
| Cell | Calculation | Expected |
|---|---|---|
| Drug & Improved | (100 × 120) / 200 | 60 |
| Drug & Not Improved | (100 × 80) / 200 | 40 |
| Placebo & Improved | (100 × 120) / 200 | 60 |
| Placebo & Not Improved | (100 × 80) / 200 | 40 |
All expected frequencies are ≥ 5 (the smallest is 40), so the condition is met.
Step 4: Calculate χ²
| Cell | O | E | (O-E)²/E |
|---|---|---|---|
| Drug & Improved | 70 | 60 | 1.667 |
| Drug & Not Improved | 30 | 40 | 2.500 |
| Placebo & Improved | 50 | 60 | 1.667 |
| Placebo & Not Improved | 50 | 40 | 2.500 |
| Total | | | 8.333 |
χ² ≈ 8.333 (the rounded cell values above add to 8.334; the unrounded total is 25/3 ≈ 8.333)
Step 5: Degrees of freedom
df = (2 - 1)(2 - 1) = 1 × 1 = 1
Step 6: Critical value
With df = 1 and α = 0.01, critical value = 6.635
Since 8.333 > 6.635, we REJECT H₀
(p-value ≈ 0.0039 < 0.01)
Step 7: Conclusion
At the 0.01 significance level, there is sufficient evidence to conclude that treatment type and improvement are associated. The drug appears to be more effective than the placebo.
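Example 2 can be verified the same way. The sketch below lists each cell's contribution to χ² and computes the p-value; it assumes df = 1, where the right-tail probability equals erfc(√(χ²/2)). The variable names are chosen for illustration only.

```python
import math

rows = ["Drug", "Placebo"]
cols = ["Improved", "Not Improved"]
observed = [[70, 30], [50, 50]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Contribution of each cell: (O - E)^2 / E, with E from the totals.
contributions = {}
for i, row_name in enumerate(rows):
    for j, col_name in enumerate(cols):
        e = row_totals[i] * col_totals[j] / grand_total
        contributions[(row_name, col_name)] = (observed[i][j] - e) ** 2 / e

stat = sum(contributions.values())
p_value = math.erfc(math.sqrt(stat / 2))   # valid only because df = 1 here
critical_value = 6.635                     # table value for df = 1, alpha = 0.01

# Improvement rate in each group shows the direction of the association.
rates = {name: observed[i][0] / row_totals[i] for i, name in enumerate(rows)}

for cell, value in contributions.items():
    print(cell, round(value, 3))
print("chi-square:", round(stat, 3), "p-value:", round(p_value, 4))
print("reject H0:", stat > critical_value, "improvement rates:", rates)
```

The test itself only says the variables are associated; the improvement rates (70% on the drug versus 50% on the placebo) are what show which group did better.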
Check Your Understanding
Question: What would be the degrees of freedom for a contingency table with 4 rows and 5 columns?
Answer: 12
Explanation: df = (r - 1)(c - 1) = (4 - 1)(5 - 1) = 3 × 4 = 12. We subtract 1 from both the number of rows and columns, then multiply.
Interpreting Results
What Does the Test Tell Us?
If we REJECT H₀ (χ² is large, p-value is small):
- There is evidence that the two variables are associated (dependent)
- Knowing one variable gives information about the other
- The observed pattern differs significantly from what we'd expect if variables were independent
If we FAIL TO REJECT H₀ (χ² is small, p-value is large):
- Insufficient evidence of association
- The data is consistent with independence
- We cannot conclude the variables are related
Important Note: Association ≠ Causation! Even if variables are associated, this doesn't prove one causes the other.
Key Takeaways
Remember These Points
- Purpose: Test if two categorical variables are independent or associated
- Study design: One sample, two variables
- Expected frequency: E = (Row Total × Column Total) / Grand Total
- Test statistic: χ² = Σ[(O - E)² / E] summed over ALL cells
- Degrees of freedom: df = (r - 1)(c - 1)
- Condition: All expected frequencies must be ≥ 5
- Always right-tailed: Large χ² indicates association
- Interpretation: Rejecting H₀ = evidence that the variables are associated; failing to reject H₀ = insufficient evidence of association (not proof of independence)