Lesson 1: Introduction to Chi-Square & Goodness of Fit
Understanding the Chi-Square Distribution and Testing if Data Fits an Expected Distribution
Learning Objectives
By the end of this lesson, you will be able to:
- Understand the properties of the chi-square (χ²) distribution
- Distinguish between categorical and quantitative data
- Perform a chi-square goodness of fit test
- Calculate expected frequencies and the chi-square test statistic
- Determine degrees of freedom for goodness of fit tests
- Interpret results and draw conclusions about whether data fits an expected distribution
Review: Categorical vs. Quantitative Data
Before diving into chi-square tests, let's review a fundamental distinction:
Categorical Data
Categories or groups
Examples: Color (red, blue, green), Political party, Gender, Yes/No responses
Chi-square tests analyze this type!
Quantitative Data
Numerical measurements
Examples: Height, weight, test scores, temperature
Use t-tests, ANOVA, regression for this type
Key Concept
Chi-square tests are designed specifically for analyzing categorical data. When you have counts or frequencies in different categories, chi-square tests help you determine if the observed pattern differs significantly from what you'd expect.
The Chi-Square (χ²) Distribution
The chi-square distribution is a family of probability distributions that forms the foundation for chi-square tests. Understanding its properties helps us interpret test results correctly.
Always Positive
χ² ≥ 0
Values cannot be negative
Right-Skewed
Tail extends to the right
Most values fall at the low end, around df
Shape Varies
Determined by degrees of freedom (df)
More df → more symmetric
Figure: Chi-Square Distribution Shapes. The distribution becomes more symmetric as degrees of freedom increase.
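To put numbers on these shape properties, the short Python sketch below tabulates three standard facts about a chi-square distribution with df degrees of freedom: mean = df, variance = 2 × df, and skewness = √(8/df). These formulas are standard results rather than something stated in this lesson, and the function name `chi2_shape` is only an illustrative choice.

```python
import math

def chi2_shape(df):
    """Return (mean, variance, skewness) of a chi-square distribution."""
    if df < 1:
        raise ValueError("df must be at least 1")
    mean = df                      # the mean equals the degrees of freedom
    variance = 2 * df              # the variance is twice the degrees of freedom
    skewness = math.sqrt(8 / df)   # positive, so the long tail is on the right
    return mean, variance, skewness

for df in (1, 2, 5, 10, 30):
    mean, variance, skewness = chi2_shape(df)
    print(f"df={df:>2}  mean={mean:>2}  variance={variance:>2}  skewness={skewness:.2f}")
```

The skewness stays positive for every df but shrinks as df grows, which is the "more df → more symmetric" behavior described above.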
Chi-Square Goodness of Fit Test
The chi-square goodness of fit test determines whether sample data fits an expected or hypothesized distribution.
When to Use Goodness of Fit Test
- One categorical variable
- One sample
- Purpose: Test if the observed frequencies match expected frequencies
Research question pattern: "Does the distribution of [variable] match [expected distribution]?"
Step-by-Step Procedure
Step 1: State the Hypotheses
H₀ (Null Hypothesis): The data follows the specified distribution
Hₐ (Alternative Hypothesis): The data does NOT follow the specified distribution
Step 2: Check Conditions
- Random sampling
- Independent observations
- All expected frequencies ≥ 5 (crucial condition!)
Step 3: Calculate Expected Frequencies
Expected Frequency Formula:
E = n × p
Where: n = total sample size, p = expected proportion for that category
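As a rough sketch of Steps 2 and 3 together, the Python below computes E = n × p for each category and then checks the "all expected frequencies ≥ 5" condition. The function names `expected_counts` and `meets_count_condition` are illustrative, and the second call uses a made-up sample size of 24 purely to show a failing case.

```python
def expected_counts(n, proportions):
    """Expected frequency for each category: E = n * p."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("expected proportions must sum to 1")
    return [n * p for p in proportions]

def meets_count_condition(expected, minimum=5):
    """True when every expected frequency is at least `minimum`."""
    return all(e >= minimum for e in expected)

# Fair six-sided die rolled 120 times (the worked example later in this lesson).
fair_die = [1 / 6] * 6
expected = expected_counts(120, fair_die)
print([round(e, 2) for e in expected], meets_count_condition(expected))

# Hypothetical smaller sample: only 24 rolls gives E = 4 per face, which is too few.
small = expected_counts(24, fair_die)
print([round(e, 2) for e in small], meets_count_condition(small))
```

Rounding is applied only for display, because floating-point proportions such as 1/6 are not stored exactly.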
Step 4: Calculate the Test Statistic
Chi-Square Test Statistic:
χ² = Σ[(O - E)² / E]
Where: O = observed frequency, E = expected frequency
Sum over all categories
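The statistic χ² = Σ[(O - E)² / E] translates almost directly into code. The sketch below is one minimal way to write it; the function name `chi_square_statistic` and the three-category counts are hypothetical, chosen only to show how the value behaves.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need one entry per category")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: a perfect fit gives 0, and any mismatch pushes the value up.
print(chi_square_statistic([20, 20, 20], [20, 20, 20]))   # 0.0
print(chi_square_statistic([25, 20, 15], [20, 20, 20]))   # 2.5
```

Because each deviation is squared, the statistic can never be negative, and deviations in either direction count against the fit.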
Step 5: Find Degrees of Freedom
Degrees of Freedom:
df = k - 1
Where: k = number of categories
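One way to see why the rule is df = k - 1: the observed and expected counts share the same total, so the deviations O - E must sum to zero, and once k - 1 of them are known the last one is fixed. The sketch below illustrates this with the die counts from the worked example later in this lesson; the variable names are illustrative.

```python
# Die counts from the worked example later in this lesson.
observed = [15, 18, 22, 17, 26, 22]
expected = [20] * 6

deviations = [o - e for o, e in zip(observed, expected)]
print(deviations)        # the six deviations O - E
print(sum(deviations))   # 0, because both rows total 120

# Since the deviations sum to 0, the last one is determined by the other five.
last_from_others = -sum(deviations[:-1])
df = len(observed) - 1
print(last_from_others == deviations[-1], df)
```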
Step 6: Find the p-value or Critical Value
Use a chi-square distribution table or technology with the df from Step 5
Decision rule: Reject H₀ if χ² > critical value OR if p-value < α
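For the "technology" route, the sketch below computes a right-tail p-value using only Python's standard library. It relies on a standard recurrence for the chi-square tail probability with integer df (not something derived in this lesson), and the names `chi2_sf` and `decide` are illustrative. In practice a statistics library function such as `scipy.stats.chi2.sf` would normally be used instead.

```python
import math

def chi2_sf(x, df):
    """P(chi-square with `df` degrees of freedom >= x), for integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting points: df = 1 (odd case) or df = 2 (even case).
    if df % 2 == 1:
        k, tail = 1, math.erfc(math.sqrt(half))
    else:
        k, tail = 2, math.exp(-half)
    # Each step from k to k + 2 adds (x/2)^(k/2) * e^(-x/2) / Gamma(k/2 + 1).
    while k < df:
        tail += half ** (k / 2) * math.exp(-half) / math.gamma(k / 2 + 1)
        k += 2
    return tail

def decide(statistic, df, alpha=0.05):
    """Apply the decision rule: reject H0 when the p-value is below alpha."""
    p_value = chi2_sf(statistic, df)
    return p_value, ("reject H0" if p_value < alpha else "fail to reject H0")

print(decide(4.10, 5))   # statistic from the die example later in this lesson
print(decide(15.0, 5))   # hypothetical larger statistic, for contrast
```

Only the right tail is used, since a large χ² signals a poor fit.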
Step 7: Make a Conclusion
State your conclusion in context of the problem
Complete Example: Testing a Fair Die
Example 1: Is the Die Fair?
Problem: A gambler suspects a six-sided die might be unfair. They roll the die 120 times and record the outcomes:
| Outcome | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Observed Frequency | 15 | 18 | 22 | 17 | 26 | 22 |
Question: At the α = 0.05 significance level, is there evidence that the die is unfair?
Solution:
Step 1: State the hypotheses
- H₀: The die is fair (all outcomes are equally likely, p = 1/6 for each)
- Hₐ: The die is not fair (outcomes are not equally likely)
Step 2: Check conditions
- Assume the rolls are independent
- We'll confirm in Step 3 that all expected frequencies are ≥ 5
Step 3: Calculate expected frequencies
For a fair die, each outcome should appear 1/6 of the time:
E = n × p = 120 × (1/6) = 20 for each outcome
| Outcome | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Observed (O) | 15 | 18 | 22 | 17 | 26 | 22 |
| Expected (E) | 20 | 20 | 20 | 20 | 20 | 20 |
All expected frequencies are 20 ≥ 5, so the condition is met!
Step 4: Calculate the chi-square test statistic
χ² = Σ[(O - E)² / E]
| Outcome | O | E | O - E | (O - E)² | (O - E)²/E |
|---|---|---|---|---|---|
| 1 | 15 | 20 | -5 | 25 | 1.25 |
| 2 | 18 | 20 | -2 | 4 | 0.20 |
| 3 | 22 | 20 | 2 | 4 | 0.20 |
| 4 | 17 | 20 | -3 | 9 | 0.45 |
| 5 | 26 | 20 | 6 | 36 | 1.80 |
| 6 | 22 | 20 | 2 | 4 | 0.20 |
| Total | 120 | 120 | | | 4.10 |
χ² = 4.10
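The table above can be reproduced with a few lines of Python. This is just a sketch using the die counts from the problem; the variable names are illustrative. It also reports which face contributes most to the statistic.

```python
outcomes = [1, 2, 3, 4, 5, 6]
observed = [15, 18, 22, 17, 26, 22]
expected = [20] * 6   # 120 rolls spread evenly over 6 faces

# One contribution (O - E)^2 / E per face.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]

for face, o, e, c in zip(outcomes, observed, expected, contributions):
    print(f"{face} | {o} | {e} | {o - e:+d} | {(o - e) ** 2} | {c:.2f}")

statistic = sum(contributions)
largest_face = outcomes[contributions.index(max(contributions))]
print(f"chi-square = {statistic:.2f}, largest contribution from outcome {largest_face}")
```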
Step 5: Find degrees of freedom
df = k - 1 = 6 - 1 = 5
Step 6: Find p-value or critical value
Using a chi-square table with df = 5 and α = 0.05:
Critical value = 11.07
Since our χ² = 4.10 < 11.07, we fail to reject H₀
(Alternatively, p-value ≈ 0.535 > 0.05, so we fail to reject H₀)
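As an informal cross-check on the table-based result, the p-value can also be estimated by simulation: roll a fair die 120 times, repeat many times, and count how often the simulated χ² is at least as large as the observed one. This is a swapped-in technique, not the method this lesson teaches, and the function names, seed, and number of simulations below are all arbitrary choices.

```python
import random
from collections import Counter

def chi_square_equal(counts, expected):
    """Chi-square statistic when every category has the same expected count."""
    return sum((c - expected) ** 2 for c in counts) / expected

def simulated_p_value(observed, n_sims=10_000, seed=0):
    """Share of simulated fair-die experiments at least as extreme as the data."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    n, k = sum(observed), len(observed)
    expected = n / k
    target = chi_square_equal(observed, expected)
    hits = 0
    for _ in range(n_sims):
        rolls = Counter(rng.choices(range(k), k=n))
        counts = [rolls.get(face, 0) for face in range(k)]
        # Small tolerance so ties with the observed statistic count as "as extreme".
        if chi_square_equal(counts, expected) >= target - 1e-9:
            hits += 1
    return hits / n_sims

print(simulated_p_value([15, 18, 22, 17, 26, 22]))
```

The estimate should land in the same neighborhood as the chi-square p-value above, well over 0.05, which supports the same decision to fail to reject H₀.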
Step 7: Conclusion
At the 0.05 significance level, there is insufficient evidence to conclude that the die is unfair. The observed frequencies are consistent with a fair die.
Check Your Understanding
Question: In the die example above, which category contributed most to the chi-square statistic?
Answer: Outcome 5 (contributed 1.80)
Explanation: Outcome 5 had the largest deviation from expected (26 observed vs 20 expected), which resulted in the largest contribution (1.80) to the total chi-square statistic. The larger the contribution, the more that category differs from what we'd expect.
Another Example: M&M Colors
Example 2: M&M Color Distribution
Problem: According to Mars, Inc., M&M's should be distributed by color as follows:
- Brown: 13%
- Yellow: 14%
- Red: 13%
- Blue: 24%
- Orange: 20%
- Green: 16%
A bag of 200 M&M's contains:
| Color | Brown | Yellow | Red | Blue | Orange | Green |
|---|---|---|---|---|---|---|
| Observed | 20 | 32 | 28 | 45 | 42 | 33 |
Question: At α = 0.05, does this bag's distribution differ from the claimed distribution?
Complete Solution:
Step 1: Hypotheses
- H₀: The color distribution matches the claimed percentages
- Hₐ: The color distribution does NOT match the claimed percentages
Step 2: Check conditions
- Assume the bag is a random sample of M&M's and the observations are independent
- The expected frequencies are checked in Step 3
Step 3: Expected frequencies
| Color | Proportion (p) | Expected (E = 200 × p) |
|---|---|---|
| Brown | 0.13 | 26 |
| Yellow | 0.14 | 28 |
| Red | 0.13 | 26 |
| Blue | 0.24 | 48 |
| Orange | 0.20 | 40 |
| Green | 0.16 | 32 |
All expected frequencies are ≥ 5 (the smallest is 26), so the condition is met
Step 4: Calculate χ²
| Color | O | E | (O-E)²/E |
|---|---|---|---|
| Brown | 20 | 26 | 1.385 |
| Yellow | 32 | 28 | 0.571 |
| Red | 28 | 26 | 0.154 |
| Blue | 45 | 48 | 0.188 |
| Orange | 42 | 40 | 0.100 |
| Green | 33 | 32 | 0.031 |
| Total | | | 2.429 |
χ² = 2.429
Step 5: Degrees of freedom
df = 6 - 1 = 5
Step 6: Critical value
With df = 5 and α = 0.05, critical value = 11.07
Since 2.429 < 11.07, we fail to reject H₀
(p-value ≈ 0.787)
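The whole M&M calculation can be sketched as a single function that follows the critical-value route. The function name `goodness_of_fit` and the dictionary `CRITICAL_05` are illustrative; the dictionary holds right-tail critical values at α = 0.05 copied from a standard chi-square table for df = 1 through 8 only.

```python
# Right-tail critical values at alpha = 0.05 from a standard chi-square table.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
               6: 12.592, 7: 14.067, 8: 15.507}

def goodness_of_fit(observed, proportions):
    """Return (statistic, df, reject) for a test at alpha = 0.05."""
    n = sum(observed)
    expected = [n * p for p in proportions]
    if min(expected) < 5:
        raise ValueError("every expected frequency should be at least 5")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return statistic, df, statistic > CRITICAL_05[df]

# Order: Brown, Yellow, Red, Blue, Orange, Green.
claimed = [0.13, 0.14, 0.13, 0.24, 0.20, 0.16]
observed = [20, 32, 28, 45, 42, 33]

statistic, df, reject = goodness_of_fit(observed, claimed)
print(f"chi-square = {statistic:.3f}, df = {df}, reject H0: {reject}")
```

A lookup table keeps the sketch dependency-free, at the cost of supporting only the df values and the single α level that were typed in.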
Step 7: Conclusion
At the 0.05 significance level, there is insufficient evidence to conclude that the color distribution differs from the claimed distribution. The bag's colors are consistent with Mars' stated percentages.
Check Your Understanding
Question: A researcher wants to test if births are evenly distributed across the 7 days of the week. They collect data on 350 births. What are the expected frequencies if births are truly evenly distributed?
Answer: 50 births per day
Explanation: If births are evenly distributed across 7 days, each day should have the same proportion (1/7). Expected frequency = n × p = 350 × (1/7) = 50 births per day. This is what we'd expect under the null hypothesis of equal distribution.
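The same arithmetic in Python, as a small sketch. Using `Fraction` for the proportion 1/7 keeps the calculation exact, since 1/7 has no exact floating-point representation. The day labels and variable names are illustrative.

```python
from fractions import Fraction

n_births = 350
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Under H0 each day has proportion 1/7; Fraction keeps the arithmetic exact.
p = Fraction(1, len(days))
expected = {day: n_births * p for day in days}

print(expected["Mon"])           # expected births on any single day
print(sum(expected.values()))    # expected counts add back up to n
print(len(days) - 1)             # degrees of freedom for the test
```

Every expected count is 50, comfortably above the minimum of 5, and the test would use df = 7 - 1 = 6.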
Key Takeaways
Remember These Points
- Purpose: Goodness of fit tests whether observed frequencies match expected frequencies
- Formula: χ² = Σ[(O - E)² / E]
- Degrees of freedom: df = k - 1 (number of categories minus 1)
- Expected frequency: E = n × p (sample size × proportion)
- Condition: All expected frequencies must be ≥ 5
- Rejection region: Always right-tailed (large χ² = poor fit)
- Interpretation: Large χ² means observed differs from expected; small χ² means good fit