Safaa Dabagh

Lesson 1: Introduction to Chi-Square & Goodness of Fit

Understanding the Chi-Square Distribution and Testing if Data Fits an Expected Distribution

Learning Objectives

By the end of this lesson, you will be able to:

Review: Categorical vs. Quantitative Data

Before diving into chi-square tests, let's review a fundamental distinction:

Categorical Data

Categories or groups

Examples: Color (red, blue, green), Political party, Gender, Yes/No responses

Chi-square tests analyze this type!

Quantitative Data

Numerical measurements

Examples: Height, weight, test scores, temperature

Use t-tests, ANOVA, regression for this type

Key Concept

Chi-square tests are designed specifically for analyzing categorical data. When you have counts or frequencies in different categories, chi-square tests help you determine if the observed pattern differs significantly from what you'd expect.

The Chi-Square (χ²) Distribution

The chi-square distribution is a family of probability distributions that forms the foundation for chi-square tests. Understanding its properties helps us interpret test results correctly.

Always Positive

χ² ≥ 0

Values cannot be negative

Right-Skewed

Tail extends to the right

For small df, most values fall near zero

Shape Varies

Determined by degrees of freedom (df)

More df → more symmetric

Chi-Square Distribution Shapes

The distribution becomes more symmetric as degrees of freedom increase
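The link between degrees of freedom and shape can be illustrated with the distribution's standard moments: a chi-square variable with df degrees of freedom has mean df, variance 2 × df, and skewness √(8/df). The sketch below is only an illustration; the helper name chi_square_moments is hypothetical and not part of the lesson.

```python
import math

def chi_square_moments(df):
    """Return (mean, variance, skewness) of a chi-square distribution."""
    if df <= 0:
        raise ValueError("df must be positive")
    return df, 2 * df, math.sqrt(8 / df)

# Skewness shrinks toward 0 as df grows, i.e. the curve becomes more symmetric.
for df in (1, 2, 5, 10, 50):
    mean, variance, skewness = chi_square_moments(df)
    print(f"df={df:>2}  mean={mean:>2}  variance={variance:>3}  skewness={skewness:.2f}")
```

The skewness never reaches zero, so the distribution stays right-skewed for every df; it only becomes less so.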

Chi-Square Goodness of Fit Test

The chi-square goodness of fit test determines whether sample data fits an expected or hypothesized distribution.

When to Use Goodness of Fit Test

Research question pattern: "Does the distribution of [variable] match [expected distribution]?"

Step-by-Step Procedure

Step 1: State the Hypotheses

H₀ (Null Hypothesis): The data follows the specified distribution

Hₐ (Alternative Hypothesis): The data does NOT follow the specified distribution

Step 2: Check Conditions

Step 3: Calculate Expected Frequencies

Expected Frequency Formula:

E = n × p

Where: n = total sample size, p = expected proportion for that category
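As a minimal sketch of the formula E = n × p, the function below computes one expected count per category. The function name and the sample proportions are hypothetical, chosen only for illustration.

```python
def expected_frequencies(n, proportions):
    """Expected count for each category: E = n * p."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("expected proportions must sum to 1")
    return [n * p for p in proportions]

# Hypothetical example: 100 observations, three categories.
print(expected_frequencies(100, [0.5, 0.3, 0.2]))
```

Expected counts do not need to be whole numbers, so the results are left unrounded.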

Step 4: Calculate the Test Statistic

Chi-Square Test Statistic:

χ² = Σ[(O - E)² / E]

Where: O = observed frequency, E = expected frequency

Sum over all categories
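The statistic can be computed directly from the formula. This is a small sketch, assuming the observed and expected counts are given as two lists in the same category order; the function name and the sample counts are hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for three categories.
observed = [18, 22, 20]
expected = [20, 20, 20]
print(round(chi_square_statistic(observed, expected), 3))
```

Because every term is a squared difference divided by a positive count, the result can never be negative.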

Step 5: Find Degrees of Freedom

Degrees of Freedom:

df = k - 1

Where: k = number of categories

Step 6: Find the p-value or Critical Value

Use a chi-square distribution table or technology with the df from Step 5

Decision rule: Reject H₀ if χ² > critical value OR if p-value < α
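Where a table is not at hand, both parts of the decision rule can be sketched with the Python standard library. The code below assumes an integer df and uses the closed-form tail for df = 1 or 2 plus a standard recurrence for higher df; the critical value is found by bisection. All function names and the example numbers are hypothetical, and a statistics package would normally be used instead.

```python
import math

def chi_square_sf(x, df):
    """Upper-tail area P(X >= x) for a chi-square variable with integer df >= 1."""
    if df < 1 or df != int(df):
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting points: df = 2 (exponential tail) or df = 1 (via erfc).
    if df % 2 == 0:
        k, tail = 2, math.exp(-half)
    else:
        k, tail = 1, math.erfc(math.sqrt(half))
    # Each step adds the term that moves from k to k + 2 degrees of freedom.
    while k < df:
        tail += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return tail

def chi_square_critical_value(alpha, df, tol=1e-9):
    """Value whose upper-tail area equals alpha, located by bisection."""
    lo, hi = 0.0, 1.0
    while chi_square_sf(hi, df) > alpha:  # widen until the tail area drops below alpha
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi_square_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical test settings and statistic.
alpha, df, statistic = 0.05, 3, 9.2
p_value = chi_square_sf(statistic, df)
critical = chi_square_critical_value(alpha, df)
print(f"p-value = {p_value:.4f}, critical value = {critical:.3f}")
print("Reject H0 by p-value:", p_value < alpha)
print("Reject H0 by critical value:", statistic > critical)
```

The two rules always lead to the same decision, because the statistic exceeds the critical value exactly when its upper-tail area is smaller than α.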

Step 7: Make a Conclusion

State your conclusion in context of the problem

Complete Example: Testing a Fair Die

Example 1: Is the Die Fair?

Problem: A gambler suspects a six-sided die might be unfair. They roll the die 120 times and record the outcomes:

Outcome | 1 | 2 | 3 | 4 | 5 | 6
Observed Frequency | 15 | 18 | 22 | 17 | 26 | 22

Question: At the α = 0.05 significance level, is there evidence that the die is unfair?

Solution:

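Step 1: State the hypotheses

  • H₀: The die is fair (each outcome has probability 1/6)
  • Hₐ: The die is not fair (at least one outcome has a probability other than 1/6)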

Step 2: Check conditions

Step 3: Calculate expected frequencies

For a fair die, each outcome should appear 1/6 of the time:

E = n × p = 120 × (1/6) = 20 for each outcome

Outcome | 1 | 2 | 3 | 4 | 5 | 6
Observed (O) | 15 | 18 | 22 | 17 | 26 | 22
Expected (E) | 20 | 20 | 20 | 20 | 20 | 20

All expected frequencies are 20 ≥ 5, so the condition is met!
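The expected-count condition is easy to check programmatically. The sketch below uses the minimum of 5 stated above and returns the positions of any categories that fall short; the function name is hypothetical.

```python
def categories_below_minimum(expected, minimum=5):
    """Return the indices of categories whose expected count is below the minimum."""
    return [i for i, e in enumerate(expected) if e < minimum]

# Fair-die example: 120 rolls spread evenly over 6 outcomes.
die_expected = [120 / 6] * 6
failing = categories_below_minimum(die_expected)
print("Condition met" if not failing else f"Too small at positions {failing}")
```

An empty list means every category meets the condition.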

Step 4: Calculate the chi-square test statistic

χ² = Σ[(O - E)² / E]

Outcome | O | E | O - E | (O - E)² | (O - E)²/E
1 | 15 | 20 | -5 | 25 | 1.25
2 | 18 | 20 | -2 | 4 | 0.20
3 | 22 | 20 | 2 | 4 | 0.20
4 | 17 | 20 | -3 | 9 | 0.45
5 | 26 | 20 | 6 | 36 | 1.80
6 | 22 | 20 | 2 | 4 | 0.20
Total | 120 | 120 |  |  | 4.10

χ² = 4.10

Step 5: Find degrees of freedom

df = k - 1 = 6 - 1 = 5

Step 6: Find p-value or critical value

Using a chi-square table with df = 5 and α = 0.05:

Critical value = 11.07

Since our χ² = 4.10 < 11.07, we fail to reject H₀

(Alternatively, p-value ≈ 0.535 > 0.05, so we fail to reject H₀)

Step 7: Conclusion

At the 0.05 significance level, there is insufficient evidence to conclude that the die is unfair. The observed frequencies are consistent with a fair die.

Check Your Understanding

Question: In the die example above, which category contributed most to the chi-square statistic?

Answer: Outcome 5 (contributed 1.80)

Explanation: Outcome 5 had the largest deviation from expected (26 observed vs 20 expected), which resulted in the largest contribution (1.80) to the total chi-square statistic. The larger the contribution, the more that category differs from what we'd expect.
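The per-category contributions from the die example can be reproduced with a few lines of code. This is a sketch using the observed counts from the example; the variable names are arbitrary.

```python
# Observed counts from the die example; a fair die gives 20 expected per face.
observed = {1: 15, 2: 18, 3: 22, 4: 17, 5: 26, 6: 22}
expected = 120 / 6

# Contribution of each face to the chi-square statistic: (O - E)^2 / E.
contributions = {face: (o - expected) ** 2 / expected for face, o in observed.items()}

largest = max(contributions, key=contributions.get)
total = sum(contributions.values())

for face, value in contributions.items():
    print(f"Outcome {face}: {value:.2f}")
print(f"Largest contributor: outcome {largest}; total chi-square = {total:.2f}")
```

Listing the contributions this way shows where any lack of fit comes from, which the single total cannot show on its own.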

Another Example: M&M Colors

Example 2: M&M Color Distribution

Problem: According to Mars, Inc., M&M's should be distributed by color as follows:

  • Brown: 13%
  • Yellow: 14%
  • Red: 13%
  • Blue: 24%
  • Orange: 20%
  • Green: 16%

A bag of 200 M&M's contains:

Color | Brown | Yellow | Red | Blue | Orange | Green
Observed | 20 | 32 | 28 | 45 | 42 | 33

Question: At α = 0.05, does this bag's distribution differ from the claimed distribution?

Complete Solution:

Step 1: Hypotheses

  • H₀: The color distribution matches the claimed percentages
  • Hₐ: The color distribution does NOT match the claimed percentages

Step 3: Expected frequencies

Color | Proportion (p) | Expected (E = 200 × p)
Brown | 0.13 | 26
Yellow | 0.14 | 28
Red | 0.13 | 26
Blue | 0.24 | 48
Orange | 0.20 | 40
Green | 0.16 | 32

All expected frequencies are ≥ 5, so the condition is met.

Step 4: Calculate χ²

Color | O | E | (O-E)²/E
Brown | 20 | 26 | 1.385
Yellow | 32 | 28 | 0.571
Red | 28 | 26 | 0.154
Blue | 45 | 48 | 0.188
Orange | 42 | 40 | 0.100
Green | 33 | 32 | 0.031
Total |  |  | 2.429

χ² = 2.429

Step 5: Degrees of freedom

df = 6 - 1 = 5

Step 6: Critical value

With df = 5 and α = 0.05, critical value = 11.07

Since 2.429 < 11.07, we fail to reject H₀

(p-value ≈ 0.787)

Step 7: Conclusion

At the 0.05 significance level, there is insufficient evidence to conclude that the color distribution differs from the claimed distribution. The bag's colors are consistent with Mars' stated percentages.
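The whole M&M calculation can be checked end to end. This is a sketch under stated assumptions: the claimed proportions and observed counts are taken from the example, the critical value 11.07 is the table value quoted in Step 6, and the variable names are arbitrary.

```python
# Claimed proportions and observed counts from the M&M example.
claimed = {"Brown": 0.13, "Yellow": 0.14, "Red": 0.13,
           "Blue": 0.24, "Orange": 0.20, "Green": 0.16}
observed = {"Brown": 20, "Yellow": 32, "Red": 28,
            "Blue": 45, "Orange": 42, "Green": 33}

n = sum(observed.values())                         # total candies in the bag
expected = {c: n * p for c, p in claimed.items()}  # E = n * p for each color

# Chi-square statistic: sum of (O - E)^2 / E over all colors.
statistic = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in claimed)

df = len(claimed) - 1
critical_value = 11.07  # table value for df = 5 at alpha = 0.05, as quoted in Step 6
reject = statistic > critical_value

print(f"n = {n}, df = {df}, chi-square = {statistic:.3f}")
print("Reject H0" if reject else "Fail to reject H0")
```

Unlike the die example, the expected counts here differ by category because the claimed proportions are unequal.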

Check Your Understanding

Question: A researcher wants to test if births are evenly distributed across the 7 days of the week. They collect data on 350 births. What are the expected frequencies if births are truly evenly distributed?

Answer: 50 births per day

Explanation: If births are evenly distributed across 7 days, each day should have the same proportion (1/7). Expected frequency = n × p = 350 × (1/7) = 50 births per day. This is what we'd expect under the null hypothesis of equal distribution.

Key Takeaways

Remember These Points
