
Lesson 1: Introduction to Chi-Square & Goodness of Fit

Understanding the Chi-Square Distribution and Testing if Data Fits an Expected Distribution

Learning Objectives

By the end of this lesson, you will be able to:

Understand the properties of the chi-square (χ²) distribution
Distinguish between categorical and quantitative data
Perform a chi-square goodness of fit test
Calculate expected frequencies and the chi-square test statistic
Determine degrees of freedom for goodness of fit tests
Interpret results and draw conclusions about whether data fits an expected distribution

Review: Categorical vs. Quantitative Data

Before diving into chi-square tests, let's review a fundamental distinction:

Categorical Data

Categories or groups

Examples: Color (red, blue, green), Political party, Gender, Yes/No responses

Chi-square tests analyze this type!

Quantitative Data

Numerical measurements

Examples: Height, weight, test scores, temperature

Use t-tests, ANOVA, or regression for this type

Key Concept

Chi-square tests are designed specifically for analyzing categorical data. When you have counts or frequencies in different categories, chi-square tests help you determine if the observed pattern differs significantly from what you'd expect.
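To make "counts or frequencies in different categories" concrete, here is a minimal Python sketch that tallies raw categorical responses into observed frequencies. The list of responses is made-up illustration data, not data from this lesson.

```python
from collections import Counter

# Hypothetical raw responses (illustration only, not data from this lesson)
responses = ["red", "blue", "red", "green", "blue", "red", "green", "red"]

# Tally each category; these counts are the "observed frequencies"
observed = Counter(responses)

for category, count in sorted(observed.items()):
    print(f"{category}: {count}")
```

The resulting counts, one per category, are the kind of input every chi-square test in this module starts from.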

The Chi-Square (χ²) Distribution

The chi-square distribution is a family of probability distributions that forms the foundation for chi-square tests. Understanding its properties helps us interpret test results correctly.

Never Negative

χ² ≥ 0

Values cannot be negative

Right-Skewed

Tail extends to the right

For small df, most values lie near zero; the mean of the distribution equals df

Shape Varies

Determined by degrees of freedom (df)

More df → more symmetric

Chi-Square Distribution Shapes

The distribution becomes more symmetric as degrees of freedom increase
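One way to see these properties for yourself is by simulation. A chi-square variable with df degrees of freedom can be generated as the sum of df squared independent standard normal values. The sketch below uses that fact; the seed, the number of draws, and the three df values are arbitrary choices made for illustration.

```python
import random
import statistics

def simulate_chi_square(df, n_draws, rng):
    """Draw chi-square values as sums of df squared standard normals."""
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
            for _ in range(n_draws)]

def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation divided by sd cubed."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)

rng = random.Random(0)  # fixed seed so the run is repeatable
results = {}
for df in (2, 5, 20):
    draws = simulate_chi_square(df, 20_000, rng)
    results[df] = (min(draws), statistics.fmean(draws), sample_skewness(draws))
    print(f"df={df:2d}  min={results[df][0]:.3f}  "
          f"mean={results[df][1]:.2f}  skewness={results[df][2]:.2f}")
```

In a run like this, the minimum should never drop below zero, the sample mean should sit close to df, and the skewness should stay positive while shrinking as df grows, which is the "more symmetric" behavior described above.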

Chi-Square Goodness of Fit Test

The chi-square goodness of fit test determines whether sample data fits an expected or hypothesized distribution.

When to Use Goodness of Fit Test

One categorical variable
One sample
Purpose: Test if the observed frequencies match expected frequencies

Research question pattern: "Does the distribution of [variable] match [expected distribution]?"

Step-by-Step Procedure

Step 1: State the Hypotheses

H₀ (Null Hypothesis): The data follows the specified distribution

Hₐ (Alternative Hypothesis): The data does NOT follow the specified distribution

Step 2: Check Conditions

Random sampling
Independent observations
All expected frequencies ≥ 5 (crucial condition!)

Step 3: Calculate Expected Frequencies

Expected Frequency Formula:

E = n × p

Where: n = total sample size, p = expected proportion for that category
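The formula E = n × p, together with the "all expected frequencies ≥ 5" condition from Step 2, can be sketched in Python as follows. The function names and the three-category example are hypothetical, chosen only for illustration.

```python
def expected_frequencies(n, proportions):
    """Return E = n * p for each hypothesized proportion p."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("hypothesized proportions should sum to 1")
    return [n * p for p in proportions]

def expected_counts_ok(expected, minimum=5):
    """Check the Step 2 condition that every expected frequency is >= 5."""
    return all(e >= minimum for e in expected)

# Hypothetical example: 80 observations across three categories
expected = expected_frequencies(80, [0.5, 0.3, 0.2])
print(expected)
print(expected_counts_ok(expected))
```

Note that the expected frequencies do not have to be whole numbers, and the condition is checked on the expected counts, not the observed ones.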

Step 4: Calculate the Test Statistic

Chi-Square Test Statistic:

χ² = Σ[(O - E)² / E]

Where: O = observed frequency, E = expected frequency

Sum over all categories
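As a minimal sketch, the test statistic can be computed with a single sum over paired observed and expected counts. The function name and the example counts below are hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need one entry per category")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for three categories (illustration only)
observed = [45, 20, 15]
expected = [40, 24, 16]
print(round(chi_square_statistic(observed, expected), 3))
```

Because each term is a squared difference divided by a positive expected count, the statistic can never be negative, and it equals zero only when every observed count matches its expected count exactly.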

Step 5: Find Degrees of Freedom

Degrees of Freedom:

df = k - 1

Where: k = number of categories

Step 6: Find the p-value or Critical Value

Use a chi-square distribution table or technology, with the df from Step 5

Decision rule: Reject H₀ if χ² > critical value OR if p-value < α
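Steps 5 and 6 can be sketched in Python using a small lookup table in place of a printed chi-square table. The critical values below are the standard α = 0.05 table entries for df = 1 through 6; the function names and the example statistics are hypothetical.

```python
# Upper-tail critical values at alpha = 0.05, as printed in standard
# chi-square tables (rounded to three decimals), keyed by degrees of freedom.
CRITICAL_VALUES_05 = {1: 3.841, 2: 5.991, 3: 7.815,
                      4: 9.488, 5: 11.070, 6: 12.592}

def degrees_of_freedom(num_categories):
    """Step 5: df = k - 1."""
    return num_categories - 1

def reject_null(chi_square, num_categories):
    """Step 6 at alpha = 0.05: reject H0 when the statistic exceeds the critical value."""
    df = degrees_of_freedom(num_categories)
    return chi_square > CRITICAL_VALUES_05[df]

# Hypothetical statistics for a 4-category test (df = 3, critical value 7.815)
print(reject_null(9.2, 4))  # statistic beyond the critical value
print(reject_null(3.1, 4))  # statistic below the critical value
```

The comparison is one-sided on purpose: only large values of χ² count as evidence against H₀.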

Step 7: Make a Conclusion

State your conclusion in context of the problem

Complete Example: Testing a Fair Die

Example 1: Is the Die Fair?

Problem: A gambler suspects a six-sided die might be unfair. They roll the die 120 times and record the outcomes:

Outcome	1	2	3	4	5	6
Observed Frequency	15	18	22	17	26	22

Question: At the α = 0.05 significance level, is there evidence that the die is unfair?

Solution:

Step 1: State the hypotheses

H₀: The die is fair (all outcomes are equally likely, p = 1/6 for each)
Hₐ: The die is not fair (outcomes are not equally likely)

Step 2: Check conditions

Assume the rolls are independent
We'll check that all expected frequencies are ≥ 5 once they are computed in Step 3

Step 3: Calculate expected frequencies

For a fair die, each outcome should appear 1/6 of the time:

E = n × p = 120 × (1/6) = 20 for each outcome

Outcome	1	2	3	4	5	6
Observed (O)	15	18	22	17	26	22
Expected (E)	20	20	20	20	20	20

All expected frequencies are 20 ≥ 5, so the condition is met!

Step 4: Calculate the chi-square test statistic

χ² = Σ[(O - E)² / E]

Outcome	O	E	O - E	(O - E)²	(O - E)²/E
1	15	20	-5	25	1.25
2	18	20	-2	4	0.20
3	22	20	2	4	0.20
4	17	20	-3	9	0.45
5	26	20	6	36	1.80
6	22	20	2	4	0.20
Total	120	120			4.10

χ² = 4.10

Step 5: Find degrees of freedom

df = k - 1 = 6 - 1 = 5

Step 6: Find p-value or critical value

Using a chi-square table with df = 5 and α = 0.05:

Critical value = 11.07

Since our χ² = 4.10 < 11.07, we fail to reject H₀

(Alternatively, p-value ≈ 0.535 > 0.05, so we fail to reject H₀)

Step 7: Conclusion

At the 0.05 significance level, there is insufficient evidence to conclude that the die is unfair. The observed frequencies are consistent with a fair die.
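The arithmetic in this example can be double-checked with a few lines of Python. This is only a sketch of the calculation shown above; the critical value 11.07 is typed in from the table rather than computed.

```python
observed = [15, 18, 22, 17, 26, 22]   # counts for outcomes 1 through 6
n = sum(observed)                      # 120 rolls
expected = [n / 6] * 6                 # fair die: E = n * (1/6) for each outcome

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
critical_value = 11.07                 # df = 5, alpha = 0.05, from the table

print(f"chi-square = {chi_square:.2f}, df = {df}")
if chi_square > critical_value:
    decision = "reject H0"
else:
    decision = "fail to reject H0"
print(decision)
```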

Check Your Understanding

Question: In the die example above, which category contributed most to the chi-square statistic?

Answer: Outcome 5 (contributed 1.80)

Explanation: Outcome 5 had the largest deviation from expected (26 observed vs 20 expected), which resulted in the largest contribution (1.80) to the total chi-square statistic. The larger the contribution, the more that category differs from what we'd expect.
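To see every category's contribution at once, the (O - E)²/E column can be recomputed and the largest entry picked out. A minimal Python sketch using the die data from Example 1:

```python
observed = {1: 15, 2: 18, 3: 22, 4: 17, 5: 26, 6: 22}
expected = 20  # same expected count for every outcome of a fair die

# One (O - E)^2 / E term per outcome
contributions = {face: (o - expected) ** 2 / expected
                 for face, o in observed.items()}
largest = max(contributions, key=contributions.get)

for face, value in contributions.items():
    print(f"outcome {face}: {value:.2f}")
print(f"largest contribution: outcome {largest}")
```

Looking at individual contributions is a useful habit: even when the overall test is not significant, it shows which categories sit farthest from their expected counts.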

Another Example: M&M Colors

Example 2: M&M Color Distribution

Problem: According to Mars, Inc., M&M's should be distributed by color as follows:

Brown: 13%
Yellow: 14%
Red: 13%
Blue: 24%
Orange: 20%
Green: 16%

A bag of 200 M&M's contains:

Color	Brown	Yellow	Red	Blue	Orange	Green
Observed	20	32	28	45	42	33

Question: At α = 0.05, does this bag's distribution differ from the claimed distribution?

Complete Solution:

Step 1: Hypotheses

H₀: The color distribution matches the claimed percentages
Hₐ: The color distribution does NOT match the claimed percentages

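Step 2: Check conditions

Assume the bag is a random sample of M&M's
All expected frequencies must be ≥ 5 (verified after Step 3)

Step 3: Expected frequencies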

Color	Proportion (p)	Expected (E = 200 × p)
Brown	0.13	26
Yellow	0.14	28
Red	0.13	26
Blue	0.24	48
Orange	0.20	40
Green	0.16	32

All expected frequencies are ≥ 5, so the condition is met

Step 4: Calculate χ²

Color	O	E	(O-E)²/E
Brown	20	26	1.385
Yellow	32	28	0.571
Red	28	26	0.154
Blue	45	48	0.188
Orange	42	40	0.100
Green	33	32	0.031
Total			2.429

χ² = 2.429

Step 5: Degrees of freedom

df = 6 - 1 = 5

Step 6: Critical value

With df = 5 and α = 0.05, critical value = 11.07

Since 2.429 < 11.07, we fail to reject H₀

(p-value ≈ 0.787)

Step 7: Conclusion

At the 0.05 significance level, there is insufficient evidence to conclude that the color distribution differs from the claimed distribution. The bag's colors are consistent with Mars' stated percentages.
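If you would rather compute the p-value than read a table, the sketch below reproduces this example in Python using only the standard library. The helper chi2_sf is a hypothetical name; it evaluates the standard closed-form expressions for the chi-square upper-tail probability at whole-number df, and is offered as one possible approach rather than the lesson's prescribed method.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    #         * sum_{r=1}^{(df-1)/2} x^(r-1) / (1*3*...*(2r-1))
    term, total = 1.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total)

proportions = {"Brown": 0.13, "Yellow": 0.14, "Red": 0.13,
               "Blue": 0.24, "Orange": 0.20, "Green": 0.16}
observed = {"Brown": 20, "Yellow": 32, "Red": 28,
            "Blue": 45, "Orange": 42, "Green": 33}

n = sum(observed.values())                               # 200 candies
expected = {color: n * p for color, p in proportions.items()}
chi_square = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)
df = len(observed) - 1
p_value = chi2_sf(chi_square, df)

print(f"chi-square = {chi_square:.3f}, df = {df}, p-value = {p_value:.3f}")
```

The printed statistic and p-value should agree with the hand calculation above to about three decimal places.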

Check Your Understanding

Question: A researcher wants to test if births are evenly distributed across the 7 days of the week. They collect data on 350 births. What are the expected frequencies if births are truly evenly distributed?

Answer: 50 births per day

Explanation: If births are evenly distributed across 7 days, each day should have the same proportion (1/7). Expected frequency = n × p = 350 × (1/7) = 50 births per day. This is what we'd expect under the null hypothesis of equal distribution.
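The same calculation as a short Python sketch, assuming the seven days are labeled as below (the labels are illustrative; only the count of 7 matters):

```python
n_births = 350
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Under H0 every day has proportion 1/7, so E = n * (1/7) = n / 7
expected = {day: n_births / len(days) for day in days}

print(expected["Mon"])
print(all(e >= 5 for e in expected.values()))  # expected-frequency condition
```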

Key Takeaways

Remember These Points

Purpose: The goodness of fit test checks whether observed frequencies match expected frequencies
Formula: χ² = Σ[(O - E)² / E]
Degrees of freedom: df = k - 1 (number of categories minus 1)
Expected frequency: E = n × p (sample size × proportion)
Condition: All expected frequencies must be ≥ 5
Rejection region: Always right-tailed (large χ² = poor fit)
Interpretation: A large χ² means the observed counts differ substantially from the expected counts; a small χ² means a good fit
