Lesson 3: Hypothesis Testing in Regression
Testing for significant linear relationships and checking regression conditions
Learning Objectives
By the end of this lesson, you will be able to:
- Conduct a hypothesis test for population correlation (ρ)
- Conduct a hypothesis test for population slope (β₁)
- Understand the equivalence of testing ρ = 0 and β₁ = 0
- Check the LINE conditions for regression inference
- Create and interpret residual plots
- Identify violations of regression assumptions
- Make decisions about the significance of linear relationships
1. Introduction to Regression Inference
In Lessons 1-2, we calculated sample statistics (r, b₀, b₁, r²) to describe the relationship in our specific dataset. But we often want to make inferences about the population:
- Is there really a linear relationship in the population, or did we just observe one by chance in our sample?
- Can we conclude that the apparent relationship is statistically significant?
To answer these questions, we use hypothesis testing!
Sample Statistics vs Population Parameters
| Quantity | Population Parameter (Unknown) | Sample Statistic (Known) |
|---|---|---|
| Correlation | ρ (rho) | r |
| Slope | β₁ (beta-one) | b₁ |
| Intercept | β₀ (beta-zero) | b₀ |
We use the sample statistics (r, b₁) to make inferences about the population parameters (ρ, β₁).
2. Hypothesis Test for Correlation (ρ)
To test whether there's a significant linear relationship in the population, we test whether the population correlation ρ equals zero.
Hypothesis Test for Correlation
Hypotheses:
- H₀: ρ = 0 (No linear relationship in the population)
- Hₐ: ρ ≠ 0 (There is a linear relationship in the population)
Note: We can also use one-sided tests (Hₐ: ρ > 0 or Hₐ: ρ < 0) if we have prior reason to expect a specific direction.
Test Statistic for Correlation
t = r√[(n-2)/(1-r²)]
Degrees of freedom: df = n - 2
Where:
- r = sample correlation coefficient
- n = sample size
Decision:
- Compare |t| to the critical value from the t-distribution with df = n - 2, and reject H₀ if |t| exceeds it, OR
- Use the p-value: If p-value < α, reject H₀
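The critical-value version of this decision rule can be sketched in a few lines of Python. The function name `correlation_t_test` is hypothetical, and the critical value is not computed here: it is assumed to be read from a t-table (for example, about 2.101 for df = 18 at α = 0.05, two-tailed).

```python
import math

def correlation_t_test(r, n, t_critical):
    """Return (t, df, reject) for a two-sided test of H0: rho = 0.

    t_critical is assumed to come from a t-table for df = n - 2.
    """
    if n < 3 or not -1 < r < 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    df = n - 2
    t = r * math.sqrt(df / (1 - r ** 2))
    # Two-sided rule: reject when |t| exceeds the critical value
    return t, df, abs(t) > t_critical

t, df, reject = correlation_t_test(r=0.65, n=20, t_critical=2.101)
print(f"t = {t:.2f}, df = {df}, reject H0: {reject}")
```

Passing the critical value in as an argument keeps the sketch free of any statistics library; with such a library available, the critical value or the p-value would normally be computed directly.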
Complete Example: Testing Correlation
Research Question: Is there a significant linear relationship between study hours and test scores?
Data: n = 20 students, r = 0.65
Significance level: α = 0.05
Step 1: State Hypotheses
- H₀: ρ = 0 (No linear relationship between study hours and test scores)
- Hₐ: ρ ≠ 0 (There is a linear relationship between study hours and test scores)
Step 2: Check Conditions (see Section 4 below)
Assume conditions are met for this example.
Step 3: Calculate Test Statistic
t = r√[(n-2)/(1-r²)]
t = 0.65√[(20-2)/(1-0.65²)]
t = 0.65√[18/(1-0.4225)]
t = 0.65√[18/0.5775]
t = 0.65√31.17
t = 0.65 × 5.58
t ≈ 3.63
Degrees of freedom: df = 20 - 2 = 18
Step 4: Find P-value
Using a t-distribution table or technology with df = 18 for a two-tailed test:
p-value ≈ 0.002
Step 5: Make Decision
Since p-value (0.002) < α (0.05), we reject H₀.
Step 6: State Conclusion
There is sufficient evidence to conclude that there is a significant linear relationship between study hours and test scores (p = 0.002).
In other words, a sample correlation as strong as r = 0.65 would be very unlikely if the population correlation were zero, so chance alone is not a plausible explanation for it.
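When no t-table or statistics package is at hand, the p-value in Step 4 can be approximated numerically. The sketch below integrates the t density with Simpson's rule using only the Python standard library. The helper names `t_pdf` and `two_sided_p_value` are hypothetical, and this is one possible approach rather than the method the lesson prescribes.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_value(t, df, steps=10_000):
    """Approximate P(|T| >= |t|) with Simpson's rule (steps must be even)."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central_half)

# Values from the worked example: r = 0.65, n = 20
r, n = 0.65, 20
df = n - 2
t = r * math.sqrt(df / (1 - r ** 2))
p = two_sided_p_value(t, df)
print(f"t = {t:.2f}, df = {df}, p-value = {p:.4f}")
```

The printed p-value should agree with the table-based value of roughly 0.002 used in Step 4.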
3. Hypothesis Test for Slope (β₁)
An equivalent way to test for a linear relationship is to test whether the population slope β₁ equals zero.
Hypothesis Test for Slope
Hypotheses:
- H₀: β₁ = 0 (Slope is zero; no linear relationship)
- Hₐ: β₁ ≠ 0 (Slope is not zero; there is a linear relationship)
Test Statistic for Slope
t = (b₁ - 0) / SE(b₁), with SE(b₁) = sₑ / √[Σ(x - x̄)²]
Where:
- b₁ = sample slope
- sₑ = standard error of estimate
- Σ(x - x̄)² = sum of squared deviations of x
Degrees of freedom: df = n - 2
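As a minimal sketch, the pieces of the slope test statistic (b₁, sₑ, and Σ(x - x̄)²) can be computed from raw data with the standard library alone. The function name `slope_t_statistic` and the small dataset are made up for illustration and are not taken from the lesson.

```python
import math

def slope_t_statistic(x, y):
    """Return (b1, se_b1, t, df) for testing H0: beta1 = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)  # sum of squared deviations of x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                            # sample slope
    b0 = y_bar - b1 * x_bar                   # sample intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))            # standard error of estimate
    se_b1 = s_e / math.sqrt(sxx)              # standard error of the slope
    return b1, se_b1, b1 / se_b1, n - 2

# Hypothetical data: hours studied and test scores for 8 students
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 60, 68, 70, 75, 79]
b1, se_b1, t, df = slope_t_statistic(hours, scores)
print(f"b1 = {b1:.3f}, SE(b1) = {se_b1:.3f}, t = {t:.2f}, df = {df}")
```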
Key Insight: Equivalent Tests
In simple linear regression (one predictor), testing ρ = 0 is EXACTLY THE SAME as testing β₁ = 0!
- Same t-value (may look different due to rounding)
- Same p-value
- Same conclusion
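Why? The population slope and correlation are linked by β₁ = ρ(σ_y/σ_x), where σ_x and σ_y are the population standard deviations of x and y. Because both standard deviations are positive, β₁ = 0 exactly when ρ = 0. These are two ways of asking the same question: "Is there a linear relationship?"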
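The equality of the two t-values can also be checked algebraically at the sample level. The derivation below is a sketch that introduces shorthand not used elsewhere in this lesson: S_xx = Σ(x - x̄)², S_yy = Σ(y - ȳ)², and S_xy = Σ(x - x̄)(y - ȳ), so that r = S_xy/√(S_xx·S_yy) and b₁ = S_xy/S_xx. SSE denotes the sum of squared residuals.

```latex
\begin{align*}
s_e^2 &= \frac{\mathrm{SSE}}{n-2}
       = \frac{S_{yy} - S_{xy}^2 / S_{xx}}{n-2}
       = \frac{S_{yy}\,(1 - r^2)}{n-2}, \\[4pt]
t &= \frac{b_1}{s_e / \sqrt{S_{xx}}}
   = \frac{S_{xy}}{S_{xx}} \cdot
     \frac{\sqrt{S_{xx}}\,\sqrt{n-2}}{\sqrt{S_{yy}\,(1 - r^2)}} \\[4pt]
  &= \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} \cdot \sqrt{\frac{n-2}{1-r^2}}
   = r\,\sqrt{\frac{n-2}{1-r^2}}.
\end{align*}
```

So the slope statistic b₁/SE(b₁) and the correlation statistic r√[(n-2)/(1-r²)] are the same number, which is why any difference seen in practice comes from rounding.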
Example: Testing Slope
Scenario: Predicting house price (in $1000s) from square footage
Data:
- n = 25 houses
- Regression equation: ŷ = 50 + 0.12x (where x = square feet)
- b₁ = 0.12 (thousand $/sq ft)
- SE(b₁) = 0.035
Test at α = 0.05:
Hypotheses
- H₀: β₁ = 0 (No linear relationship between square footage and price)
- Hₐ: β₁ ≠ 0 (There is a linear relationship between square footage and price)
Test Statistic
t = (b₁ - 0) / SE(b₁) = (0.12 - 0) / 0.035 = 3.43
df = 25 - 2 = 23
P-value and Decision
Using a t-table or technology with df = 23 (two-tailed): p-value ≈ 0.002
Since p-value (0.002) < α (0.05), we reject H₀.
Conclusion
There is sufficient evidence to conclude that square footage is a significant predictor of house price (t = 3.43, p = 0.002).
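The positive slope (b₁ = 0.12) indicates that larger houses tend to have higher prices: each additional square foot is associated with a predicted increase of about 0.12 thousand dollars ($120) in price.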
4. Conditions for Regression Inference (LINE)
Before we can trust the results of regression inference (hypothesis tests or confidence intervals), we must check that certain conditions are met. Remember the acronym LINE:
LINE Conditions for Regression Inference
- L - Linearity: The relationship between x and y is linear
Check: Scatter plot shows a roughly linear pattern (not curved)
- I - Independence: Observations are independent of each other
Check: Consider how the data were collected (for example, random sampling, and no time-ordered measurements in which one observation influences the next)
- N - Normality: The residuals (errors) are approximately normally distributed
Check: Histogram or normal probability plot of residuals looks roughly bell-shaped
Note: This condition is less critical for large samples (n ≥ 30) due to the Central Limit Theorem
- E - Equal variance (Homoscedasticity): The spread of residuals is constant across all x values
Check: Residual plot shows random scatter with constant vertical spread (no "fan shape")
Checking Conditions with Residual Plots
The most important diagnostic tool is the residual plot: a scatter plot of residuals vs x (or vs ŷ).
Good vs Bad Residual Plots
| Pattern in Residual Plot | Indication | Condition Violated |
|---|---|---|
| Random scatter around zero with constant spread | Good! Conditions met | None |
| Curved pattern (U-shape, etc.) | Non-linear relationship | Linearity |
| Fan shape (spread increases) | Non-constant variance | Equal Variance |
| Outliers or extreme points | May affect regression; investigate | Potential issue |
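To make the curved-pattern row of the table concrete, the sketch below fits a least-squares line to data that is deliberately curved (y = x², an invented example) and prints the residuals. The helper name `residuals` is hypothetical. The signs of the residuals trace a U-shape: positive at both ends and negative in the middle.

```python
def residuals(x, y):
    """Residuals y - y_hat from the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Invented curved data: y = x^2 for x = 1, ..., 9
x = list(range(1, 10))
y = [xi ** 2 for xi in x]
res = residuals(x, y)
for xi, e in zip(x, res):
    print(f"x = {xi}: residual = {e:+.2f}")
```

Plotting these residuals against x with any plotting tool gives the residual plot itself; the printed signs are only a text stand-in for that picture.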
What If Conditions Are Violated?
If LINE conditions are not met, the hypothesis test results may be unreliable. Options include:
- Transform variables: Try log(x), √x, x², etc. to linearize the relationship
- Remove outliers: If there are clear data errors or influential outliers (but be careful - don't remove data just because it doesn't fit!)
- Use non-linear regression: If the relationship is genuinely curved
- Collect more data: Larger samples make the tests less sensitive to non-normal residuals, but they do not fix non-linearity, dependence, or unequal variance
- Be cautious: Interpret results with appropriate skepticism
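As an illustration of the first option, the sketch below uses invented data that grows exponentially and compares the sample correlation before and after taking the logarithm of y. The helper `pearson_r` is hypothetical, and transforming y rather than x is an assumption made for this example; which variable to transform depends on the data.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient r between x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# Invented data with exponential growth: y = e^(0.5x)
x = list(range(10))
y = [math.exp(0.5 * xi) for xi in x]

r_raw = pearson_r(x, y)                           # curved relationship
r_log = pearson_r(x, [math.log(yi) for yi in y])  # log(y) is linear in x
print(f"r before transform: {r_raw:.3f}")
print(f"r after log(y):     {r_log:.3f}")
```

Because log(y) = 0.5x exactly for this data, the transformed relationship is perfectly linear. Real data will rarely be this clean, so the residual plot should be rechecked after any transformation.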
Check Your Understanding
Question 1: A researcher finds r = 0.25 with n = 50. Calculate the test statistic for testing H₀: ρ = 0.
Solution:
t = r√[(n-2)/(1-r²)]
t = 0.25√[(50-2)/(1-0.25²)]
t = 0.25√[48/0.9375]
t = 0.25√51.2
t = 0.25 × 7.16
t ≈ 1.79
df = 50 - 2 = 48
Critical value (α = 0.05, two-tailed, df = 48): approximately ±2.01
Since |1.79| < 2.01, we fail to reject H₀. There is insufficient evidence of a significant linear relationship at α = 0.05.
Question 2: Why do we test β₁ = 0 instead of, say, β₁ = 1?
Answer: We test β₁ = 0 because:
- If β₁ = 0, there is NO linear relationship between x and y. The regression line would be horizontal (flat), meaning the predicted value of y does not change as x changes.
- If β₁ ≠ 0, there IS a linear relationship - positive if β₁ > 0, negative if β₁ < 0.
- Testing β₁ = 0 answers the fundamental question: "Is x a useful predictor of y?"
We could test β₁ = some other value if we had a specific theory (e.g., H₀: β₁ = 1 if we think the relationship is one-to-one), but usually we start by testing whether there's any relationship at all.
Question 3: A residual plot shows a clear U-shaped curve. What does this tell you?
Answer: A U-shaped curve in the residual plot indicates that the linearity condition is violated.
Interpretation:
- The relationship between x and y is not linear - it's curved (for example, quadratic)
- A straight line is not the best model for this data
- Predictions from the linear regression will be systematically too high or too low depending on the x value
Solution: Consider using a curved model (quadratic regression, polynomial regression) or transforming variables to linearize the relationship.
Question 4: If you reject H₀: ρ = 0, what will happen when you test H₀: β₁ = 0 with the same data?
Answer: You will also reject H₀: β₁ = 0.
Reason: Testing ρ = 0 and testing β₁ = 0 are equivalent tests. They:
- Yield the same t-statistic (any small difference comes from rounding)
- Yield the same p-value
- Lead to the same conclusion
Both tests answer the same question: "Is there a significant linear relationship?" If one test rejects the null, the other will too.