Lesson 2: The Regression Equation
Finding the best-fit line, making predictions, and understanding residuals
Learning Objectives
By the end of this lesson, you will be able to:
- Understand the simple linear regression model: ŷ = b₀ + b₁x
- Calculate the slope (b₁) and y-intercept (b₀) of the regression line
- Interpret the slope and intercept in context
- Use the regression equation to make predictions
- Calculate and interpret residuals
- Understand the standard error of the estimate (sₑ)
- Calculate and interpret the coefficient of determination (r²)
- Distinguish between interpolation and extrapolation
1. Simple Linear Regression Model
In Lesson 1, we used scatter plots to visualize relationships and the correlation coefficient (r) to measure their strength. But what if we want to predict the value of y for a given x?
That's where regression comes in!
Simple Linear Regression
Simple linear regression finds the "best-fit" straight line through the data that we can use to predict y from x.
The Regression Equation

ŷ = b₀ + b₁x

Where:
- ŷ (y-hat) = predicted value of y
- b₀ = y-intercept (predicted y when x = 0)
- b₁ = slope (change in ŷ for each 1-unit increase in x)
- x = value of the explanatory variable
[Figure: scatter plot with the least-squares regression line drawn through the data]
The Least-Squares Method
How do we find the "best" line? We use the least-squares method, which finds the line that minimizes the sum of squared residuals.
Least-Squares Criterion
The least-squares regression line is the line that makes the sum of squared vertical distances from points to the line as small as possible.
Minimize: Σ(y - ŷ)²
This line has special properties:
- It always passes through the point (x̄, ȳ)
- The sum of residuals always equals zero: Σ(y - ŷ) = 0
- It's the unique line that minimizes prediction error
2. Calculating Slope and Intercept
Formulas for Slope and Intercept

Slope (b₁):

b₁ = r × (sᵧ / sₓ)

OR equivalently:

b₁ = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²

Y-intercept (b₀):

b₀ = ȳ - b₁x̄
Where:
- r = correlation coefficient
- sₓ = standard deviation of x
- sᵧ = standard deviation of y
- x̄ = mean of x, ȳ = mean of y
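These formulas translate directly into code. Below is a minimal Python sketch (variable names such as `b1` and `s_x` are illustrative), using the summary statistics from the study-hours example worked later in this lesson:

```python
# Slope and intercept from summary statistics
r = 0.993              # correlation coefficient
s_x = 3.162            # standard deviation of x (hours)
s_y = 11.94            # standard deviation of y (points)
x_bar, y_bar = 6, 81   # means of x and y

b1 = r * (s_y / s_x)       # slope: b1 = r * (s_y / s_x)
b0 = y_bar - b1 * x_bar    # intercept: b0 = ȳ - b1·x̄

print(f"slope b1 ≈ {b1:.2f}, intercept b0 ≈ {b0:.2f}")
```

Note that the slope formula is just the correlation rescaled into the units of y per unit of x, which is why the slope always has the same sign as r.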
Interpreting Slope and Intercept
Interpreting the Slope (b₁)
Template: "For each 1-[unit of x] increase in [x variable], [y variable] increases/decreases by [|b₁|] [units of y] on average."
Examples:
- If b₁ = 5.2 for predicting test score from study hours: "For each additional hour of study, the test score increases by 5.2 points on average."
- If b₁ = -3000 for predicting car value from age: "For each additional year of age, the car's value decreases by $3,000 on average."
Interpreting the Y-intercept (b₀)
Template: "When [x variable] = 0, the predicted [y variable] is [b₀] [units of y]."
Warning: The intercept is only meaningful if x = 0 is within the range of the data and makes sense contextually. Otherwise, it's just a mathematical necessity for the equation.
Examples:
- If b₀ = 50 for predicting test score from study hours: "When study hours = 0, the predicted test score is 50 points." (This makes sense - some baseline score without studying)
- If b₀ = -100 for predicting weight from height: "When height = 0 inches, predicted weight is -100 pounds." (This is nonsense! Height of 0 is not in our data range. The intercept is just mathematical.)
Complete Example: Finding the Regression Equation
Let's find the regression equation for predicting test scores from study hours using our data from Lesson 1:
| Study Hours (x) | Test Score (y) |
|---|---|
| 2 | 65 |
| 4 | 75 |
| 6 | 80 |
| 8 | 90 |
| 10 | 95 |
From Lesson 1, we calculated:
- r = 0.993 (correlation coefficient)
- x̄ = 6 hours
- ȳ = 81 points
Step 1: Calculate standard deviations
- sₓ = √[Σ(x - x̄)² / (n-1)] = √(40/4) = √10 ≈ 3.162 hours
- sᵧ = √[Σ(y - ȳ)² / (n-1)] = √(570/4) = √142.5 ≈ 11.94 points
Step 2: Calculate slope
b₁ = r × (sᵧ / sₓ) = 0.993 × (11.94 / 3.162) = 0.993 × 3.775 ≈ 3.75 points/hour
Step 3: Calculate intercept
b₀ = ȳ - b₁x̄ = 81 - (3.75)(6) = 81 - 22.5 = 58.5 points
Step 4: Write the equation

ŷ = 58.5 + 3.75x
Interpretation:
- Slope (3.75): For each additional hour of study, the test score increases by 3.75 points on average.
- Intercept (58.5): When study hours = 0, the predicted test score is 58.5 points.
3. Making Predictions with the Regression Equation
Once we have the regression equation, we can use it to predict y for any value of x by simply plugging in the x value.
Example: Making Predictions
Using our equation: ŷ = 58.5 + 3.75x
Question 1: Predict the test score for a student who studies 7 hours.
Solution:
ŷ = 58.5 + 3.75(7) = 58.5 + 26.25 = 84.75 points
Interpretation: We predict that a student who studies 7 hours will score approximately 84.75 points on the test.
Question 2: Predict the test score for a student who studies 5 hours.
Solution:
ŷ = 58.5 + 3.75(5) = 58.5 + 18.75 = 77.25 points
Question 3: Predict the test score for a student who studies 15 hours.
Solution:
ŷ = 58.5 + 3.75(15) = 58.5 + 56.25 = 114.75 points
Problem! This prediction is 114.75 points, but test scores can't exceed 100! This is an example of extrapolation - predicting outside the data range.
Interpolation vs Extrapolation
Interpolation: Making predictions within the range of x values in the data
- Generally safe and reliable
- Example: Our data ranges from x = 2 to x = 10, so predicting at x = 7 is interpolation
Extrapolation: Making predictions outside the range of x values in the data
- Risky and potentially unreliable
- Assumes the linear relationship continues beyond the observed data
- Can lead to nonsensical predictions
- Example: Our data ranges from x = 2 to x = 10, so predicting at x = 15 is extrapolation
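One practical safeguard is to have prediction code flag extrapolation. A minimal sketch, assuming the fitted equation ŷ = 58.5 + 3.75x and the observed range of 2 to 10 hours:

```python
def predict(x, b0=58.5, b1=3.75, x_min=2, x_max=10):
    """Return ŷ = b0 + b1·x, warning if x falls outside the observed range."""
    if not (x_min <= x <= x_max):
        print(f"Warning: x = {x} is outside [{x_min}, {x_max}] -- extrapolation!")
    return b0 + b1 * x

print(predict(7))    # interpolation: within the 2-10 hour range
print(predict(15))   # extrapolation: prints a warning first
```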
Dangers of Extrapolation
Avoid extrapolation whenever possible! Relationships that are linear within the observed range may become non-linear or break down entirely outside that range.
Examples of extrapolation gone wrong:
- Predicting a 200-inch person's weight (height-weight relationship changes at extremes)
- Predicting test scores for 20 hours of study (diminishing returns, fatigue)
- Predicting house price for a 50-bedroom house (relationship breaks down at extremes)
4. Residuals
Predictions are rarely perfect. The difference between the actual y value and the predicted ŷ value is called a residual.
Residuals
A residual is the difference between the observed value and the predicted value:

Residual = y - ŷ
- Positive residual: Actual y is above the line (prediction was too low)
- Negative residual: Actual y is below the line (prediction was too high)
- Zero residual: Point falls exactly on the line (perfect prediction)
[Figure: scatter plot showing residuals as vertical segments from each data point to the regression line]
Example: Calculating Residuals
Using our equation ŷ = 58.5 + 3.75x, let's calculate residuals for our five data points:
| x (hours) | y (actual score) | ŷ (predicted) | Residual (y - ŷ) | Interpretation |
|---|---|---|---|---|
| 2 | 65 | 58.5 + 3.75(2) = 66 | 65 - 66 = -1 | Slightly below predicted |
| 4 | 75 | 58.5 + 3.75(4) = 73.5 | 75 - 73.5 = +1.5 | Slightly above predicted |
| 6 | 80 | 58.5 + 3.75(6) = 81 | 80 - 81 = -1 | Slightly below predicted |
| 8 | 90 | 58.5 + 3.75(8) = 88.5 | 90 - 88.5 = +1.5 | Slightly above predicted |
| 10 | 95 | 58.5 + 3.75(10) = 96 | 95 - 96 = -1 | Slightly below predicted |
| | | Sum of residuals: | 0 | Always! |
Note: The sum of residuals always equals 0 for the least-squares line. This is a mathematical property of the least-squares method.
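Computing the residuals in code confirms both the table above and the sum-to-zero property (a sketch using the fitted equation from this example):

```python
hours = [2, 4, 6, 8, 10]
scores = [65, 75, 80, 90, 95]
b0, b1 = 58.5, 3.75   # fitted intercept and slope

# residual = actual y minus predicted ŷ for each data point
residuals = [y - (b0 + b1 * x) for x, y in zip(hours, scores)]

print(residuals)       # matches the table: [-1.0, 1.5, -1.0, 1.5, -1.0]
print(sum(residuals))  # 0.0 -- a property of the least-squares line
```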
Standard Error of the Estimate (sₑ)
While individual residuals tell us about specific points, the standard error measures the typical size of prediction errors.
Standard Error of the Estimate

sₑ = √[Σ(y - ŷ)² / (n - 2)]

This measures the typical distance points fall from the regression line.
- Smaller sₑ = better fit (predictions more accurate)
- Larger sₑ = poorer fit (more prediction error)
- Units are the same as y
- We use (n - 2) in the denominator because we estimate 2 parameters (b₀ and b₁)
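The formula can be sketched in a few lines of Python, using the study-hours data and fitted equation from earlier (variable names are illustrative):

```python
import math

hours = [2, 4, 6, 8, 10]
scores = [65, 75, 80, 90, 95]
b0, b1 = 58.5, 3.75
n = len(hours)

# sum of squared residuals, Σ(y - ŷ)²
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))

# divide by n - 2 because two parameters (b0 and b1) were estimated
s_e = math.sqrt(sse / (n - 2))
print(f"s_e ≈ {s_e:.2f} points")
```

Here sₑ ≈ 1.58 points: a typical prediction from this line misses the actual test score by about a point and a half.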
5. Coefficient of Determination (r²)
The coefficient of determination, r², tells us how much of the variation in y is explained by x.
Coefficient of Determination (r²)

r² = (correlation coefficient)²

Interpretation: r² is the proportion of variation in y explained by the linear relationship with x.
- Range: 0 ≤ r² ≤ 1 (or 0% to 100%)
- r² = 0: x explains none of the variation in y
- r² = 1: x explains all of the variation in y (perfect fit)
- r² = 0.75: x explains 75% of the variation in y; 25% is unexplained
Note: r² is always positive (even when r is negative) because it's a squared value.
Example: Calculating and Interpreting r²
From our study hours example:
- r = 0.993
- r² = (0.993)² = 0.986 = 98.6%
Interpretation: 98.6% of the variation in test scores is explained by the linear relationship with study hours. Only 1.4% of the variation is due to other factors (aptitude, prior knowledge, sleep, etc.).
This tells us: Study hours is an excellent predictor of test scores in this dataset!
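The r² arithmetic is easy to check in Python (a sketch, using values from the examples in this lesson):

```python
r = 0.993
r_squared = r ** 2   # proportion of variation in y explained by x
print(f"r² = {r_squared:.1%} of the variation in test scores is explained")

# r² is positive even when r is negative:
r_neg = -0.90        # car age vs value
print(round(r_neg ** 2, 2))
```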
More r² Examples
Example 1: Height and weight, r = 0.70
- r² = (0.70)² = 0.49 = 49%
- Interpretation: 49% of variation in weight is explained by height. 51% is due to other factors (muscle mass, diet, genetics, etc.).
Example 2: Car age and value, r = -0.90
- r² = (-0.90)² = 0.81 = 81%
- Interpretation: 81% of variation in car value is explained by age. 19% is due to other factors (condition, mileage, make/model, etc.).
- Note: r² is positive even though r is negative!
Example 3: Shoe size and salary, r = 0.25
- r² = (0.25)² = 0.0625 = 6.25%
- Interpretation: Only 6.25% of variation in salary is "explained" by shoe size. This weak relationship is likely spurious or due to confounding variables.
Check Your Understanding
Question 1: The regression equation for predicting car price (in thousands) from age (in years) is ŷ = 25 - 2.5x. Interpret the slope.
Slope = -2.5
Interpretation: For each additional year of age, the car's price decreases by $2,500 on average (or 2.5 thousand dollars).
The negative sign indicates that price decreases as age increases.
Question 2: Using the equation from Question 1, predict the price of a 6-year-old car.
Solution:
ŷ = 25 - 2.5(6) = 25 - 15 = 10 thousand dollars = $10,000
Interpretation: We predict a 6-year-old car will be worth approximately $10,000.
Question 3: If a 6-year-old car is actually worth $12,000, what is the residual?
Solution:
Residual = y - ŷ = 12 - 10 = +2 thousand dollars
Interpretation: The actual price is $2,000 higher than predicted. This car is worth more than the model predicts (perhaps it's well-maintained, low mileage, or a desirable model).
The positive residual means the point is above the regression line.
Question 4: If r = 0.60 for the relationship between exercise hours and weight loss, what is r²? Interpret it.
Solution:
r² = (0.60)² = 0.36 = 36%
Interpretation: 36% of the variation in weight loss is explained by exercise hours. The remaining 64% is due to other factors such as diet, metabolism, genetics, sleep, etc.
This tells us that exercise is a meaningful factor in weight loss, but it's not the only factor - diet and other variables matter significantly too.