Learn Without Walls

Lesson 2: The Regression Equation

Finding the best-fit line, making predictions, and understanding residuals


Learning Objectives

By the end of this lesson, you will be able to:

  • Write and interpret the simple linear regression equation ŷ = b₀ + b₁x
  • Calculate the slope and intercept using the least-squares method
  • Use the regression equation to make predictions, distinguishing interpolation from extrapolation
  • Calculate and interpret residuals and the standard error of the estimate
  • Calculate and interpret the coefficient of determination (r²)

1. Simple Linear Regression Model

In Lesson 1, we used scatter plots to visualize relationships and the correlation coefficient (r) to measure their strength. But what if we want to predict the value of y for a given x?

That's where regression comes in!

Simple Linear Regression

Simple linear regression finds the "best-fit" straight line through the data that we can use to predict y from x.

The Regression Equation

ŷ = b₀ + b₁x

Where:

  • ŷ (y-hat) = predicted value of y
  • b₀ = y-intercept (predicted y when x = 0)
  • b₁ = slope (change in ŷ for each 1-unit increase in x)
  • x = value of the explanatory variable

[Figure: Regression Line Through Data]

The Least-Squares Method

How do we find the "best" line? We use the least-squares method, which finds the line that minimizes the sum of squared residuals.

Least-Squares Criterion

The least-squares regression line is the line that makes the sum of squared vertical distances from points to the line as small as possible.

Minimize: Σ(y - ŷ)²

This line has special properties:

  • It always passes through the point (x̄, ȳ)
  • The sum of residuals always equals zero: Σ(y - ŷ) = 0
  • It's the unique line that minimizes the total squared prediction error
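To see the criterion in action, a brute-force search can compare many candidate lines and keep the one with the smallest sum of squared residuals. This is only an illustration, not how the line is found in practice; the grid bounds are chosen by hand, and the data are the study-hours example used later in this lesson:

```python
# Brute-force illustration of the least-squares criterion: among many
# candidate lines y = b0 + b1*x, keep the one with the smallest SSE.
x = [2, 4, 6, 8, 10]
y = [65, 75, 80, 90, 95]

def sse(b0, b1):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Search a hand-picked grid of intercepts (55.0-62.0 by 0.1) and
# slopes (3.00-4.50 by 0.01) for the smallest SSE.
best = min(
    ((b0 / 10, b1 / 100) for b0 in range(550, 621) for b1 in range(300, 451)),
    key=lambda line: sse(*line),
)
print(best)  # → (58.5, 3.75)
```

The winning line also passes through the point of means (x̄, ȳ) = (6, 81), since 58.5 + 3.75 × 6 = 81.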

2. Calculating Slope and Intercept

Formulas for Slope and Intercept

Slope (b₁):

b₁ = r × (sᵧ / sₓ)

OR equivalently:

b₁ = Σ[(x - x̄)(y - ȳ)] / Σ(x - x̄)²

Y-intercept (b₀):

b₀ = ȳ - b₁x̄

Where:

  • r = correlation coefficient
  • sₓ = standard deviation of x
  • sᵧ = standard deviation of y
  • x̄ = mean of x, ȳ = mean of y
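These formulas translate directly into code. A short Python sketch using the study-hours data from Lesson 1 (the same dataset worked by hand below):

```python
# Slope and intercept from the least-squares formulas.
x = [2, 4, 6, 8, 10]
y = [65, 75, 80, 90, 95]
n = len(x)

x_bar = sum(x) / n  # mean of x
y_bar = sum(y) / n  # mean of y

# Slope via the deviation form: b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))

# Intercept: b0 = ȳ - b1·x̄ (forces the line through the point of means)
b0 = y_bar - b1 * x_bar

print(b1, b0)  # → 3.75 58.5
```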

Interpreting Slope and Intercept

Interpreting the Slope (b₁)

Template: "For each 1-[unit of x] increase in [x variable], [y variable] increases/decreases by [|b₁|] [units of y] on average."

Examples:

  • If b₁ = 5.2 for predicting test score from study hours: "For each additional hour of study, the test score increases by 5.2 points on average."
  • If b₁ = -3000 for predicting car value from age: "For each additional year of age, the car's value decreases by $3,000 on average."

Interpreting the Y-intercept (b₀)

Template: "When [x variable] = 0, the predicted [y variable] is [b₀] [units of y]."

Warning: The intercept is only meaningful if x = 0 is within the range of the data and makes sense contextually. Otherwise, it's just a mathematical necessity for the equation.

Examples:

  • If b₀ = 50 for predicting test score from study hours: "When study hours = 0, the predicted test score is 50 points." (This makes sense - some baseline score without studying)
  • If b₀ = -100 for predicting weight from height: "When height = 0 inches, predicted weight is -100 pounds." (This is nonsense! Height of 0 is not in our data range. The intercept is just mathematical.)

Complete Example: Finding the Regression Equation

Let's find the regression equation for predicting test scores from study hours using our data from Lesson 1:

Study Hours (x)    Test Score (y)
2                  65
4                  75
6                  80
8                  90
10                 95

From Lesson 1, we calculated:

  • r = 0.993 (correlation coefficient)
  • x̄ = 6 hours
  • ȳ = 81 points

Step 1: Calculate standard deviations

  • sₓ = √[Σ(x - x̄)² / (n-1)] = √(40/4) = √10 ≈ 3.162 hours
  • sᵧ = √[Σ(y - ȳ)² / (n-1)] = √(570/4) = √142.5 ≈ 11.94 points

Step 2: Calculate slope

b₁ = r × (sᵧ / sₓ) = 0.993 × (11.94 / 3.162) = 0.993 × 3.775 ≈ 3.75 points/hour

Step 3: Calculate intercept

b₀ = ȳ - b₁x̄ = 81 - (3.75)(6) = 81 - 22.5 = 58.5 points

Step 4: Write the equation

ŷ = 58.5 + 3.75x

Interpretation:

  • Slope (3.75): For each additional hour of study, the test score increases by 3.75 points on average.
  • Intercept (58.5): When study hours = 0, the predicted test score is 58.5 points.

3. Making Predictions with the Regression Equation

Once we have the regression equation, we can use it to predict y for any value of x by simply plugging in the x value.

Example: Making Predictions

Using our equation: ŷ = 58.5 + 3.75x

Question 1: Predict the test score for a student who studies 7 hours.

Solution:

ŷ = 58.5 + 3.75(7) = 58.5 + 26.25 = 84.75 points

Interpretation: We predict that a student who studies 7 hours will score approximately 84.75 points on the test.

Question 2: Predict the test score for a student who studies 5 hours.

Solution:

ŷ = 58.5 + 3.75(5) = 58.5 + 18.75 = 77.25 points

Question 3: Predict the test score for a student who studies 15 hours.

Solution:

ŷ = 58.5 + 3.75(15) = 58.5 + 56.25 = 114.75 points

Problem! This prediction is 114.75 points, but test scores can't exceed 100! This is an example of extrapolation - predicting outside the data range.

[Figure: Interpolation vs Extrapolation]

Interpolation vs Extrapolation

Interpolation: Making predictions within the range of x values in the data

  • Generally safe and reliable
  • Example: Our data ranges from x = 2 to x = 10, so predicting at x = 7 is interpolation

Extrapolation: Making predictions outside the range of x values in the data

  • Risky and potentially unreliable
  • Assumes the linear relationship continues beyond the observed data
  • Can lead to nonsensical predictions
  • Example: Our data ranges from x = 2 to x = 10, so predicting at x = 15 is extrapolation
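One way to guard against accidental extrapolation is to check the x-range before predicting. A small Python sketch using the fitted equation ŷ = 58.5 + 3.75x; the helper name and range constants are illustrative, not part of the lesson:

```python
# Prediction helper that flags extrapolation for the study-hours model.
X_MIN, X_MAX = 2, 10  # observed range of study hours in the data

def predict_score(hours):
    """Return (prediction, is_extrapolation) for ŷ = 58.5 + 3.75x."""
    y_hat = 58.5 + 3.75 * hours
    return y_hat, not (X_MIN <= hours <= X_MAX)

print(predict_score(7))   # → (84.75, False): interpolation, safe
print(predict_score(15))  # → (114.75, True): extrapolation, unreliable
```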

Dangers of Extrapolation

Avoid extrapolation whenever possible! Relationships that are linear within the observed range may become non-linear or break down entirely outside that range.

Examples of extrapolation gone wrong:

  • Predicting a 200-inch person's weight (height-weight relationship changes at extremes)
  • Predicting test scores for 20 hours of study (diminishing returns, fatigue)
  • Predicting house price for a 50-bedroom house (relationship breaks down at extremes)

4. Residuals

Predictions are rarely perfect. The difference between the actual y value and the predicted ŷ value is called a residual.

Residuals

A residual is the difference between the observed value and the predicted value:

Residual = y - ŷ = Observed - Predicted

  • Positive residual: Actual y is above the line (prediction was too low)
  • Negative residual: Actual y is below the line (prediction was too high)
  • Zero residual: Point falls exactly on the line (perfect prediction)

[Figure: Visualizing Residuals]

Example: Calculating Residuals

Using our equation ŷ = 58.5 + 3.75x, let's calculate residuals for our five data points:

x (hours)   y (actual score)   ŷ (predicted)            Residual (y - ŷ)    Interpretation
2           65                 58.5 + 3.75(2) = 66      65 - 66 = -1        Slightly below predicted
4           75                 58.5 + 3.75(4) = 73.5    75 - 73.5 = +1.5    Slightly above predicted
6           80                 58.5 + 3.75(6) = 81      80 - 81 = -1        Slightly below predicted
8           90                 58.5 + 3.75(8) = 88.5    90 - 88.5 = +1.5    Slightly above predicted
10          95                 58.5 + 3.75(10) = 96     95 - 96 = -1        Slightly below predicted

Sum of residuals: 0 (always!)

Note: The sum of residuals always equals 0 for the least-squares line. This is a mathematical property of the least-squares method.
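The residual table can be reproduced in a couple of lines. A Python sketch using the fitted equation from this lesson:

```python
# Residuals for the study-hours data under ŷ = 58.5 + 3.75x.
x = [2, 4, 6, 8, 10]
y = [65, 75, 80, 90, 95]

residuals = [yi - (58.5 + 3.75 * xi) for xi, yi in zip(x, y)]
print(residuals)       # → [-1.0, 1.5, -1.0, 1.5, -1.0]
print(sum(residuals))  # → 0.0, the least-squares property
```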

Standard Error of the Estimate (sₑ)

While individual residuals tell us about specific points, the standard error measures the typical size of prediction errors.

Standard Error of the Estimate

sₑ = √[Σ(y - ŷ)² / (n - 2)]

This measures the typical distance points fall from the regression line.

  • Smaller sₑ = better fit (predictions more accurate)
  • Larger sₑ = poorer fit (more prediction error)
  • Units are the same as y
  • We use (n - 2) in the denominator because we estimate 2 parameters (b₀ and b₁)
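Using the residuals from the study-hours table, sₑ works out as follows (a Python sketch):

```python
# Standard error of the estimate: sₑ = √[Σ(y - ŷ)² / (n - 2)].
import math

residuals = [-1.0, 1.5, -1.0, 1.5, -1.0]  # from the residual table above
n = len(residuals)

se = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
print(round(se, 3))  # → 1.581
```

So predictions from this line are typically off by about 1.6 points, which matches the small residuals in the table.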

5. Coefficient of Determination (r²)

The coefficient of determination, r², tells us how much of the variation in y is explained by x.

Coefficient of Determination (r²)

r² = (correlation coefficient)²

Interpretation: r² is the proportion of variation in y explained by the linear relationship with x.

  • Range: 0 ≤ r² ≤ 1 (or 0% to 100%)
  • r² = 0: x explains none of the variation in y
  • r² = 1: x explains all of the variation in y (perfect fit)
  • r² = 0.75: x explains 75% of the variation in y; 25% is unexplained

Note: r² is always positive (even when r is negative) because it's a squared value.

Example: Calculating and Interpreting r²

From our study hours example:

  • r = 0.993
  • r² = (0.993)² = 0.986 = 98.6%

Interpretation: 98.6% of the variation in test scores is explained by the linear relationship with study hours. Only 1.4% of the variation is due to other factors (aptitude, prior knowledge, sleep, etc.).

This tells us: Study hours is an excellent predictor of test scores in this dataset!
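As a sanity check, r and r² can be recomputed from the deviation sums (a Python sketch; the unrounded r² is about 0.987, while the 98.6% above comes from squaring the rounded r = 0.993):

```python
# Correlation and coefficient of determination from scratch.
import math

x = [2, 4, 6, 8, 10]
y = [65, 75, 80, 90, 95]
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)    # correlation coefficient
print(round(r, 3), round(r ** 2, 3))  # → 0.993 0.987
```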

More r² Examples

Example 1: Height and weight, r = 0.70

  • r² = (0.70)² = 0.49 = 49%
  • Interpretation: 49% of variation in weight is explained by height. 51% is due to other factors (muscle mass, diet, genetics, etc.).

Example 2: Car age and value, r = -0.90

  • r² = (-0.90)² = 0.81 = 81%
  • Interpretation: 81% of variation in car value is explained by age. 19% is due to other factors (condition, mileage, make/model, etc.).
  • Note: r² is positive even though r is negative!

Example 3: Shoe size and salary, r = 0.25

  • r² = (0.25)² = 0.0625 = 6.25%
  • Interpretation: Only 6.25% of variation in salary is "explained" by shoe size. This weak relationship is likely spurious or due to confounding variables.

Check Your Understanding

Question 1: The regression equation for predicting car price (in thousands) from age (in years) is ŷ = 25 - 2.5x. Interpret the slope.

Slope = -2.5

Interpretation: For each additional year of age, the car's price decreases by $2,500 on average (or 2.5 thousand dollars).

The negative sign indicates that price decreases as age increases.

Question 2: Using the equation from Question 1, predict the price of a 6-year-old car.

Solution:

ŷ = 25 - 2.5(6) = 25 - 15 = 10 thousand dollars = $10,000

Interpretation: We predict a 6-year-old car will be worth approximately $10,000.

Question 3: If a 6-year-old car is actually worth $12,000, what is the residual?

Solution:

Residual = y - ŷ = 12 - 10 = +2 thousand dollars

Interpretation: The actual price is $2,000 higher than predicted. This car is worth more than the model predicts (perhaps it's well-maintained, low mileage, or a desirable model).

The positive residual means the point is above the regression line.

Question 4: If r = 0.60 for the relationship between exercise hours and weight loss, what is r²? Interpret it.

Solution:

r² = (0.60)² = 0.36 = 36%

Interpretation: 36% of the variation in weight loss is explained by exercise hours. The remaining 64% is due to other factors such as diet, metabolism, genetics, sleep, etc.

This tells us that exercise is a meaningful factor in weight loss, but it's not the only factor - diet and other variables matter significantly too.
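The arithmetic in Questions 2-4 can be double-checked in a few lines of Python (a quick sketch, not part of the exercises):

```python
# Verifying the Check Your Understanding answers.
price_6yr = 25 - 2.5 * 6   # Q2: predicted price in thousands
residual = 12 - price_6yr  # Q3: actual minus predicted, in thousands
r_squared = 0.60 ** 2      # Q4: coefficient of determination

print(price_6yr, residual, r_squared)  # → 10.0 2.0 0.36
```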

← Lesson 1: Introduction Next: Lesson 3 - Hypothesis Testing →