Learn Without Walls

Module 11 Study Guide

Simple Linear Regression - Complete Reference



Table of Contents

  1. Correlation and Scatter Plots
  2. The Regression Equation
  3. Hypothesis Testing in Regression
  4. Confidence and Prediction Intervals
  5. Complete Formula Sheet
  6. Step-by-Step Procedures
  7. Common Mistakes and Tips

1. Correlation and Scatter Plots

Bivariate Data

Data in which two quantitative variables are recorded for each individual: an explanatory variable (x) and a response variable (y).

Scatter Plots

Visual display of bivariate data with x on the horizontal axis and y on the vertical axis.

Describe patterns using:

  • Direction: positive or negative
  • Form: linear or curved
  • Strength: strong, moderate, or weak
  • Unusual features: outliers or clusters

Correlation Coefficient (r)

Formula:

r = Σ[(x - x̄)(y - ȳ)] / √[Σ(x - x̄)² × Σ(y - ȳ)²]

Properties of r:

  • Always between -1 and 1: -1 ≤ r ≤ 1
  • Unitless, and unchanged if x and y are swapped or units are rescaled
  • Measures the strength and direction of a LINEAR relationship only
  • Sensitive to outliers

r Value             Interpretation
r = 1               Perfect positive linear relationship
0.7 ≤ r < 1         Strong positive linear relationship
0.3 ≤ r < 0.7       Moderate positive linear relationship
0 < r < 0.3         Weak positive linear relationship
r = 0               No linear relationship
-0.3 < r < 0        Weak negative linear relationship
-0.7 < r ≤ -0.3     Moderate negative linear relationship
-1 < r ≤ -0.7       Strong negative linear relationship
r = -1              Perfect negative linear relationship
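As a concrete check, the formula for r can be computed directly from the deviation sums. This is a minimal sketch using a small hypothetical dataset (the values are for demonstration only):

```python
import math

# Small hypothetical dataset, for demonstration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Numerator: sum of cross-products of deviations
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
# Denominator: square root of the product of the sums of squared deviations
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 4))  # 0.7746 -- a strong positive linear relationship
```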

CORRELATION ≠ CAUSATION

Just because two variables are correlated does NOT mean one causes the other!

Possible explanations for correlation:

  • x causes y
  • y causes x (reverse causation)
  • Third variable causes both (confounding)
  • Coincidence (spurious correlation)

To establish causation: Need a well-designed randomized controlled experiment

2. The Regression Equation

Simple Linear Regression Model

Regression Equation:

ŷ = b₀ + b₁x

Where:

  • ŷ (y-hat) = predicted value of y
  • b₀ = y-intercept (predicted y when x = 0)
  • b₁ = slope (change in ŷ for 1-unit increase in x)
  • x = value of explanatory variable

Calculating Slope and Intercept

Slope:

b₁ = r × (sᵧ / sₓ)

OR

b₁ = Σ[(x - x̄)(y - ȳ)] / Σ(x - x̄)²

Intercept:

b₀ = ȳ - b₁x̄
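Both slope formulas give the same answer, which a quick computation can confirm. A sketch using the same kind of small hypothetical dataset as above:

```python
import math
import statistics

# Small hypothetical dataset, for demonstration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
r = s_xy / math.sqrt(s_xx * s_yy)

# Slope via b1 = r * (s_y / s_x), using sample standard deviations
b1_from_r = r * (statistics.stdev(y) / statistics.stdev(x))
# Slope via the deviation-sums formula
b1_from_sums = s_xy / s_xx

# Intercept: b0 = y_bar - b1 * x_bar
b0 = y_bar - b1_from_sums * x_bar
print(round(b1_from_r, 4), round(b1_from_sums, 4), round(b0, 4))  # 0.6 0.6 2.2
```

Either slope formula works; the r-based version is convenient when r, sₓ, and sᵧ are already known, while the deviation-sums version avoids computing r first.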

Interpreting Slope and Intercept

Slope (b₁): "For each 1-[unit of x] increase in [x variable], [y variable] increases/decreases by [|b₁|] [units of y] on average."

Intercept (b₀): "When [x variable] = 0, the predicted [y variable] is [b₀] [units of y]."

Note: Intercept only meaningful if x = 0 is in data range and makes sense contextually.

Residuals

Residual:

Residual = y - ŷ = Observed - Predicted
  • Positive residual: Point above line (actual > predicted)
  • Negative residual: Point below line (actual < predicted)
  • Sum of residuals = 0 (always, for least-squares line)

Standard Error of the Estimate

sₑ = √[Σ(y - ŷ)² / (n - 2)]

Measures typical size of prediction errors. Smaller sₑ = better fit.
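The residual definitions and sₑ can be checked numerically. A sketch with a hypothetical fitted line (ŷ = 2.2 + 0.6x, from a small illustrative dataset):

```python
import math

# Hypothetical dataset and fitted line, for demonstration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6  # fitted intercept and slope: y-hat = 2.2 + 0.6x

y_hat = [b0 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]  # observed minus predicted

print([round(res, 4) for res in residuals])  # [-0.8, 0.6, 1.0, -0.6, -0.2]
print(sum(residuals))  # zero, up to floating-point rounding

# Standard error of the estimate: typical size of a prediction error
n = len(x)
sse = sum(res ** 2 for res in residuals)
s_e = math.sqrt(sse / (n - 2))
print(round(s_e, 4))  # 0.8944
```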

Coefficient of Determination (r²)

r² = (correlation coefficient)²

Interpretation: Proportion of variation in y explained by x

  • Range: 0 ≤ r² ≤ 1 (or 0% to 100%)
  • r² = 0.75 means 75% of variation in y is explained by x
  • Never negative, even when r is negative (squaring removes the sign)

Interpolation vs Extrapolation

Interpolation: predicting within the data range. Generally safe and reliable. Example: with data from x = 10 to 50, predict at x = 30.

Extrapolation: predicting outside the data range. Risky and potentially unreliable. Example: with data from x = 10 to 50, predict at x = 100.

3. Hypothesis Testing in Regression

Testing Correlation (ρ)

Hypotheses:

  • H₀: ρ = 0 (No linear relationship)
  • Hₐ: ρ ≠ 0 (Linear relationship exists)

Test Statistic:

t = r√[(n-2)/(1-r²)]

df = n - 2

Testing Slope (β₁)

Hypotheses:

  • H₀: β₁ = 0 (No linear relationship)
  • Hₐ: β₁ ≠ 0 (Linear relationship exists)

Test Statistic:

t = (b₁ - 0) / SE(b₁)

Where: SE(b₁) = sₑ / √[Σ(x - x̄)²]

df = n - 2

Important: Equivalent Tests

Testing ρ = 0 is equivalent to testing β₁ = 0

  • Same t-value (the two formulas are algebraically identical)
  • Same p-value
  • Same conclusion
  • Both ask: "Is there a significant linear relationship?"
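The equivalence of the two test statistics can be verified numerically. A sketch using a small hypothetical dataset (values for demonstration only):

```python
import math

# Small hypothetical dataset, for demonstration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
r = s_xy / math.sqrt(s_xx * s_yy)

# Fit the line, then compute s_e and SE(b1)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s_e = math.sqrt(sum(res ** 2 for res in residuals) / (n - 2))
se_b1 = s_e / math.sqrt(s_xx)

# t-statistic from the correlation test and from the slope test
t_from_r = r * math.sqrt((n - 2) / (1 - r ** 2))
t_from_slope = b1 / se_b1
print(round(t_from_r, 4), round(t_from_slope, 4))  # 2.1213 2.1213 -- identical
```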

LINE Conditions for Regression Inference

  • L - Linearity: the relationship is linear. Check: scatter plot shows a linear pattern; residual plot shows no curve.
  • I - Independence: observations are independent. Check: based on study design (random sampling, no time series).
  • N - Normality: residuals are normally distributed. Check: histogram of residuals is roughly bell-shaped; less critical if n ≥ 30.
  • E - Equal Variance: residuals have constant spread. Check: residual plot shows constant vertical spread (no fan shape).

Residual Plots

Good residual plot: Random scatter around zero with constant spread

Problems to look for:

  • Curved pattern: Linearity violated
  • Fan shape: Equal variance violated
  • Outliers: May affect regression

4. Confidence and Prediction Intervals

Confidence Interval for Slope (β₁)

b₁ ± t* × SE(b₁)

Interpretation: We are C% confident that the true slope is in this interval.

Connection to hypothesis testing: If interval doesn't contain 0, reject H₀: β₁ = 0
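Constructing the interval is a single multiply-and-add once SE(b₁) is in hand. A sketch using hypothetical summary values (b₁ = 0.6, SE(b₁) ≈ 0.2828, n = 5, all from a small illustrative dataset; the t* value is read from a t-table):

```python
# Hypothetical summary values, for demonstration only
b1 = 0.6        # estimated slope
se_b1 = 0.2828  # standard error of the slope, s_e / sqrt(Sxx)
t_star = 3.182  # t critical value for 95% confidence, df = n - 2 = 3 (t-table)

margin = t_star * se_b1
ci = (b1 - margin, b1 + margin)
print(tuple(round(v, 3) for v in ci))  # about (-0.3, 1.5)

# The interval contains 0, so at the 5% level we fail to reject H0: beta1 = 0
```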

Confidence Interval for Mean Response

Purpose: Estimate AVERAGE y value for all individuals at x*

ŷ ± t* × SE(ŷmean)

Where:

SE(ŷmean) = sₑ√[1/n + (x* - x̄)² / Σ(x - x̄)²]

Prediction Interval for Individual Response

Purpose: Predict a SINGLE y value at x*

ŷ ± t* × SE(ŷind)

Where:

SE(ŷind) = sₑ√[1 + 1/n + (x* - x̄)² / Σ(x - x̄)²]

Note the "1 +" at the beginning! This accounts for individual variation.

Comparing Confidence vs Prediction Intervals

Aspect        Confidence Interval                     Prediction Interval
Estimates     Mean y for all individuals at x*        Single y for one individual at x*
Width         Narrower                                Wider
SE formula    sₑ√[1/n + ...]                          sₑ√[1 + 1/n + ...]
Why?          Averages are more predictable           Individuals vary more
Example       "Average price of 2000-sq-ft houses"    "Price of this specific house"

Why is PI wider than CI?

Prediction interval accounts for two sources of uncertainty:

  1. Uncertainty in estimating the mean (same as CI)
  2. Individual variation around the mean (the "1 +" term)

Rule: Prediction Interval is ALWAYS wider than Confidence Interval for the same x*
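Computing both intervals at the same x* makes the width difference concrete. A sketch using hypothetical summary values (n = 5, x̄ = 3, Σ(x - x̄)² = 10, sₑ ≈ 0.8944, fitted line ŷ = 2.2 + 0.6x, all from a small illustrative dataset; t* from a t-table):

```python
import math

# Hypothetical summary values, for demonstration only
n, x_bar, s_xx, s_e = 5, 3.0, 10.0, 0.8944
x_star = 4.0
y_hat = 2.2 + 0.6 * x_star          # predicted value at x* = 4
t_star = 3.182                      # 95% confidence, df = n - 2 = 3 (t-table)

# Shared term under both square roots
leverage = 1 / n + (x_star - x_bar) ** 2 / s_xx

se_mean = s_e * math.sqrt(leverage)      # SE for the mean response
se_ind = s_e * math.sqrt(1 + leverage)   # SE for an individual response: note the "1 +"

ci = (y_hat - t_star * se_mean, y_hat + t_star * se_mean)
pi = (y_hat - t_star * se_ind, y_hat + t_star * se_ind)
print(ci)  # narrower: interval for the MEAN response at x* = 4
print(pi)  # wider: interval for a SINGLE response at x* = 4
```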

Complete Formula Sheet

Concept                          Formula
Correlation coefficient          r = Σ[(x - x̄)(y - ȳ)] / √[Σ(x - x̄)² × Σ(y - ȳ)²]
Slope                            b₁ = r × (sᵧ / sₓ) OR b₁ = Σ[(x - x̄)(y - ȳ)] / Σ(x - x̄)²
Intercept                        b₀ = ȳ - b₁x̄
Regression equation              ŷ = b₀ + b₁x
Residual                         y - ŷ
Standard error of estimate       sₑ = √[Σ(y - ŷ)² / (n - 2)]
Coefficient of determination     r² = (correlation coefficient)²
Test statistic for ρ             t = r√[(n-2)/(1-r²)], df = n - 2
Standard error of slope          SE(b₁) = sₑ / √[Σ(x - x̄)²]
Test statistic for β₁            t = b₁ / SE(b₁), df = n - 2
CI for slope                     b₁ ± t* × SE(b₁)
SE for mean response             SE(ŷmean) = sₑ√[1/n + (x* - x̄)²/Σ(x - x̄)²]
CI for mean response             ŷ ± t* × SE(ŷmean)
SE for individual response       SE(ŷind) = sₑ√[1 + 1/n + (x* - x̄)²/Σ(x - x̄)²]
Prediction interval              ŷ ± t* × SE(ŷind)

Step-by-Step Procedures

Procedure: Finding the Regression Equation

  1. Calculate means: x̄ and ȳ
  2. Calculate standard deviations: sₓ and sᵧ (or calculate deviations)
  3. Calculate correlation r
  4. Calculate slope: b₁ = r × (sᵧ / sₓ)
  5. Calculate intercept: b₀ = ȳ - b₁x̄
  6. Write equation: ŷ = b₀ + b₁x

Procedure: Hypothesis Test for Slope

  1. State hypotheses: H₀: β₁ = 0, Hₐ: β₁ ≠ 0 (or one-sided)
  2. Check LINE conditions
  3. Calculate test statistic: t = b₁ / SE(b₁)
  4. Find p-value using t-distribution with df = n - 2
  5. Make decision: If p < α, reject H₀
  6. State conclusion in context

Procedure: Constructing Prediction Interval

  1. Calculate predicted value: ŷ at x = x*
  2. Calculate SE(ŷind) = sₑ√[1 + 1/n + (x* - x̄)²/Σ(x - x̄)²]
  3. Find critical value t* for desired confidence level with df = n - 2
  4. Calculate margin of error: ME = t* × SE(ŷind)
  5. Construct interval: ŷ ± ME
  6. Interpret in context

Common Mistakes and Tips

Common Mistakes

  1. Confusing correlation with causation - Always remember: correlation ≠ causation!
  2. Using r when r² is asked for (or vice versa) - They're different! r² = r × r
  3. Forgetting to square r when calculating r²
  4. Thinking negative r means weak relationship - Strength is |r|, sign is direction
  5. Extrapolating beyond data range - Very risky!
  6. Confusing slope and intercept - Slope is rate of change, intercept is value at x = 0
  7. Wrong residual sign - Remember: residual = y - ŷ (actual minus predicted)
  8. Forgetting LINE conditions - Must check before inference!
  9. Using wrong df - Always df = n - 2 for regression
  10. Confusing CI and PI - CI for mean, PI for individual; PI is wider

Study Tips

  • Always look at scatter plot first before calculating r
  • Check units in your interpretations
  • Practice interpreting - exams often ask for interpretations, not just calculations
  • Memorize LINE acronym for conditions
  • Remember the "1 +" in prediction interval formula
  • Use technology for calculations in practice, but know the concepts
  • Practice reading regression output from software