Learn Without Walls

Module 11 Study Guide

Simple Linear Regression - Complete Reference



Table of Contents

  1. Correlation and Scatter Plots
  2. The Regression Equation
  3. Hypothesis Testing in Regression
  4. Confidence and Prediction Intervals
  5. Complete Formula Sheet
  6. Step-by-Step Procedures
  7. Common Mistakes and Tips

1. Correlation and Scatter Plots

Bivariate Data

Data in which two quantitative variables are recorded for each individual: an explanatory variable (x) and a response variable (y).

Scatter Plots

Visual display of bivariate data with x on the horizontal axis and y on the vertical axis.

Describe patterns using:

  • Direction: positive or negative
  • Form: linear or curved
  • Strength: strong, moderate, or weak
  • Unusual features: outliers or clusters

Correlation Coefficient (r)

Formula:

r = Σ[(x - x̄)(y - ȳ)] / √[Σ(x - x̄)² × Σ(y - ȳ)²]

Properties of r:

  • Always between -1 and 1: -1 ≤ r ≤ 1
  • Unitless, and unchanged if x and y are swapped or units are rescaled
  • Measures the strength and direction of a LINEAR relationship only
  • Sensitive to outliers

r Value             Interpretation
r = 1               Perfect positive linear relationship
0.7 ≤ r < 1         Strong positive linear relationship
0.3 ≤ r < 0.7       Moderate positive linear relationship
0 < r < 0.3         Weak positive linear relationship
r = 0               No linear relationship
-0.3 < r < 0        Weak negative linear relationship
-0.7 < r ≤ -0.3     Moderate negative linear relationship
-1 < r ≤ -0.7       Strong negative linear relationship
r = -1              Perfect negative linear relationship
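As a concrete check, the formula for r can be computed directly from the deviation sums. This is a minimal sketch using a small hypothetical dataset (the values are for demonstration only):

```python
import math

# Small hypothetical dataset, for demonstration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Numerator: sum of cross-products of deviations
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
# Denominator: square root of the product of the sums of squared deviations
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 4))  # 0.7746 -- a strong positive linear relationship
```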

CORRELATION ≠ CAUSATION

Just because two variables are correlated does NOT mean one causes the other!

Possible explanations for correlation:

  • x causes y
  • y causes x (reverse causation)
  • Third variable causes both (confounding)
  • Coincidence (spurious correlation)

To establish causation: Need a well-designed randomized controlled experiment

2. The Regression Equation

Simple Linear Regression Model

Regression Equation:

ŷ = b₀ + b₁x

Where:

  • ŷ (y-hat) = predicted value of y
  • b₀ = y-intercept (predicted y when x = 0)
  • b₁ = slope (change in ŷ for 1-unit increase in x)
  • x = value of explanatory variable

Calculating Slope and Intercept

Slope:

b₁ = r × (sᵧ / sₓ)

OR

b₁ = Σ[(x - x̄)(y - ȳ)] / Σ(x - x̄)²

Intercept:

b₀ = ȳ - b₁x̄
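Both slope formulas give the same answer, which a quick computation can confirm. A sketch using the same kind of small hypothetical dataset as above:

```python
import math
import statistics

# Small hypothetical dataset, for demonstration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
r = s_xy / math.sqrt(s_xx * s_yy)

# Slope via b1 = r * (s_y / s_x), using sample standard deviations
b1_from_r = r * (statistics.stdev(y) / statistics.stdev(x))
# Slope via the deviation-sums formula
b1_from_sums = s_xy / s_xx

# Intercept: b0 = y_bar - b1 * x_bar
b0 = y_bar - b1_from_sums * x_bar
print(round(b1_from_r, 4), round(b1_from_sums, 4), round(b0, 4))  # 0.6 0.6 2.2
```

Either slope formula works; the r-based version is convenient when r, sₓ, and sᵧ are already known, while the deviation-sums version avoids computing r first.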

Interpreting Slope and Intercept

Slope (b₁): "For each 1-[unit of x] increase in [x variable], [y variable] increases/decreases by [|b₁|] [units of y] on average."

Intercept (b₀): "When [x variable] = 0, the predicted [y variable] is [b₀] [units of y]."

Note: Intercept only meaningful if x = 0 is in data range and makes sense contextually.

Residuals

Residual:

Residual = y - ŷ = Observed - Predicted
  • Positive residual: Point above line (actual > predicted)
  • Negative residual: Point below line (actual < predicted)
  • Sum of residuals = 0 (always, for least-squares line)

Standard Error of the Estimate

sₑ = √[Σ(y - ŷ)² / (n - 2)]

Measures typical size of prediction errors. Smaller sₑ = better fit.
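The residual definitions and sₑ can be checked numerically. A sketch with a hypothetical fitted line (ŷ = 2.2 + 0.6x, from a small illustrative dataset):

```python
import math

# Hypothetical dataset and fitted line, for demonstration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6  # fitted intercept and slope: y-hat = 2.2 + 0.6x

y_hat = [b0 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]  # observed minus predicted

print([round(res, 4) for res in residuals])  # [-0.8, 0.6, 1.0, -0.6, -0.2]
print(sum(residuals))  # zero, up to floating-point rounding

# Standard error of the estimate: typical size of a prediction error
n = len(x)
sse = sum(res ** 2 for res in residuals)
s_e = math.sqrt(sse / (n - 2))
print(round(s_e, 4))  # 0.8944
```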

Coefficient of Determination (r²)

r² = (correlation coefficient)²

Interpretation: Proportion of variation in y explained by x

  • Range: 0 ≤ r² ≤ 1 (or 0% to 100%)
  • r² = 0.75 means 75% of variation in y is explained by x
  • Never negative, even when r is negative (squaring removes the sign)

Interpolation vs Extrapolation

Interpolation: predicting within the data range. Generally safe and reliable. Example: with data from x = 10 to 50, predict at x = 30.

Extrapolation: predicting outside the data range. Risky and potentially unreliable. Example: with data from x = 10 to 50, predict at x = 100.

3. Hypothesis Testing in Regression

Testing Correlation (ρ)

Hypotheses:

  • H₀: ρ = 0 (No linear relationship)
  • Hₐ: ρ ≠ 0 (Linear relationship exists)

Test Statistic:

t = r√[(n-2)/(1-r²)]

df = n - 2

Testing Slope (β₁)

Hypotheses:

  • H₀: β₁ = 0 (No linear relationship)
  • Hₐ: β₁ ≠ 0 (Linear relationship exists)

Test Statistic:

t = (b₁ - 0) / SE(b₁)

Where: SE(b₁) = sₑ / √[Σ(x - x̄)²]

df = n - 2

Important: Equivalent Tests

Testing ρ = 0 is equivalent to testing β₁ = 0

  • Same t-value (the two formulas are algebraically identical)
  • Same p-value
  • Same conclusion
  • Both ask: "Is there a significant linear relationship?"
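The equivalence of the two test statistics can be verified numerically. A sketch using a small hypothetical dataset (values for demonstration only):

```python
import math

# Small hypothetical dataset, for demonstration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
r = s_xy / math.sqrt(s_xx * s_yy)

# Fit the line, then compute s_e and SE(b1)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s_e = math.sqrt(sum(res ** 2 for res in residuals) / (n - 2))
se_b1 = s_e / math.sqrt(s_xx)

# t-statistic from the correlation test and from the slope test
t_from_r = r * math.sqrt((n - 2) / (1 - r ** 2))
t_from_slope = b1 / se_b1
print(round(t_from_r, 4), round(t_from_slope, 4))  # 2.1213 2.1213 -- identical
```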

LINE Conditions for Regression Inference

  • L - Linearity: the relationship is linear. Check: scatter plot shows a linear pattern; residual plot shows no curve.
  • I - Independence: observations are independent. Check: based on study design (random sampling, no time series).
  • N - Normality: residuals are normally distributed. Check: histogram of residuals is roughly bell-shaped; less critical if n ≥ 30.
  • E - Equal Variance: residuals have constant spread. Check: residual plot shows constant vertical spread (no fan shape).

Residual Plots

Good residual plot: Random scatter around zero with constant spread

Problems to look for:

  • Curved pattern: Linearity violated
  • Fan shape: Equal variance violated
  • Outliers: May affect regression

4. Confidence and Prediction Intervals

Confidence Interval for Slope (β₁)

b₁ ± t* × SE(b₁)

Interpretation: We are C% confident that the true slope is in this interval.

Connection to hypothesis testing: If interval doesn't contain 0, reject H₀: β₁ = 0
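Constructing the interval is a single multiply-and-add once SE(b₁) is in hand. A sketch using hypothetical summary values (b₁ = 0.6, SE(b₁) ≈ 0.2828, n = 5, all from a small illustrative dataset; the t* value is read from a t-table):

```python
# Hypothetical summary values, for demonstration only
b1 = 0.6        # estimated slope
se_b1 = 0.2828  # standard error of the slope, s_e / sqrt(Sxx)
t_star = 3.182  # t critical value for 95% confidence, df = n - 2 = 3 (t-table)

margin = t_star * se_b1
ci = (b1 - margin, b1 + margin)
print(tuple(round(v, 3) for v in ci))  # about (-0.3, 1.5)

# The interval contains 0, so at the 5% level we fail to reject H0: beta1 = 0
```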

Confidence Interval for Mean Response

Purpose: Estimate AVERAGE y value for all individuals at x*

ŷ ± t* × SE(ŷmean)

Where:

SE(ŷmean) = sₑ√[1/n + (x* - x̄)² / Σ(x - x̄)²]

Prediction Interval for Individual Response

Purpose: Predict a SINGLE y value at x*

ŷ ± t* × SE(ŷind)

Where:

SE(ŷind) = sₑ√[1 + 1/n + (x* - x̄)² / Σ(x - x̄)²]

Note the "1 +" at the beginning! This accounts for individual variation.

Comparing Confidence vs Prediction Intervals

Aspect        Confidence Interval                     Prediction Interval
Estimates     Mean y for all individuals at x*        Single y for one individual at x*
Width         Narrower                                Wider
SE formula    sₑ√[1/n + ...]                          sₑ√[1 + 1/n + ...]
Why?          Averages are more predictable           Individuals vary more
Example       "Average price of 2000-sq-ft houses"    "Price of this specific house"

Why is PI wider than CI?

Prediction interval accounts for two sources of uncertainty:

  1. Uncertainty in estimating the mean (same as CI)
  2. Individual variation around the mean (the "1 +" term)

Rule: Prediction Interval is ALWAYS wider than Confidence Interval for the same x*
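Computing both intervals at the same x* makes the width difference concrete. A sketch using hypothetical summary values (n = 5, x̄ = 3, Σ(x - x̄)² = 10, sₑ ≈ 0.8944, fitted line ŷ = 2.2 + 0.6x, all from a small illustrative dataset; t* from a t-table):

```python
import math

# Hypothetical summary values, for demonstration only
n, x_bar, s_xx, s_e = 5, 3.0, 10.0, 0.8944
x_star = 4.0
y_hat = 2.2 + 0.6 * x_star          # predicted value at x* = 4
t_star = 3.182                      # 95% confidence, df = n - 2 = 3 (t-table)

# Shared term under both square roots
leverage = 1 / n + (x_star - x_bar) ** 2 / s_xx

se_mean = s_e * math.sqrt(leverage)      # SE for the mean response
se_ind = s_e * math.sqrt(1 + leverage)   # SE for an individual response: note the "1 +"

ci = (y_hat - t_star * se_mean, y_hat + t_star * se_mean)
pi = (y_hat - t_star * se_ind, y_hat + t_star * se_ind)
print(ci)  # narrower: interval for the MEAN response at x* = 4
print(pi)  # wider: interval for a SINGLE response at x* = 4
```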

Complete Formula Sheet

Concept                          Formula
Correlation coefficient          r = Σ[(x - x̄)(y - ȳ)] / √[Σ(x - x̄)² × Σ(y - ȳ)²]
Slope                            b₁ = r × (sᵧ / sₓ) OR b₁ = Σ[(x - x̄)(y - ȳ)] / Σ(x - x̄)²
Intercept                        b₀ = ȳ - b₁x̄
Regression equation              ŷ = b₀ + b₁x
Residual                         y - ŷ
Standard error of estimate       sₑ = √[Σ(y - ŷ)² / (n - 2)]
Coefficient of determination     r² = (correlation coefficient)²
Test statistic for ρ             t = r√[(n-2)/(1-r²)], df = n - 2
Standard error of slope          SE(b₁) = sₑ / √[Σ(x - x̄)²]
Test statistic for β₁            t = b₁ / SE(b₁), df = n - 2
CI for slope                     b₁ ± t* × SE(b₁)
SE for mean response             SE(ŷmean) = sₑ√[1/n + (x* - x̄)²/Σ(x - x̄)²]
CI for mean response             ŷ ± t* × SE(ŷmean)
SE for individual response       SE(ŷind) = sₑ√[1 + 1/n + (x* - x̄)²/Σ(x - x̄)²]
Prediction interval              ŷ ± t* × SE(ŷind)

Step-by-Step Procedures

Procedure: Finding the Regression Equation

  1. Calculate means: x̄ and ȳ
  2. Calculate standard deviations: sₓ and sᵧ (or calculate deviations)
  3. Calculate correlation r
  4. Calculate slope: b₁ = r × (sᵧ / sₓ)
  5. Calculate intercept: b₀ = ȳ - b₁x̄
  6. Write equation: ŷ = b₀ + b₁x

Procedure: Hypothesis Test for Slope

  1. State hypotheses: H₀: β₁ = 0, Hₐ: β₁ ≠ 0 (or one-sided)
  2. Check LINE conditions
  3. Calculate test statistic: t = b₁ / SE(b₁)
  4. Find p-value using t-distribution with df = n - 2
  5. Make decision: If p < α, reject H₀
  6. State conclusion in context

Procedure: Constructing Prediction Interval

  1. Calculate predicted value: ŷ at x = x*
  2. Calculate SE(ŷind) = sₑ√[1 + 1/n + (x* - x̄)²/Σ(x - x̄)²]
  3. Find critical value t* for desired confidence level with df = n - 2
  4. Calculate margin of error: ME = t* × SE(ŷind)
  5. Construct interval: ŷ ± ME
  6. Interpret in context

Common Mistakes and Tips

Common Mistakes

  1. Confusing correlation with causation - Always remember: correlation ≠ causation!
  2. Using r when r² is asked for (or vice versa) - They're different! r² = r × r
  3. Forgetting to square r when calculating r²
  4. Thinking negative r means weak relationship - Strength is |r|, sign is direction
  5. Extrapolating beyond data range - Very risky!
  6. Confusing slope and intercept - Slope is rate of change, intercept is value at x = 0
  7. Wrong residual sign - Remember: residual = y - ŷ (actual minus predicted)
  8. Forgetting LINE conditions - Must check before inference!
  9. Using wrong df - Always df = n - 2 for regression
  10. Confusing CI and PI - CI for mean, PI for individual; PI is wider

Study Tips

  • Always look at scatter plot first before calculating r
  • Check units in your interpretations
  • Practice interpreting - exams often ask for interpretations, not just calculations
  • Memorize LINE acronym for conditions
  • Remember the "1 +" in prediction interval formula
  • Use technology for calculations in practice, but know the concepts
  • Practice reading regression output from software