Module 11 Study Guide
Simple Linear Regression - Complete Reference
1. Correlation and Scatter Plots
Bivariate Data
- Bivariate data: Pairs of observations on two quantitative variables
- Explanatory variable (x): Independent, predictor variable
- Response variable (y): Dependent, outcome variable
Scatter Plots
Visual display of bivariate data with x on horizontal axis, y on vertical axis.
Describe patterns using:
- Direction: Positive, negative, or no association
- Form: Linear, curved, or no pattern
- Strength: How closely points follow the pattern
Correlation Coefficient (r)
Formula:
r = Σ[(x - x̄)(y - ȳ)] / √[Σ(x - x̄)² × Σ(y - ȳ)²]
Properties of r:
- Range: -1 ≤ r ≤ 1
- Unitless: No units, can compare across contexts
- Symmetric: r(x,y) = r(y,x)
- Linear only: Measures LINEAR relationships only
- Sensitive to outliers: Extreme points can change r dramatically
| r Value | Interpretation |
|---|---|
| r = 1 | Perfect positive linear relationship |
| 0.7 ≤ r < 1 | Strong positive linear relationship |
| 0.3 ≤ r < 0.7 | Moderate positive linear relationship |
| 0 < r < 0.3 | Weak positive linear relationship |
| r = 0 | No linear relationship |
| -0.3 < r < 0 | Weak negative linear relationship |
| -0.7 < r ≤ -0.3 | Moderate negative linear relationship |
| -1 < r ≤ -0.7 | Strong negative linear relationship |
| r = -1 | Perfect negative linear relationship |
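The definition of r can be checked directly with a few lines of NumPy. This is a minimal sketch using made-up data (hours studied vs. exam score); the variable names are illustrative, not from any assigned dataset.

```python
import numpy as np

# Hypothetical bivariate data: x = hours studied, y = exam score
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 60.0, 71.0, 75.0, 88.0])

# r = Σ[(x - x̄)(y - ȳ)] / √[Σ(x - x̄)² × Σ(y - ȳ)²]
dx = x - x.mean()
dy = y - y.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# np.corrcoef computes the same quantity
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```

For this data r ≈ 0.991: a strong positive linear relationship by the table above. Note that swapping x and y leaves r unchanged (the symmetry property).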
CORRELATION ≠ CAUSATION
Just because two variables are correlated does NOT mean one causes the other!
Possible explanations for correlation:
- x causes y
- y causes x (reverse causation)
- Third variable causes both (confounding)
- Coincidence (spurious correlation)
To establish causation: Need a well-designed randomized controlled experiment
2. The Regression Equation
Simple Linear Regression Model
Regression Equation:
ŷ = b₀ + b₁x
Where:
- ŷ (y-hat) = predicted value of y
- b₀ = y-intercept (predicted y when x = 0)
- b₁ = slope (change in ŷ for 1-unit increase in x)
- x = value of explanatory variable
Calculating Slope and Intercept
Slope:
b₁ = r × (sᵧ / sₓ)
OR
b₁ = Σ[(x - x̄)(y - ȳ)] / Σ(x - x̄)²
Intercept:
b₀ = ȳ - b₁x̄
Interpreting Slope and Intercept
Slope (b₁): "For each 1-[unit of x] increase in [x variable], [y variable] increases/decreases by [|b₁|] [units of y] on average."
Intercept (b₀): "When [x variable] = 0, the predicted [y variable] is [b₀] [units of y]."
Note: The intercept is only meaningful if x = 0 falls within the data range and makes sense in context.
Residuals
Residual:
residual = y - ŷ (actual minus predicted)
- Positive residual: Point above line (actual > predicted)
- Negative residual: Point below line (actual < predicted)
- Sum of residuals = 0 (always, for least-squares line)
Standard Error of the Estimate
sₑ = √[Σ(y - ŷ)² / (n - 2)]
Measures the typical size of the prediction errors. Smaller sₑ = better fit.
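The pieces above (slope, intercept, residuals, sₑ) fit together in a short NumPy sketch. The data here is hypothetical, reusing the hours-studied/exam-score idea.

```python
import numpy as np

# Hypothetical data: x = hours studied, y = exam score
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 60.0, 71.0, 75.0, 88.0])

# Slope and intercept from the deviation formulas
dx = x - x.mean()
b1 = np.sum(dx * (y - y.mean())) / np.sum(dx**2)  # b1 = Σ(x-x̄)(y-ȳ) / Σ(x-x̄)²
b0 = y.mean() - b1 * x.mean()                     # b0 = ȳ - b1·x̄

y_hat = b0 + b1 * x          # predicted values from ŷ = b0 + b1·x
residuals = y - y_hat        # residual = actual - predicted

# Residuals of the least-squares line always sum to 0 (up to rounding)
assert np.isclose(residuals.sum(), 0.0)

# Standard error of the estimate: sₑ = √[Σ(y - ŷ)² / (n - 2)]
se = np.sqrt(np.sum(residuals**2) / (len(x) - 2))
```

For this data the fitted line is ŷ = 43.1 + 8.7x, so each extra hour studied predicts about 8.7 more points, on average.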
Coefficient of Determination (r²)
Interpretation: The proportion of the variation in y explained by the linear relationship with x
- Range: 0 ≤ r² ≤ 1 (or 0% to 100%)
- r² = 0.75 means 75% of variation in y is explained by x
- Always positive (even when r is negative)
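The "proportion of variation explained" interpretation can be verified numerically: r² equals 1 − (residual variation)/(total variation). A minimal sketch, again with hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 60.0, 71.0, 75.0, 88.0])

r = np.corrcoef(x, y)[0, 1]

# Fit the least-squares line using b1 = r·(sy/sx)
b1 = r * (y.std(ddof=1) / x.std(ddof=1))
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ss_res = np.sum((y - y_hat) ** 2)     # variation NOT explained by the line
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
r_squared = 1 - ss_res / ss_tot

# r² really is the square of the correlation coefficient
assert np.isclose(r_squared, r ** 2)
```

Here r² ≈ 0.98, so about 98% of the variation in y is explained by x, and r² is positive regardless of the sign of r.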
Interpolation vs Extrapolation
| Interpolation | Extrapolation |
|---|---|
| Predicting within data range | Predicting outside data range |
| Generally safe and reliable | Risky; the linear pattern may not continue outside the data |
| Example: Data from x = 10 to 50, predict at x = 30 | Example: Data from x = 10 to 50, predict at x = 100 |
3. Hypothesis Testing in Regression
Testing Correlation (ρ)
Hypotheses:
- H₀: ρ = 0 (No linear relationship)
- Hₐ: ρ ≠ 0 (Linear relationship exists)
Test Statistic:
t = r√[(n - 2) / (1 - r²)]
df = n - 2
Testing Slope (β₁)
Hypotheses:
- H₀: β₁ = 0 (No linear relationship)
- Hₐ: β₁ ≠ 0 (Linear relationship exists)
Test Statistic:
t = b₁ / SE(b₁)
Where: SE(b₁) = sₑ / √[Σ(x - x̄)²]
df = n - 2
Important: Equivalent Tests
Testing ρ = 0 is equivalent to testing β₁ = 0
- Same t-value (the two formulas are algebraically identical)
- Same p-value
- Same conclusion
- Both ask: "Is there a significant linear relationship?"
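The equivalence of the two tests can be demonstrated by computing both t statistics on the same (hypothetical) data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 60.0, 71.0, 75.0, 88.0])
n = len(x)

dx, dy = x - x.mean(), y - y.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))
b1 = np.sum(dx * dy) / np.sum(dx**2)
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
se = np.sqrt(np.sum(resid**2) / (n - 2))   # standard error of the estimate
se_b1 = se / np.sqrt(np.sum(dx**2))        # SE(b1) = sₑ / √[Σ(x - x̄)²]

t_slope = b1 / se_b1                            # t = b1 / SE(b1)
t_corr = r * np.sqrt((n - 2) / (1 - r**2))      # t = r√[(n-2)/(1-r²)]

# Same t, same df = n - 2, so the same p-value and conclusion
assert np.isclose(t_slope, t_corr)
```

In practice, software such as `scipy.stats.linregress` reports this slope, its standard error, and the two-sided p-value directly; the sketch above just shows where those numbers come from.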
LINE Conditions for Regression Inference
| Condition | What to Check | How to Check |
|---|---|---|
| L - Linearity | Relationship is linear | Scatter plot shows linear pattern; residual plot shows no curve |
| I - Independence | Observations are independent | Based on study design (random sampling, no time series) |
| N - Normality | Residuals are normally distributed | Histogram of residuals roughly bell-shaped; less critical if n ≥ 30 |
| E - Equal Variance | Constant spread of residuals | Residual plot shows constant vertical spread (no fan shape) |
Residual Plots
Good residual plot: Random scatter around zero with constant spread
Problems to look for:
- Curved pattern: Linearity violated
- Fan shape: Equal variance violated
- Outliers: May affect regression
4. Confidence and Prediction Intervals
Confidence Interval for Slope (β₁)
b₁ ± t* × SE(b₁), with df = n - 2
Interpretation: We are C% confident that the true slope β₁ lies in this interval.
Connection to hypothesis testing: If interval doesn't contain 0, reject H₀: β₁ = 0
Confidence Interval for Mean Response
Purpose: Estimate the AVERAGE y value for all individuals at x = x*
ŷ ± t* × SE(ŷmean)
Where: SE(ŷmean) = sₑ√[1/n + (x* - x̄)²/Σ(x - x̄)²]
Prediction Interval for Individual Response
Purpose: Predict a SINGLE y value for one individual at x = x*
ŷ ± t* × SE(ŷind)
Where: SE(ŷind) = sₑ√[1 + 1/n + (x* - x̄)²/Σ(x - x̄)²]
Note the "1 +" under the square root! This accounts for individual variation around the mean.
Comparing Confidence vs Prediction Intervals
| Aspect | Confidence Interval | Prediction Interval |
|---|---|---|
| Estimates | Mean y for all at x* | Single y for one at x* |
| Width | Narrower | Wider |
| SE formula | sₑ√[1/n + ...] | sₑ√[1 + 1/n + ...] |
| Why? | Averages are predictable | Individuals vary more |
| Example | "Average price of 2000-sq-ft houses" | "Price of this specific house" |
Why is PI wider than CI?
Prediction interval accounts for two sources of uncertainty:
- Uncertainty in estimating the mean (same as CI)
- Individual variation around the mean (the "1 +" term)
Rule: Prediction Interval is ALWAYS wider than Confidence Interval for the same x*
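Both intervals can be built from the same fitted line; only the standard error changes. A minimal sketch with hypothetical data, using a 95% level (the critical value t* = 3.182 for df = 3 is hard-coded here to keep the sketch NumPy-only; `scipy.stats.t.ppf(0.975, 3)` would compute it):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 60.0, 71.0, 75.0, 88.0])
n = len(x)

dx = x - x.mean()
b1 = np.sum(dx * (y - y.mean())) / np.sum(dx**2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
se = np.sqrt(np.sum(resid**2) / (n - 2))

x_star = 3.5                       # value at which we predict
y_hat = b0 + b1 * x_star
t_crit = 3.182                     # t* for 95% confidence, df = n - 2 = 3

common = 1/n + (x_star - x.mean())**2 / np.sum(dx**2)
se_mean = se * np.sqrt(common)       # SE for the mean response (CI)
se_ind = se * np.sqrt(1 + common)    # SE for an individual (PI): extra "1 +"

ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
pi = (y_hat - t_crit * se_ind, y_hat + t_crit * se_ind)

# The PI is always wider than the CI at the same x*
assert se_ind > se_mean
```

Both intervals are centered at the same ŷ; the prediction interval simply adds the individual-variation term, which is why it is always the wider of the two.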
Complete Formula Sheet
| Concept | Formula |
|---|---|
| Correlation coefficient | r = Σ[(x - x̄)(y - ȳ)] / √[Σ(x - x̄)² × Σ(y - ȳ)²] |
| Slope | b₁ = r × (sᵧ / sₓ) OR b₁ = Σ[(x - x̄)(y - ȳ)] / Σ(x - x̄)² |
| Intercept | b₀ = ȳ - b₁x̄ |
| Regression equation | ŷ = b₀ + b₁x |
| Residual | y - ŷ |
| Standard error of estimate | sₑ = √[Σ(y - ŷ)² / (n - 2)] |
| Coefficient of determination | r² |
| Test statistic for ρ | t = r√[(n-2)/(1-r²)], df = n - 2 |
| Standard error of slope | SE(b₁) = sₑ / √[Σ(x - x̄)²] |
| Test statistic for β₁ | t = b₁ / SE(b₁), df = n - 2 |
| CI for slope | b₁ ± t* × SE(b₁) |
| SE for mean response | SE(ŷmean) = sₑ√[1/n + (x* - x̄)²/Σ(x - x̄)²] |
| CI for mean response | ŷ ± t* × SE(ŷmean) |
| SE for individual response | SE(ŷind) = sₑ√[1 + 1/n + (x* - x̄)²/Σ(x - x̄)²] |
| Prediction interval | ŷ ± t* × SE(ŷind) |
Step-by-Step Procedures
Procedure: Finding the Regression Equation
- Calculate means: x̄ and ȳ
- Calculate standard deviations: sₓ and sᵧ (or calculate deviations)
- Calculate correlation r
- Calculate slope: b₁ = r × (sᵧ / sₓ)
- Calculate intercept: b₀ = ȳ - b₁x̄
- Write equation: ŷ = b₀ + b₁x
Procedure: Hypothesis Test for Slope
- State hypotheses: H₀: β₁ = 0, Hₐ: β₁ ≠ 0 (or one-sided)
- Check LINE conditions
- Calculate test statistic: t = b₁ / SE(b₁)
- Find p-value using t-distribution with df = n - 2
- Make decision: If p < α, reject H₀
- State conclusion in context
Procedure: Constructing Prediction Interval
- Calculate predicted value: ŷ at x = x*
- Calculate SE(ŷind) = sₑ√[1 + 1/n + (x* - x̄)²/Σ(x - x̄)²]
- Find critical value t* for desired confidence level with df = n - 2
- Calculate margin of error: ME = t* × SE(ŷind)
- Construct interval: ŷ ± ME
- Interpret in context
Common Mistakes and Tips
Common Mistakes
- Confusing correlation with causation - Always remember: correlation ≠ causation!
- Using r when r² is asked for (or vice versa) - They're different! r² = r × r
- Forgetting to square r when calculating r²
- Thinking negative r means weak relationship - Strength is |r|, sign is direction
- Extrapolating beyond data range - Very risky!
- Confusing slope and intercept - Slope is rate of change, intercept is value at x = 0
- Wrong residual sign - Remember: residual = y - ŷ (actual minus predicted)
- Forgetting LINE conditions - Must check before inference!
- Using wrong df - Always df = n - 2 for regression
- Confusing CI and PI - CI for mean, PI for individual; PI is wider
Study Tips
- Always look at scatter plot first before calculating r
- Check units in your interpretations
- Practice interpreting - exams often ask for interpretations, not just calculations
- Memorize LINE acronym for conditions
- Remember the "1 +" in prediction interval formula
- Use technology for calculations in practice, but know the concepts
- Practice reading regression output from software