Correlation & Simple Linear Regression
Relationships between variables
📌 Before You Start
What you need: Modules 1–5 completed, or familiarity with descriptive stats and basic hypothesis testing.
What you’ll learn: How to measure the strength and direction of linear relationships with correlation. How to fit a simple linear regression model with lm(). How to interpret slope, intercept, and R². The critical warning: always visualize your data (Anscombe’s Quartet).
📖 The Concept: Correlation and Simple Linear Regression
Correlation (r) measures the strength and direction of a linear relationship between two variables.
- r = +1: perfect positive linear relationship
- r = −1: perfect negative linear relationship
- r = 0: no linear relationship (but could have a nonlinear one!)
- |r| > 0.7 is often called "strong"; |r| < 0.3 is "weak"
Simple linear regression goes further: it gives an equation to predict y from x. The slope tells you how much y changes per unit change in x. R² tells you what fraction of the variability in y is explained by x.
Critical warning: Correlation and regression measure linear relationships only. Two variables can have r = 0 but a strong nonlinear relationship. Always plot your data first!
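A minimal sketch of this pitfall: a perfectly deterministic quadratic relationship where y depends entirely on x, yet the Pearson correlation is essentially zero.

```r
# Symmetric nonlinear relationship: y is fully determined by x,
# but the *linear* association is zero.
x <- -10:10
y <- x^2

cor(x, y)    # essentially 0: correlation misses the quadratic pattern
plot(x, y)   # the scatter plot reveals it immediately
```

This is exactly why plotting comes before computing: `cor()` reports no linear trend even though the relationship is perfect.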
🔢 The Formulas
Fitted line: ŷ = β₀ + β₁x
β₀ = intercept (predicted y when x = 0) | β₁ = slope (change in y per unit x)
β₁ = r · (s_y / s_x) | β₀ = ȳ − β₁x̄
R² = proportion of variance in y explained by x | Range: 0 to 1 (in simple regression, R² = r²)
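The formulas above can be checked by hand against `lm()` (the data points here are arbitrary illustrative numbers):

```r
# Verify that lm()'s coefficients match the formulas
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

b1 <- cor(x, y) * sd(y) / sd(x)   # slope: r * (s_y / s_x)
b0 <- mean(y) - b1 * mean(x)      # intercept: y-bar - b1 * x-bar

fit <- coef(lm(y ~ x))
c(by_hand_slope = b1, lm_slope = unname(fit[2]))   # the two slopes agree
```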
💻 In R — Worked Example (read-only)
Study hours vs. exam scores. We compute correlation, fit a regression, interpret the output, and visualize the relationship.
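A sketch of the steps described above, using simulated data (the variable names and true coefficients are illustrative assumptions, not from a real study):

```r
# Simulated data: study hours vs. exam scores
set.seed(42)
hours  <- runif(50, min = 0, max = 10)
scores <- 55 + 4 * hours + rnorm(50, sd = 5)   # true slope = 4, intercept = 55

cor(hours, scores)            # strength and direction of the linear relationship

model <- lm(scores ~ hours)   # fit: scores = b0 + b1 * hours
summary(model)                # slope, intercept, R-squared, p-values

plot(hours, scores, xlab = "Study hours", ylab = "Exam score")
abline(model, col = "blue")   # overlay the fitted regression line
```

Because we built the data ourselves, we can check the interpretation: the estimated slope should land near 4 (extra points per study hour) and the intercept near 55 (expected score with zero study hours).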
🖐️ Your Turn
Exercise 1 — Salary vs. Experience
Generate data with a known relationship between years_experience and salary. Fit a regression model and interpret every output value — slope, intercept, R², and the correlation.
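One possible starting point, with assumed effect sizes (adjust the numbers to taste; this is a scaffold, not the official solution):

```r
# Simulate salary as a linear function of experience plus noise
set.seed(1)
years_experience <- runif(100, min = 0, max = 20)
salary <- 40000 + 2500 * years_experience + rnorm(100, sd = 8000)

fit <- lm(salary ~ years_experience)
summary(fit)                    # slope ~2500: extra salary per year of experience
cor(years_experience, salary)   # compare its square with the reported R-squared
```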
Exercise 2 — Four Correlations, Four Patterns
Create four scatter plots with correlations of approximately r = 0.9, r = 0.5, r = 0, and r = −0.7. See visually what each correlation value looks like.
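One way to hit a target correlation is y = r·x + √(1 − r²)·noise, with x and noise drawn as independent standard normals (a common simulation trick; sample correlations will only approximate the targets):

```r
# Simulate a pair of variables with approximately the target correlation r
set.seed(7)
make_pair <- function(r, n = 200) {
  x <- rnorm(n)
  y <- r * x + sqrt(1 - r^2) * rnorm(n)
  list(x = x, y = y)
}

targets <- c(0.9, 0.5, 0, -0.7)
par(mfrow = c(2, 2))            # 2x2 grid of scatter plots
sample_r <- sapply(targets, function(r) {
  d <- make_pair(r)
  plot(d$x, d$y, main = sprintf("target r = %.1f, sample r = %.2f",
                                r, cor(d$x, d$y)))
  cor(d$x, d$y)
})
```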
Exercise 3 — Anscombe’s Quartet: Always Visualize!
Anscombe’s Quartet is a famous collection of four datasets with nearly identical summary statistics (means, variances, correlation, regression line) that look completely different when plotted. This is why visualization always comes first.
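Base R ships the quartet as the built-in data frame `anscombe` (columns `x1`–`x4`, `y1`–`y4`), so you can verify the claim directly:

```r
# All four x/y pairs share nearly the same correlation (~0.816)...
quartet_r <- sapply(1:4, function(i)
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
quartet_r

# ...and nearly the same fitted line, yet the plots differ wildly
par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Dataset", i))
  abline(lm(y ~ x), col = "red")
}
```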
⚠️ Correlation ≠ Causation
A strong correlation between x and y does not mean x causes y. Both could be caused by a third variable (a confounder), or the relationship could be coincidental. Regression gives you a predictive equation — not proof of cause and effect. Causation requires experimental design, not just correlation.
🧠 Brain Break
Anscombe’s Quartet teaches the single most important data analysis lesson: statistics alone are not enough. See your data.
Quick check: If R² = 0.65, that means 65% of the variation in y is explained by x. The other 35% is due to other variables or random noise. A high R² does NOT mean the relationship is causal.
✅ Key Takeaway
Correlation measures linear relationship strength (−1 to +1). Regression gives a predictive equation: y = β₀ + β₁x. R² measures explanatory power. Always visualize your data first: Anscombe’s Quartet proves that identical statistics can hide completely different patterns.
🏆 Module 6 Complete!
You can now measure and model linear relationships in R. Next: we extend regression to multiple predictor variables, which lets us control for confounders.