Learn Without Walls
Module 6 of 8 — Statistics with R

Correlation & Simple Linear Regression

Relationships between variables

← Module 5: Hypothesis Testing Module 6 of 8 Module 7: Multiple Regression →

📌 Before You Start

What you need: Modules 1–5 completed, or familiarity with descriptive stats and basic hypothesis testing.

What you’ll learn: How to measure the strength and direction of linear relationships with correlation. How to fit a simple linear regression model with lm(). How to interpret slope, intercept, and R². The critical warning: always visualize your data (Anscombe’s Quartet).

📖 The Concept: Correlation and Simple Linear Regression

Correlation (r) measures the strength and direction of a linear relationship between two variables.

Simple linear regression goes further: it gives an equation to predict y from x. The slope tells you how much y changes per unit change in x. R² tells you what fraction of the variability in y is explained by x.

Critical warning: Correlation and regression measure linear relationships only. Two variables can have r = 0 but a strong nonlinear relationship. Always plot your data first!
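To see this concretely, here is a minimal sketch (variable names are illustrative): a perfect parabola is a strong, fully deterministic relationship, yet its linear correlation is essentially zero, because the positive and negative halves cancel.

```r
set.seed(1)
x <- seq(-3, 3, length.out = 100)   # symmetric around 0
y <- x^2                            # deterministic nonlinear relationship
r <- cor(x, y)
cat("Correlation for y = x^2:", round(r, 3), "\n")  # essentially 0
plot(x, y, main = "r near 0, yet a perfect relationship")
```

The correlation comes out as (numerically) zero, even though y is completely determined by x. Only the plot reveals the pattern.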

🔢 The Formulas

ŷ = β₀ + β₁x

β₀ = intercept (y when x = 0)  |  β₁ = slope (change in y per unit x)

R² = 1 − (SSres / SStot)

R² = proportion of variance in y explained by x  |  Range: 0 to 1
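The R² formula can be checked by hand. This sketch (with assumed simulated data) computes SSres and SStot directly and compares the result to what lm()'s summary reports:

```r
# Verify R^2 = 1 - SSres/SStot against lm()'s built-in value.
set.seed(42)
x <- runif(30, 1, 8)
y <- 50 + 5 * x + rnorm(30, 0, 8)
model <- lm(y ~ x)

ss_res <- sum(residuals(model)^2)   # SSres: variation left unexplained
ss_tot <- sum((y - mean(y))^2)      # SStot: total variation in y
r2_hand <- 1 - ss_res / ss_tot

cat("By hand: ", round(r2_hand, 4), "\n")
cat("summary():", round(summary(model)$r.squared, 4), "\n")
```

The two values agree exactly. (In simple linear regression, R² also equals the square of the correlation r.)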

💻 In R — Worked Example (read-only)

Study hours vs. exam scores. We compute correlation, fit a regression, interpret the output, and visualize the relationship.

library(ggplot2)

set.seed(42)
study_hours <- runif(50, 1, 8)
exam_score <- 50 + 5*study_hours + rnorm(50, 0, 8)
exam_score <- pmin(100, pmax(0, exam_score))
df <- data.frame(study_hours, exam_score)

# Correlation
r <- cor(study_hours, exam_score)
cat("Correlation r:", round(r, 3), "\n")

# Linear regression
model <- lm(exam_score ~ study_hours, data=df)
cat("Intercept:", round(coef(model)[1], 2), "\n")
cat("Slope: ", round(coef(model)[2], 2), "\n")
cat("R-squared:", round(summary(model)$r.squared, 3), "\n")
cat("\nInterpretation: Each additional hour of study is associated with",
    round(coef(model)[2], 1), "more exam points.\n")

# Visualize
ggplot(df, aes(x=study_hours, y=exam_score)) +
  geom_point(alpha=0.6, color="#00695C") +
  geom_smooth(method="lm", color="#C62828", se=TRUE) +
  labs(title=paste0("Study Hours vs Exam Score (r = ", round(r, 3), ")"),
       x="Study Hours per Week", y="Exam Score") +
  theme_minimal()

🖐️ Your Turn

Exercise 1 — Salary vs. Experience

Generate data with a known relationship between years_experience and salary. Fit a regression model and interpret every output value — slope, intercept, R², and the correlation.
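One possible sketch. The true slope (3500) and intercept (40000) are assumed values chosen to match the hints for this exercise; sample size and noise level are also my choices.

```r
set.seed(7)
years_experience <- runif(60, 0, 20)
salary <- 40000 + 3500 * years_experience + rnorm(60, 0, 5000)
df <- data.frame(years_experience, salary)

model <- lm(salary ~ years_experience, data = df)
cat("Correlation r:", round(cor(df$years_experience, df$salary), 3), "\n")
cat("Intercept:", round(coef(model)[1]), "\n")  # predicted salary at 0 years
cat("Slope:", round(coef(model)[2]), "\n")      # extra salary per added year
cat("R-squared:", round(summary(model)$r.squared, 3), "\n")
```

Because we programmed the relationship in ourselves, we can check that the fitted slope and intercept recover the true values, up to sampling noise.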

💡 What to notice: The estimated slope (~3500) matches what we programmed in. The intercept (~40000) also roughly matches. R² tells you the explanatory power of experience alone.

Exercise 2 — Four Correlations, Four Patterns

Create four scatter plots with correlations of approximately r = 0.9, r = 0.5, r = 0, and r = −0.7. See visually what each correlation value looks like.
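One way to do this (a sketch, with my own choice of sample size and construction): for standard-normal x and independent noise e, the mixture y = r·x + √(1 − r²)·e has correlation approximately r.

```r
set.seed(42)
n <- 200
x <- rnorm(n)
make_y <- function(r) r * x + sqrt(1 - r^2) * rnorm(n)
targets <- c(0.9, 0.5, 0, -0.7)

par(mfrow = c(2, 2))   # 2x2 grid of scatter plots
for (r in targets) {
  y <- make_y(r)
  plot(x, y, pch = 19, col = "#00695C",
       main = paste0("target r = ", r,
                     ", actual r = ", round(cor(x, y), 2)))
}
par(mfrow = c(1, 1))
```

Notice how r = 0.9 looks like a tight band, r = 0.5 a loose cloud with a visible tilt, r = 0 a shapeless cloud, and r = −0.7 a clear downward trend.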


Exercise 3 — Anscombe’s Quartet: Always Visualize!

Anscombe’s Quartet is a famous collection of four datasets that have nearly identical summary statistics (mean, variance, correlation, regression line) but look completely different when plotted. This is why visualization always comes first.
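A sketch using R's built-in anscombe data frame (columns x1–x4 and y1–y4), plotting all four datasets with their fitted regression lines:

```r
par(mfrow = c(2, 2))   # one panel per dataset
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  plot(x, y, pch = 19, col = "#00695C",
       main = paste0("Dataset ", i, ": r = ", round(cor(x, y), 2),
                     ", slope = ", round(coef(fit)[2], 2)))
  abline(fit, col = "#C62828")
}
par(mfrow = c(1, 1))
```

All four panels report r ≈ 0.82 and a slope ≈ 0.5, yet the scatter patterns are wildly different.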

💡 The lesson: Dataset 1 is a clean linear relationship. Dataset 2 is curved. Dataset 3 has an outlier. Dataset 4 has all points at one x-value except one. Same statistics — completely different patterns. Always plot your data!

⚠️ Correlation ≠ Causation

A strong correlation between x and y does not mean x causes y. Both could be caused by a third variable (a confounder), or the relationship could be coincidental. Regression gives you a predictive equation — not proof of cause and effect. Causation requires experimental design, not just correlation.

🧠 Brain Break

Anscombe’s Quartet teaches the single most important data analysis lesson: statistics alone are not enough. See your data.

Quick check: If R² = 0.65, that means 65% of the variation in y is explained by x. The other 35% is due to other variables or random noise. A high R² does NOT mean the relationship is causal.

✅ Key Takeaway

Correlation measures linear relationship strength (−1 to +1). Regression gives a predictive equation: ŷ = β₀ + β₁x. R² measures explanatory power. Always visualize your data first — Anscombe’s Quartet proves that identical statistics can hide completely different patterns.

🏆 Module 6 Complete!

You can now measure and model linear relationships in R. Next: we extend regression to multiple predictor variables, which lets us control for confounders.

Continue to Module 7: Multiple Regression →
