Learn Without Walls
Module 6 of 8 — Statistics with R

Correlation & Simple Linear Regression

Relationships between variables

← Module 5: Hypothesis Testing Module 6 of 8 Module 7: Multiple Regression →

📌 Before You Start

What you need: Modules 1–5 completed, or familiarity with descriptive stats and basic hypothesis testing.

What you’ll learn: How to measure the strength and direction of linear relationships with correlation. How to fit a simple linear regression model with lm(). How to interpret slope, intercept, and R². The critical warning: always visualize your data (Anscombe’s Quartet).

📖 The Concept: Correlation and Simple Linear Regression

Correlation (r) measures the strength and direction of a linear relationship between two variables.

Simple linear regression goes further: it gives an equation to predict y from x. The slope tells you how much y changes per unit change in x. R² tells you what fraction of the variability in y is explained by x.

Critical warning: Correlation and regression measure linear relationships only. Two variables can have r = 0 but a strong nonlinear relationship. Always plot your data first!
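To see this concretely, here is a minimal sketch (variable names are illustrative): a perfect parabola is a strong, fully deterministic relationship, yet its linear correlation is essentially zero, because the positive and negative halves cancel.

```r
set.seed(1)
x <- seq(-3, 3, length.out = 100)   # symmetric around 0
y <- x^2                            # deterministic nonlinear relationship
r <- cor(x, y)
cat("Correlation for y = x^2:", round(r, 3), "\n")  # essentially 0
plot(x, y, main = "r near 0, yet a perfect relationship")
```

The correlation comes out as (numerically) zero, even though y is completely determined by x. Only the plot reveals the pattern.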

🔢 The Formulas

ŷ = β₀ + β₁x

β₀ = intercept (y when x = 0)  |  β₁ = slope (change in y per unit x)

R² = 1 − (SSres / SStot)

R² = proportion of variance in y explained by x  |  Range: 0 to 1
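The R² formula can be checked by hand. This sketch (with assumed simulated data) computes SSres and SStot directly and compares the result to what lm()'s summary reports:

```r
# Verify R^2 = 1 - SSres/SStot against lm()'s built-in value.
set.seed(42)
x <- runif(30, 1, 8)
y <- 50 + 5 * x + rnorm(30, 0, 8)
model <- lm(y ~ x)

ss_res <- sum(residuals(model)^2)   # SSres: variation left unexplained
ss_tot <- sum((y - mean(y))^2)      # SStot: total variation in y
r2_hand <- 1 - ss_res / ss_tot

cat("By hand: ", round(r2_hand, 4), "\n")
cat("summary():", round(summary(model)$r.squared, 4), "\n")
```

The two values agree exactly. (In simple linear regression, R² also equals the square of the correlation r.)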

💻 In R — Worked Example (read-only)

Study hours vs. exam scores. We compute correlation, fit a regression, interpret the output, and visualize the relationship.

library(ggplot2)

set.seed(42)
study_hours <- runif(50, 1, 8)
exam_score <- 50 + 5*study_hours + rnorm(50, 0, 8)
exam_score <- pmin(100, pmax(0, exam_score))
df <- data.frame(study_hours, exam_score)

# Correlation
r <- cor(study_hours, exam_score)
cat("Correlation r:", round(r, 3), "\n")

# Linear regression
model <- lm(exam_score ~ study_hours, data=df)
cat("Intercept:", round(coef(model)[1], 2), "\n")
cat("Slope: ", round(coef(model)[2], 2), "\n")
cat("R-squared:", round(summary(model)$r.squared, 3), "\n")
cat("\nInterpretation: Each additional hour of study is associated with",
    round(coef(model)[2], 1), "more exam points.\n")

# Visualize
ggplot(df, aes(x=study_hours, y=exam_score)) +
  geom_point(alpha=0.6, color="#00695C") +
  geom_smooth(method="lm", color="#C62828", se=TRUE) +
  labs(title=paste0("Study Hours vs Exam Score (r = ", round(r, 3), ")"),
       x="Study Hours per Week", y="Exam Score") +
  theme_minimal()

🖐️ Your Turn

Exercise 1 — Salary vs. Experience

Generate data with a known relationship between years_experience and salary. Fit a regression model and interpret every output value — slope, intercept, R², and the correlation.
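One possible sketch. The true slope (3500) and intercept (40000) are assumed values chosen to match the hints for this exercise; sample size and noise level are also my choices.

```r
set.seed(7)
years_experience <- runif(60, 0, 20)
salary <- 40000 + 3500 * years_experience + rnorm(60, 0, 5000)
df <- data.frame(years_experience, salary)

model <- lm(salary ~ years_experience, data = df)
cat("Correlation r:", round(cor(df$years_experience, df$salary), 3), "\n")
cat("Intercept:", round(coef(model)[1]), "\n")  # predicted salary at 0 years
cat("Slope:", round(coef(model)[2]), "\n")      # extra salary per added year
cat("R-squared:", round(summary(model)$r.squared, 3), "\n")
```

Because we programmed the relationship in ourselves, we can check that the fitted slope and intercept recover the true values, up to sampling noise.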

💡 What to notice: The estimated slope (~3500) matches what we programmed in. The intercept (~40000) also roughly matches. R² tells you the explanatory power of experience alone.

Exercise 2 — Four Correlations, Four Patterns

Create four scatter plots with correlations of approximately r = 0.9, r = 0.5, r = 0, and r = −0.7. See visually what each correlation value looks like.
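One way to do this (a sketch, with my own choice of sample size and construction): for standard-normal x and independent noise e, the mixture y = r·x + √(1 − r²)·e has correlation approximately r.

```r
set.seed(42)
n <- 200
x <- rnorm(n)
make_y <- function(r) r * x + sqrt(1 - r^2) * rnorm(n)
targets <- c(0.9, 0.5, 0, -0.7)

par(mfrow = c(2, 2))   # 2x2 grid of scatter plots
for (r in targets) {
  y <- make_y(r)
  plot(x, y, pch = 19, col = "#00695C",
       main = paste0("target r = ", r,
                     ", actual r = ", round(cor(x, y), 2)))
}
par(mfrow = c(1, 1))
```

Notice how r = 0.9 looks like a tight band, r = 0.5 a loose cloud with a visible tilt, r = 0 a shapeless cloud, and r = −0.7 a clear downward trend.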


Exercise 3 — Anscombe’s Quartet: Always Visualize!

Anscombe’s Quartet is a famous collection of four datasets that have nearly identical summary statistics (mean, variance, correlation, regression line) but look completely different when plotted. This is why visualization always comes first.
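A sketch using R's built-in anscombe data frame (columns x1–x4 and y1–y4), plotting all four datasets with their fitted regression lines:

```r
par(mfrow = c(2, 2))   # one panel per dataset
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  plot(x, y, pch = 19, col = "#00695C",
       main = paste0("Dataset ", i, ": r = ", round(cor(x, y), 2),
                     ", slope = ", round(coef(fit)[2], 2)))
  abline(fit, col = "#C62828")
}
par(mfrow = c(1, 1))
```

All four panels report r ≈ 0.82 and a slope ≈ 0.5, yet the scatter patterns are wildly different.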

💡 The lesson: Dataset 1 is a clean linear relationship. Dataset 2 is curved. Dataset 3 has an outlier. Dataset 4 has all points at one x-value except one. Same statistics — completely different patterns. Always plot your data!

⚠️ Correlation ≠ Causation

A strong correlation between x and y does not mean x causes y. Both could be caused by a third variable (a confounder), or the relationship could be coincidental. Regression gives you a predictive equation — not proof of cause and effect. Causation requires experimental design, not just correlation.

🧠 Brain Break

Anscombe’s Quartet teaches the single most important data analysis lesson: statistics alone are not enough. See your data.

Quick check: If R² = 0.65, that means 65% of the variation in y is explained by x. The other 35% is due to other variables or random noise. A high R² does NOT mean the relationship is causal.

✅ Key Takeaway

Correlation measures linear relationship strength (−1 to +1). Regression gives a predictive equation: ŷ = β₀ + β₁x. R² measures explanatory power. Always visualize your data first — Anscombe’s Quartet proves that identical statistics can hide completely different patterns.

🏆 Module 6 Complete!

You can now measure and model linear relationships in R. Next: we extend regression to multiple predictor variables, which lets us control for confounders.

Continue to Module 7: Multiple Regression →
