Learn Without Walls
← Back to Statistics with R
Module 7 of 8 — Statistics with R

Multiple Regression

Predicting with multiple variables

← Module 6: Correlation & Regression Module 7 of 8 Module 8: Capstone →

📌 Before You Start

What you need: Module 6 completed — especially simple linear regression with lm() and interpretation of R².

What you’ll learn: How to extend regression to multiple predictors. How to interpret each coefficient "holding all other variables constant." The difference between R² and adjusted R². How to check residual plots to validate assumptions.

📖 The Concept: Multiple Regression

Multiple regression extends simple regression to include several predictor variables simultaneously. This lets us:

- Control for confounding variables by estimating each predictor's effect with the others held constant
- Compare the unique contribution of each predictor
- Make more accurate predictions than any single variable could alone

Interpreting coefficients: Each β represents the change in y for a one-unit increase in that predictor, holding all other predictors constant. This "holding constant" part is key.

Adjusted R² penalizes for adding predictors that don’t help. It only increases if a new predictor genuinely explains more variance than expected by chance. Always use adjusted R² to compare models with different numbers of predictors.

Residual assumptions: Residuals (actual − predicted) should be: (1) randomly scattered around zero, (2) have constant variance (no fanning), (3) be approximately normal for inference to be valid.

🔢 The Formula

y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε

Each βᵢ = effect of xᵢ holding all other predictors constant  |  ε = residual error

Adj R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)

n = sample size  |  k = number of predictors  |  Penalizes unnecessary variables
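The formula above can be checked directly against R's built-in value. A minimal sketch, using simulated data purely for illustration (variable names are made up):

```r
# Verify the adjusted R² formula against summary()'s built-in value
set.seed(1)
n <- 50; k <- 2                      # sample size and number of predictors
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2*x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
r2  <- summary(fit)$r.squared

# Apply the formula by hand
adj_manual  <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
adj_builtin <- summary(fit)$adj.r.squared

cat("Manual adj R²:  ", round(adj_manual, 6), "\n")
cat("Built-in adj R²:", round(adj_builtin, 6), "\n")
```

The two values agree to machine precision, which is a good sanity check that you understand what `summary()` is reporting.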

💻 In R — Worked Example (read-only)

Predicting exam scores from three variables: study hours, sleep hours, and prior GPA. Each coefficient is interpreted holding the other two constant.

set.seed(42)
n <- 100

# Simulate three predictors and an exam score that depends on all of them
study_hours <- runif(n, 1, 10)
sleep_hours <- rnorm(n, 7, 1)
prior_gpa   <- runif(n, 2.5, 4.0)
exam_score  <- 20 + 4*study_hours + 3*sleep_hours + 10*prior_gpa + rnorm(n, 0, 8)
exam_score  <- pmin(100, pmax(0, exam_score))  # clamp scores to the 0–100 scale
df <- data.frame(study_hours, sleep_hours, prior_gpa, exam_score)

# Fit the multiple regression model
model <- lm(exam_score ~ study_hours + sleep_hours + prior_gpa, data = df)

cat("=== Multiple Regression Results ===\n")
print(summary(model)$coefficients)
cat("\nAdjusted R²:", round(summary(model)$adj.r.squared, 3), "\n")

cat("\nInterpretation:\n")
coefs <- coef(model)
cat("- Each extra study hour adds", round(coefs["study_hours"], 2), "points (holding others constant)\n")
cat("- Each extra sleep hour adds", round(coefs["sleep_hours"], 2), "points\n")
cat("- Each GPA point adds", round(coefs["prior_gpa"], 2), "points\n")

🖐️ Your Turn

Exercise 1 — Salary Prediction with Multiple Predictors

Build a model predicting salary from years_experience, education_level (1–4), and dept_size. Which predictor has the strongest impact? Interpret each coefficient.

💡 Note on comparison: Raw coefficients can’t be directly compared across predictors with different scales (years vs. dollars vs. headcount). For fair comparison, you’d need standardized coefficients (scale each variable to mean 0, SD 1 first).
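Standardizing is one line in R with `scale()`. A minimal sketch, with simulated salary data (the numbers and variable names are illustrative, not a solution):

```r
# Standardized coefficients: scale every variable to mean 0, SD 1, then refit
set.seed(7)
n <- 80
df <- data.frame(
  years_experience = runif(n, 0, 30),
  education_level  = sample(1:4, n, replace = TRUE),
  dept_size        = rpois(n, 40)
)
df$salary <- 30000 + 2000*df$years_experience + 5000*df$education_level +
             100*df$dept_size + rnorm(n, 0, 8000)

df_std  <- as.data.frame(scale(df))   # every column now has mean 0, SD 1
fit_std <- lm(salary ~ years_experience + education_level + dept_size, data = df_std)

# Coefficients are now in "SDs of salary per SD of predictor" — comparable units
print(round(coef(fit_std), 3))
```

Note that the intercept of a fully standardized model is always (numerically) zero, since all variables are centered.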

Exercise 2 — Model Comparison: Does Adding Variables Help?

Fit three models for exam score data: one predictor, two predictors, all three. Compare R² and adjusted R². Does adding variables always improve the model?

💡 What to notice: R² goes up every time you add a variable (even noise). Adjusted R² may go up or down — it only increases when the new variable genuinely helps.
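You can see this mechanically by adding a predictor that is pure noise. A minimal sketch with simulated data (not the exercise solution):

```r
# R² never decreases when you add a predictor — even one that is pure noise.
# Adjusted R², by contrast, can go down.
set.seed(3)
n <- 60
x     <- runif(n, 1, 10)
noise <- rnorm(n)                    # unrelated to y by construction
y     <- 5 + 2*x + rnorm(n, 0, 3)

m1 <- lm(y ~ x)                      # one real predictor
m2 <- lm(y ~ x + noise)              # real predictor plus noise

cat("R²:     ", round(summary(m1)$r.squared, 4), "->",
               round(summary(m2)$r.squared, 4), "\n")
cat("Adj R²: ", round(summary(m1)$adj.r.squared, 4), "->",
               round(summary(m2)$adj.r.squared, 4), "\n")
```

The R² increase is guaranteed by the least-squares fit; whether adjusted R² rises depends on whether the new variable earns its keep.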

Exercise 3 — Residual Analysis

For the salary model from Exercise 1, plot the residuals vs. fitted values. Random scatter around zero = good. Any pattern = problem (the model is missing something).
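The plotting recipe is the same for any `lm()` fit. A minimal sketch using a stand-in model (swap in your own fitted model from Exercise 1):

```r
# Residuals vs. fitted values — the basic assumption check for any lm() fit
set.seed(9)
x <- runif(40, 0, 10)
y <- 3 + 1.5*x + rnorm(40)
fit <- lm(y ~ x)                      # stand-in model for illustration

plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)                # random scatter around this line = good
```

A handy fact: with an intercept in the model, OLS residuals always sum to exactly zero, so the scatter is centered on the dashed line by construction — what you're checking for is curvature or fanning, not the center.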


🧠 Brain Break

Multiple regression is how real-world statistical analysis works. We almost never study one variable in isolation.

Think about it: If you run a simple regression of salary on education and find a big coefficient, it might look like education “causes” high salaries. But if experienced workers also happen to be more educated, multiple regression lets you separate those effects — holding years_experience constant.
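This separation can be demonstrated in a few lines. A hedged sketch with simulated data, where education and experience are correlated by construction (all numbers are made up for illustration):

```r
# Confounding demo: the simple regression overstates education's effect
# because educated workers also tend to be more experienced.
set.seed(11)
n <- 200
experience <- runif(n, 0, 30)
education  <- 1 + 0.1*experience + rnorm(n, 0, 0.8)   # correlated with experience
salary     <- 25000 + 1500*experience + 3000*education + rnorm(n, 0, 5000)

simple   <- lm(salary ~ education)                # ignores experience
multiple <- lm(salary ~ education + experience)   # holds experience constant

cat("Education coef, simple model:  ", round(coef(simple)["education"]), "\n")
cat("Education coef, multiple model:", round(coef(multiple)["education"]), "\n")
```

The simple model folds part of experience's effect into the education coefficient (classic omitted-variable bias); the multiple model recovers something close to the true 3000.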

✅ Key Takeaway

Multiple regression controls for confounders by holding other variables constant. Adjusted R² is more honest than R² for comparing models. Always check residual plots — random scatter around zero means your model’s assumptions are met.

🏆 Module 7 Complete!

You now know how to build and interpret multiple regression models — one of the most widely used statistical tools in research and industry. One module left: bring it all together in a full analysis.

Continue to Module 8: Capstone →
