Learn Without Walls
← Back to Statistics with R
Module 7 of 8 — Statistics with R

Multiple Regression

Predicting with multiple variables

← Module 6: Correlation & Regression Module 7 of 8 Module 8: Capstone →

📌 Before You Start

What you need: Module 6 completed — especially simple linear regression with lm() and interpretation of R².

What you’ll learn: How to extend regression to multiple predictors. How to interpret each coefficient "holding all other variables constant." The difference between R² and adjusted R². How to check residual plots to validate assumptions.

📖 The Concept: Multiple Regression

Multiple regression extends simple regression to include several predictor variables simultaneously. This lets us:

- Control for confounding variables by estimating each predictor's effect with the others held constant
- Compare the unique contribution of each predictor
- Make more accurate predictions than any single variable could alone

Interpreting coefficients: Each β represents the change in y for a one-unit increase in that predictor, holding all other predictors constant. This "holding constant" part is key.

Adjusted R² penalizes for adding predictors that don’t help. It only increases if a new predictor genuinely explains more variance than expected by chance. Always use adjusted R² to compare models with different numbers of predictors.

Residual assumptions: Residuals (actual − predicted) should be: (1) randomly scattered around zero, (2) have constant variance (no fanning), (3) be approximately normal for inference to be valid.

🔢 The Formula

y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε

Each βᵢ = effect of xᵢ holding all other predictors constant  |  ε = residual error

Adj R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)

n = sample size  |  k = number of predictors  |  Penalizes unnecessary variables
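The formula above can be checked directly against R's built-in value. A minimal sketch, using simulated data purely for illustration (variable names are made up):

```r
# Verify the adjusted R² formula against summary()'s built-in value
set.seed(1)
n <- 50; k <- 2                      # sample size and number of predictors
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2*x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
r2  <- summary(fit)$r.squared

# Apply the formula by hand
adj_manual  <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
adj_builtin <- summary(fit)$adj.r.squared

cat("Manual adj R²:  ", round(adj_manual, 6), "\n")
cat("Built-in adj R²:", round(adj_builtin, 6), "\n")
```

The two values agree to machine precision, which is a good sanity check that you understand what `summary()` is reporting.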

💻 In R — Worked Example (read-only)

Predicting exam scores from three variables: study hours, sleep hours, and prior GPA. Each coefficient is interpreted holding the other two constant.

set.seed(42)
n <- 100

# Simulate three predictors and an exam score that depends on all of them
study_hours <- runif(n, 1, 10)
sleep_hours <- rnorm(n, 7, 1)
prior_gpa   <- runif(n, 2.5, 4.0)
exam_score  <- 20 + 4*study_hours + 3*sleep_hours + 10*prior_gpa + rnorm(n, 0, 8)
exam_score  <- pmin(100, pmax(0, exam_score))  # clamp scores to the 0–100 scale
df <- data.frame(study_hours, sleep_hours, prior_gpa, exam_score)

# Fit the multiple regression model
model <- lm(exam_score ~ study_hours + sleep_hours + prior_gpa, data = df)

cat("=== Multiple Regression Results ===\n")
print(summary(model)$coefficients)
cat("\nAdjusted R²:", round(summary(model)$adj.r.squared, 3), "\n")

cat("\nInterpretation:\n")
coefs <- coef(model)
cat("- Each extra study hour adds", round(coefs["study_hours"], 2), "points (holding others constant)\n")
cat("- Each extra sleep hour adds", round(coefs["sleep_hours"], 2), "points\n")
cat("- Each GPA point adds", round(coefs["prior_gpa"], 2), "points\n")

🖐️ Your Turn

Exercise 1 — Salary Prediction with Multiple Predictors

Build a model predicting salary from years_experience, education_level (1–4), and dept_size. Which predictor has the strongest impact? Interpret each coefficient.

💡 Note on comparison: Raw coefficients can’t be directly compared across predictors with different scales (years vs. dollars vs. headcount). For fair comparison, you’d need standardized coefficients (scale each variable to mean 0, SD 1 first).
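Standardizing is one line in R with `scale()`. A minimal sketch, with simulated salary data (the numbers and variable names are illustrative, not a solution):

```r
# Standardized coefficients: scale every variable to mean 0, SD 1, then refit
set.seed(7)
n <- 80
df <- data.frame(
  years_experience = runif(n, 0, 30),
  education_level  = sample(1:4, n, replace = TRUE),
  dept_size        = rpois(n, 40)
)
df$salary <- 30000 + 2000*df$years_experience + 5000*df$education_level +
             100*df$dept_size + rnorm(n, 0, 8000)

df_std  <- as.data.frame(scale(df))   # every column now has mean 0, SD 1
fit_std <- lm(salary ~ years_experience + education_level + dept_size, data = df_std)

# Coefficients are now in "SDs of salary per SD of predictor" — comparable units
print(round(coef(fit_std), 3))
```

Note that the intercept of a fully standardized model is always (numerically) zero, since all variables are centered.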

Exercise 2 — Model Comparison: Does Adding Variables Help?

Fit three models for exam score data: one predictor, two predictors, all three. Compare R² and adjusted R². Does adding variables always improve the model?

💡 What to notice: R² goes up every time you add a variable (even noise). Adjusted R² may go up or down — it only increases when the new variable genuinely helps.
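You can see this mechanically by adding a predictor that is pure noise. A minimal sketch with simulated data (not the exercise solution):

```r
# R² never decreases when you add a predictor — even one that is pure noise.
# Adjusted R², by contrast, can go down.
set.seed(3)
n <- 60
x     <- runif(n, 1, 10)
noise <- rnorm(n)                    # unrelated to y by construction
y     <- 5 + 2*x + rnorm(n, 0, 3)

m1 <- lm(y ~ x)                      # one real predictor
m2 <- lm(y ~ x + noise)              # real predictor plus noise

cat("R²:     ", round(summary(m1)$r.squared, 4), "->",
               round(summary(m2)$r.squared, 4), "\n")
cat("Adj R²: ", round(summary(m1)$adj.r.squared, 4), "->",
               round(summary(m2)$adj.r.squared, 4), "\n")
```

The R² increase is guaranteed by the least-squares fit; whether adjusted R² rises depends on whether the new variable earns its keep.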

Exercise 3 — Residual Analysis

For the salary model from Exercise 1, plot the residuals vs. fitted values. Random scatter around zero = good. Any pattern = problem (the model is missing something).
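The plotting recipe is the same for any `lm()` fit. A minimal sketch using a stand-in model (swap in your own fitted model from Exercise 1):

```r
# Residuals vs. fitted values — the basic assumption check for any lm() fit
set.seed(9)
x <- runif(40, 0, 10)
y <- 3 + 1.5*x + rnorm(40)
fit <- lm(y ~ x)                      # stand-in model for illustration

plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)                # random scatter around this line = good
```

A handy fact: with an intercept in the model, OLS residuals always sum to exactly zero, so the scatter is centered on the dashed line by construction — what you're checking for is curvature or fanning, not the center.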


🧠 Brain Break

Multiple regression is how real-world statistical analysis works. We almost never study one variable in isolation.

Think about it: If you run a simple regression of salary on education and find a big coefficient, it might look like education “causes” high salaries. But if experienced workers also happen to be more educated, multiple regression lets you separate those effects — holding years_experience constant.
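This separation can be demonstrated in a few lines. A hedged sketch with simulated data, where education and experience are correlated by construction (all numbers are made up for illustration):

```r
# Confounding demo: the simple regression overstates education's effect
# because educated workers also tend to be more experienced.
set.seed(11)
n <- 200
experience <- runif(n, 0, 30)
education  <- 1 + 0.1*experience + rnorm(n, 0, 0.8)   # correlated with experience
salary     <- 25000 + 1500*experience + 3000*education + rnorm(n, 0, 5000)

simple   <- lm(salary ~ education)                # ignores experience
multiple <- lm(salary ~ education + experience)   # holds experience constant

cat("Education coef, simple model:  ", round(coef(simple)["education"]), "\n")
cat("Education coef, multiple model:", round(coef(multiple)["education"]), "\n")
```

The simple model folds part of experience's effect into the education coefficient (classic omitted-variable bias); the multiple model recovers something close to the true 3000.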

✅ Key Takeaway

Multiple regression controls for confounders by holding other variables constant. Adjusted R² is more honest than R² for comparing models. Always check residual plots — random scatter around zero means your model’s assumptions are met.

🏆 Module 7 Complete!

You now know how to build and interpret multiple regression models — one of the most widely used statistical tools in research and industry. One module left: bring it all together in a full analysis.

Continue to Module 8: Capstone →
