
Module 8 of 8 — Statistics with R — CAPSTONE

Full Statistical Analysis in R

Everything together — a complete analysis workflow



Before You Start

What you need: All 7 previous modules completed (or strong familiarity with descriptive stats, distributions, CLT, confidence intervals, hypothesis testing, and regression).

What you’ll do: A complete statistical analysis from raw data to written conclusion, applying every technique from this course in a realistic research scenario.

Everything You’ve Built — Applied Here

This capstone integrates all 7 modules. Here’s what you’ll apply:

M1 Descriptive stats & spread

M2 Probability distributions

M3 Sampling & standard error

M4 Confidence intervals

M5 Hypothesis testing

M6 Correlation & simple regression

M7 Multiple regression & residuals

Your Scenario: University Research Analyst

You are a research analyst at a university. A faculty committee wants to understand what factors predict student exam performance. They’ve provided data on 120 students including GPA, study habits, sleep, anxiety, major, year, and first-generation status.

Your job: produce a complete statistical report. The committee needs actionable insights, not just numbers. Work through Tasks 1–5 in order, then write your conclusion in Task 6.

Important: Run Task 0 (Setup) first. The dataset persists across tasks within the same browser session.

Setup — Create the Student Dataset

Run this first. It creates the students data frame with 120 observations and 8 variables. You must run this before any other task.

set.seed(2024)
n <- 120
students <- data.frame(
  gpa           = pmax(1.5, pmin(4.0, rnorm(n, 3.1, 0.5))),
  study_hours   = round(runif(n, 3, 40), 1),
  sleep_hours   = pmax(4, pmin(10, rnorm(n, 7, 1.2))),
  anxiety_score = round(runif(n, 1, 10)),
  first_gen     = sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.35, 0.65)),
  major         = sample(c("STEM","Humanities","Business","Arts"), n, replace=TRUE),
  year          = sample(1:4, n, replace=TRUE)
)
students$exam_score <- pmax(40, pmin(100, round(
  55 + 0.3*students$gpa*10 + 0.4*students$study_hours +
  1.5*students$sleep_hours - 1.2*students$anxiety_score + rnorm(n, 0, 8)
)))

cat("Dataset created: 'students'\n")
cat("Rows:", nrow(students), " | Columns:", ncol(students), "\n\n")
cat("Variables:", paste(names(students), collapse=", "), "\n\n")
cat("=== First 6 rows ===\n")
print(head(students, 6))
cat("\n=== Structure ===\n")
str(students)


Descriptive Statistics — Get to Know the Data

Compute summary statistics for all numeric variables. Which variables have the most variability (highest CV = SD/mean)? Are any variables notably skewed?

# Rebuild dataset (run if needed)
set.seed(2024); n <- 120
students <- data.frame(
  gpa=pmax(1.5,pmin(4.0,rnorm(n,3.1,0.5))),
  study_hours=round(runif(n,3,40),1),
  sleep_hours=pmax(4,pmin(10,rnorm(n,7,1.2))),
  anxiety_score=round(runif(n,1,10)),
  first_gen=sample(c(TRUE,FALSE),n,replace=TRUE,prob=c(0.35,0.65)),
  major=sample(c("STEM","Humanities","Business","Arts"),n,replace=TRUE),
  year=sample(1:4,n,replace=TRUE))
students$exam_score <- pmax(40,pmin(100,round(55+0.3*students$gpa*10+0.4*students$study_hours+1.5*students$sleep_hours-1.2*students$anxiety_score+rnorm(n,0,8))))

# Summary statistics for all numeric variables
num_vars <- c("gpa", "study_hours", "sleep_hours", "anxiety_score", "exam_score")

cat("=== Descriptive Statistics ===\n\n")
for (v in num_vars) {
  x <- students[[v]]
  cat(sprintf("%-15s  mean=%5.2f  median=%5.2f  sd=%5.2f  IQR=%5.2f  CV=%.1f%%\n",
    v, mean(x), median(x), sd(x), IQR(x), sd(x)/mean(x)*100))
}

cat("\n=== Full Summary ===\n")
print(summary(students[, num_vars]))

cat("\n=== Major Distribution ===\n")
print(table(students$major))
cat("\n=== First-Gen Students ===\n")
cat("First-gen:", sum(students$first_gen),
    "  Non-first-gen:", sum(!students$first_gen), "\n")


Look for: Which variable has the largest CV? That’s the most variable relative to its mean. Are mean and median close for each variable (symmetric) or far apart (skewed)?
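
Comparing mean and median only hints at skew. As a minimal sketch, the helper below computes a moment-based skewness coefficient alongside the CV. The function names `sample_skewness` and `cv_percent` and the two toy vectors are assumptions made for illustration; they are not part of the course code.

```r
# Moment-based sample skewness: third central moment divided by the
# 1.5 power of the second central moment (no small-sample correction).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Coefficient of variation in percent (only meaningful for positive means).
cv_percent <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100

# Hypothetical toy vectors, not taken from the students data.
symmetric_values    <- c(1, 2, 3, 4, 5)
right_skewed_values <- c(1, 1, 2, 2, 3, 10)

cat("Skewness (symmetric):   ", round(sample_skewness(symmetric_values), 3), "\n")
cat("Skewness (right-skewed):", round(sample_skewness(right_skewed_values), 3), "\n")
cat("CV% (right-skewed):     ", round(cv_percent(right_skewed_values), 1), "\n")
```

Once the dataset exists, `sapply(students[, num_vars], sample_skewness)` would apply the same helper to every numeric column. Values near 0 suggest symmetry, positive values a longer right tail, and negative values a longer left tail.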

Normality Check — Is GPA Normally Distributed?

Produce a histogram of GPA and a Q-Q plot. If GPA is approximately normal, the Q-Q plot points should lie close to the diagonal line.

set.seed(2024); n <- 120
students <- data.frame(gpa=pmax(1.5,pmin(4.0,rnorm(n,3.1,0.5))),study_hours=round(runif(n,3,40),1),sleep_hours=pmax(4,pmin(10,rnorm(n,7,1.2))),anxiety_score=round(runif(n,1,10)),first_gen=sample(c(TRUE,FALSE),n,replace=TRUE,prob=c(0.35,0.65)),major=sample(c("STEM","Humanities","Business","Arts"),n,replace=TRUE),year=sample(1:4,n,replace=TRUE))
students$exam_score<-pmax(40,pmin(100,round(55+0.3*students$gpa*10+0.4*students$study_hours+1.5*students$sleep_hours-1.2*students$anxiety_score+rnorm(n,0,8))))

par(mfrow = c(1, 2))

# Histogram of GPA
hist(students$gpa,
     main = "GPA Distribution",
     xlab = "GPA",
     col = "#B2DFDB", border = "white",
     breaks = 15, freq = FALSE)
curve(dnorm(x, mean(students$gpa), sd(students$gpa)),
      add = TRUE, col = "#004D40", lwd = 2)
abline(v = mean(students$gpa), col = "blue", lwd = 2, lty = 2)
abline(v = median(students$gpa), col = "red", lwd = 2)

# Q-Q Plot
qqnorm(students$gpa,
       main = "Q-Q Plot of GPA",
       col = "#00695C", pch = 19, cex = 0.7)
qqline(students$gpa, col = "#C62828", lwd = 2)

par(mfrow = c(1, 1))

# Descriptive stats
cat("GPA: mean =", round(mean(students$gpa),3),
    " | median =", round(median(students$gpa),3),
    " | SD =", round(sd(students$gpa),3), "\n")
cat("Mean - median (rough skew indicator) =", round(mean(students$gpa)-median(students$gpa), 3), "\n")
cat("If points are on the Q-Q line: approximately normal. Good to proceed!\n")
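
A Q-Q plot is a visual check. If a formal test is wanted alongside it, one option is the Shapiro-Wilk test from base R, sketched below. This is a complement to the plot, not something the task requires. The sketch recreates only the GPA column on the assumption that it is the first random draw after `set.seed(2024)` in the setup code, so it should match `students$gpa`.

```r
# Recreate the GPA column with the same recipe as the setup code.
set.seed(2024)
gpa <- pmax(1.5, pmin(4.0, rnorm(120, 3.1, 0.5)))

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution.
sw <- shapiro.test(gpa)
cat("Shapiro-Wilk W =", round(sw$statistic, 3),
    " | p-value =", round(sw$p.value, 4), "\n")

# GPA is capped at 4.0, which can pile up values at the ceiling.
cat("Values capped at 4.0:", sum(gpa == 4.0), "\n")
```

A large p-value means the test found no evidence against normality; it does not prove the data are normal. A small p-value with n = 120 may also reflect the cap at 4.0 rather than a problem that matters in practice, so read the test together with the plot.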


Hypothesis Test — Do STEM Students Score Higher?

The committee hypothesizes that STEM students score higher on exams. Test this formally against H₀: STEM mean = non-STEM mean. Report the t-statistic, the p-value, and a plain-English conclusion.

set.seed(2024); n <- 120
students <- data.frame(gpa=pmax(1.5,pmin(4.0,rnorm(n,3.1,0.5))),study_hours=round(runif(n,3,40),1),sleep_hours=pmax(4,pmin(10,rnorm(n,7,1.2))),anxiety_score=round(runif(n,1,10)),first_gen=sample(c(TRUE,FALSE),n,replace=TRUE,prob=c(0.35,0.65)),major=sample(c("STEM","Humanities","Business","Arts"),n,replace=TRUE),year=sample(1:4,n,replace=TRUE))
students$exam_score<-pmax(40,pmin(100,round(55+0.3*students$gpa*10+0.4*students$study_hours+1.5*students$sleep_hours-1.2*students$anxiety_score+rnorm(n,0,8))))

# Split into STEM and non-STEM
stem     <- students$exam_score[students$major == "STEM"]
non_stem <- students$exam_score[students$major != "STEM"]

cat("=== Group Summary ===\n")
cat("STEM:     n =", length(stem),
    " | mean =", round(mean(stem),2),
    " | SD =", round(sd(stem),2), "\n")
cat("Non-STEM: n =", length(non_stem),
    " | mean =", round(mean(non_stem),2),
    " | SD =", round(sd(non_stem),2), "\n\n")

# Two-sample t-test
result <- t.test(stem, non_stem)

cat("=== Two-Sample t-test ===\n")
cat("H0: STEM mean = Non-STEM mean\n")
cat("Ha: STEM mean != Non-STEM mean (two-sided test; the committee expects STEM > Non-STEM)\n\n")
cat("t-statistic:", round(result$statistic, 3), "\n")
cat("p-value (two-tailed):", round(result$p.value, 4), "\n")
cat("95% CI for difference: (",
    round(result$conf.int[1], 2), ",",
    round(result$conf.int[2], 2), ")\n\n")

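if (result$p.value < 0.05) {
  # A two-sided test does not say which group is higher, so report the direction
  direction <- if (mean(stem) > mean(non_stem)) "higher" else "lower"
  cat("Conclusion: Statistically significant difference detected (p < 0.05).\n")
  cat("STEM students scored", direction, "on average than non-STEM students.\n")
} else {
  cat("Conclusion: No statistically significant difference (p >= 0.05).\n")
  cat("The data do not support the committee's hypothesis.\n")
}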

# Visualize with boxplot
boxplot(exam_score ~ (major == "STEM"),
        data = students,
        names = c("Non-STEM", "STEM"),
        main = "Exam Scores by Major Group",
        ylab = "Exam Score",
        col = c("#F8BBD9", "#B2DFDB"),
        border = c("#C62828", "#00695C"))
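
The committee's hypothesis is directional (STEM higher), while `t.test()` runs a two-sided test by default. The sketch below shows how the two relate, using the `alternative = "greater"` argument of `t.test()`. The vectors `group_a` and `group_b` are hypothetical simulated scores, used only to keep the example self-contained.

```r
set.seed(1)
# Hypothetical scores for illustration only; substitute the stem and
# non_stem vectors from the task code when working with the real data.
group_a <- rnorm(30, mean = 78, sd = 10)
group_b <- rnorm(90, mean = 76, sd = 10)

two_sided <- t.test(group_a, group_b)                           # Ha: means differ
one_sided <- t.test(group_a, group_b, alternative = "greater")  # Ha: mean(a) > mean(b)

cat("t           =", round(two_sided$statistic, 3), "\n")
cat("two-sided p =", round(two_sided$p.value, 4), "\n")
cat("one-sided p =", round(one_sided$p.value, 4), "\n")
```

Both calls give the same t-statistic. When t is positive the one-sided p-value is half the two-sided one; when t is negative it is one minus that half. Choose the direction before looking at the data, since picking it afterwards inflates the false-positive rate.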


Correlation Matrix — Which Variables Relate to Exam Score?

Compute a correlation matrix for all numeric variables. Which predictors correlate most strongly with exam_score? Are any predictors correlated with each other (multicollinearity)?

set.seed(2024); n <- 120
students <- data.frame(gpa=pmax(1.5,pmin(4.0,rnorm(n,3.1,0.5))),study_hours=round(runif(n,3,40),1),sleep_hours=pmax(4,pmin(10,rnorm(n,7,1.2))),anxiety_score=round(runif(n,1,10)),first_gen=sample(c(TRUE,FALSE),n,replace=TRUE,prob=c(0.35,0.65)),major=sample(c("STEM","Humanities","Business","Arts"),n,replace=TRUE),year=sample(1:4,n,replace=TRUE))
students$exam_score<-pmax(40,pmin(100,round(55+0.3*students$gpa*10+0.4*students$study_hours+1.5*students$sleep_hours-1.2*students$anxiety_score+rnorm(n,0,8))))

num_vars <- c("gpa","study_hours","sleep_hours","anxiety_score","exam_score")
cor_matrix <- round(cor(students[, num_vars]), 3)

cat("=== Correlation Matrix ===\n\n")
print(cor_matrix)

cat("\n=== Correlations with exam_score ===\n")
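predictors <- setdiff(num_vars, "exam_score")  # drop the trivial r = 1 self-correlation
exam_cors <- cor_matrix["exam_score", predictors]
exam_cors <- exam_cors[order(abs(exam_cors), decreasing=TRUE)]  # strongest |r| first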
for (v in names(exam_cors)) {
  r_val <- exam_cors[v]
  strength <- if (abs(r_val) > 0.7) "strong" else
              if (abs(r_val) > 0.4) "moderate" else "weak"
  direction <- if (r_val > 0) "positive" else "negative"
  cat(sprintf("  %-16s r = %6.3f  (%s %s)\n", v, r_val, strength, direction))
}

# Scatter plot matrix
pairs(students[, num_vars],
      main = "Scatter Plot Matrix",
      col = "#00695C", pch = 19, cex = 0.4,
      upper.panel = function(x, y) {
        points(x, y, col="#00695C", pch=19, cex=0.3)
      },
      lower.panel = function(x, y) {
        r <- round(cor(x, y), 2)
        usr <- par("usr"); on.exit(par(usr = usr))  # restore plot coordinates on exit
        par(usr=c(0,1,0,1))
        text(0.5, 0.5, r, cex=1.2, font=2,
             col=ifelse(abs(r)>0.5,"#C62828","#004D40"))
      })


What to look for: The largest |r| values with exam_score are your most important predictors. If two predictors strongly correlate with each other, including both in a regression may cause multicollinearity.
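
Pairwise correlations can miss multicollinearity that involves several predictors at once. A common complementary diagnostic is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The base-R sketch below assumes a hypothetical helper name, `vif_base`, and a simulated toy data frame in which `x3` is deliberately built as a near-copy of `x1`.

```r
# VIF for each predictor: regress it on the remaining predictors
# and compute 1 / (1 - R^2).
vif_base <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

set.seed(42)
# Hypothetical toy data: x1 and x2 are independent, x3 nearly duplicates x1.
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$x3 <- toy$x1 + rnorm(100, sd = 0.1)

print(round(vif_base(toy, c("x1", "x2", "x3")), 2))
```

For the course data, the call would be `vif_base(students, c("study_hours", "sleep_hours", "anxiety_score", "gpa"))`. A VIF of 1 means no inflation. Values above roughly 5 to 10 are often treated as a warning sign, though that threshold is a rule of thumb rather than a strict cutoff.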

Multiple Regression — Predict Exam Score

Fit a multiple regression model: exam_score ~ study_hours + sleep_hours + anxiety_score + gpa. Interpret each coefficient. Check model fit and residuals.

set.seed(2024); n <- 120
students <- data.frame(gpa=pmax(1.5,pmin(4.0,rnorm(n,3.1,0.5))),study_hours=round(runif(n,3,40),1),sleep_hours=pmax(4,pmin(10,rnorm(n,7,1.2))),anxiety_score=round(runif(n,1,10)),first_gen=sample(c(TRUE,FALSE),n,replace=TRUE,prob=c(0.35,0.65)),major=sample(c("STEM","Humanities","Business","Arts"),n,replace=TRUE),year=sample(1:4,n,replace=TRUE))
students$exam_score<-pmax(40,pmin(100,round(55+0.3*students$gpa*10+0.4*students$study_hours+1.5*students$sleep_hours-1.2*students$anxiety_score+rnorm(n,0,8))))

model <- lm(exam_score ~ study_hours + sleep_hours + anxiety_score + gpa,
            data = students)

cat("=== Multiple Regression: exam_score ~ all predictors ===\n\n")
cat("Coefficients:\n")
print(round(summary(model)$coefficients, 3))

r2     <- summary(model)$r.squared
adj_r2 <- summary(model)$adj.r.squared
cat(sprintf("\nR² = %.3f  |  Adjusted R² = %.3f\n", r2, adj_r2))
cat("The model explains", round(adj_r2*100, 1),
    "% of variance in exam scores.\n\n")

coefs <- coef(model)
cat("=== Plain-English Interpretation ===\n")
cat(sprintf("- Each additional study hour adds %.2f points (holding others constant)\n",
            coefs["study_hours"]))
cat(sprintf("- Each additional sleep hour adds %.2f points\n",
            coefs["sleep_hours"]))
cat(sprintf("- Each point of anxiety reduces score by %.2f points\n",
            abs(coefs["anxiety_score"])))
cat(sprintf("- Each GPA point adds %.2f points\n", coefs["gpa"]))

# Residual plot
plot(model$fitted.values, model$residuals,
     main = "Residuals vs Fitted Values",
     xlab = "Fitted Values", ylab = "Residuals",
     col = "#00695C", pch = 19, cex = 0.7)
abline(h = 0, col = "#C62828", lwd = 2, lty = 2)
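
Point estimates alone hide how uncertain each coefficient is. As a sketch that ties the regression back to the confidence-interval module, the code below rebuilds the dataset with the setup recipe, then uses `confint()` for coefficient intervals and `predict()` with `interval = "prediction"` for a single new student. The profile in `new_student` is a hypothetical example chosen for illustration.

```r
# Rebuild the dataset with the same recipe as the setup code.
set.seed(2024); n <- 120
students <- data.frame(
  gpa           = pmax(1.5, pmin(4.0, rnorm(n, 3.1, 0.5))),
  study_hours   = round(runif(n, 3, 40), 1),
  sleep_hours   = pmax(4, pmin(10, rnorm(n, 7, 1.2))),
  anxiety_score = round(runif(n, 1, 10)),
  first_gen     = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.35, 0.65)),
  major         = sample(c("STEM", "Humanities", "Business", "Arts"), n, replace = TRUE),
  year          = sample(1:4, n, replace = TRUE)
)
students$exam_score <- pmax(40, pmin(100, round(
  55 + 0.3 * students$gpa * 10 + 0.4 * students$study_hours +
  1.5 * students$sleep_hours - 1.2 * students$anxiety_score + rnorm(n, 0, 8)
)))

model <- lm(exam_score ~ study_hours + sleep_hours + anxiety_score + gpa,
            data = students)

# 95% confidence intervals for the intercept and each slope.
ci <- confint(model, level = 0.95)
cat("=== 95% CIs for coefficients ===\n")
print(round(ci, 3))

# Hypothetical student profile, for illustration only.
new_student <- data.frame(study_hours = 20, sleep_hours = 7,
                          anxiety_score = 5, gpa = 3.0)
pred <- predict(model, newdata = new_student,
                interval = "prediction", level = 0.95)
cat("\n=== Predicted score with 95% prediction interval ===\n")
print(round(pred, 1))
```

A coefficient whose interval includes 0 is not statistically significant at the 5% level. The prediction interval is wider than a confidence interval for the mean because it also accounts for the scatter of individual students around the fitted line.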


Write Your Report — Plain-English Conclusions

Write a six-part plain-English report of your findings using cat(). Structure: (1) Data summary. (2) Normality finding. (3) STEM vs. non-STEM result. (4) Top correlates with exam score. (5) Regression model summary. (6) One actionable recommendation for the committee.

cat("==============================================\n")
cat("STUDENT PERFORMANCE ANALYSIS — FINAL REPORT\n")
cat("University Research Office, 2026\n")
cat("==============================================\n\n")

cat("1. DATA SUMMARY:\n")
cat("   We analyzed 120 students across four majors (STEM, Humanities,\n")
cat("   Business, and Arts). Variables included GPA, study hours per week,\n")
cat("   nightly sleep, anxiety score (1-10), and final exam performance.\n")
cat("   Study hours (mean ~21h/wk) showed the most variability across students.\n\n")

cat("2. NORMALITY:\n")
cat("   GPA was approximately normally distributed (mean=3.10, SD=0.49).\n")
cat("   The Q-Q plot showed points close to the diagonal, consistent with\n")
cat("   approximate normality. Regression residuals were checked separately.\n\n")

cat("3. STEM VS. NON-STEM:\n")
cat("   A two-sample t-test found no statistically significant difference\n")
cat("   in exam scores between STEM and non-STEM students (p > 0.05).\n")
cat("   The committee's hypothesis about STEM advantage was not supported.\n\n")

cat("4. CORRELATES OF EXAM PERFORMANCE:\n")
cat("   Study hours showed the strongest positive correlation with exam\n")
cat("   scores (r ≈ 0.65). Anxiety was negatively correlated (r ≈ -0.45).\n")
cat("   GPA was a moderate positive predictor. Sleep hours also correlated\n")
cat("   positively with exam performance.\n\n")

cat("5. REGRESSION MODEL:\n")
cat("   A multiple regression model (exam ~ study_hours + sleep + anxiety + GPA)\n")
cat("   explained approximately 60-65% of variance in exam scores (adj. R²).\n")
cat("   All four predictors were statistically significant. Each additional\n")
cat("   study hour added ~0.4 points; each anxiety point reduced score by ~1.2.\n\n")

cat("6. RECOMMENDATION:\n")
cat("   Interventions targeting study hours and anxiety management are most\n")
cat("   likely to improve student outcomes. Sleep hygiene programs may also\n")
cat("   provide meaningful benefit. Major-based interventions are not supported\n")
cat("   by the current data.\n")
cat("==============================================\n")


Make it yours: Edit the report based on what you actually found in Tasks 1–5. Treat every figure in the template (means, correlations, R², p-values) as a placeholder: replace it with the value from your own output, then adjust the interpretation to match.
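
One way to keep the report in sync with the analysis is to build its sentences from computed values with `sprintf()` instead of typing numbers by hand. The sketch below rebuilds the dataset with the setup recipe and produces four of the report lines. The wording of the sentences and the variable names (`report`, `top`, `tt`) are assumptions for illustration, not a required format.

```r
# Rebuild the dataset with the same recipe as the setup code.
set.seed(2024); n <- 120
students <- data.frame(
  gpa           = pmax(1.5, pmin(4.0, rnorm(n, 3.1, 0.5))),
  study_hours   = round(runif(n, 3, 40), 1),
  sleep_hours   = pmax(4, pmin(10, rnorm(n, 7, 1.2))),
  anxiety_score = round(runif(n, 1, 10)),
  first_gen     = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.35, 0.65)),
  major         = sample(c("STEM", "Humanities", "Business", "Arts"), n, replace = TRUE),
  year          = sample(1:4, n, replace = TRUE)
)
students$exam_score <- pmax(40, pmin(100, round(
  55 + 0.3 * students$gpa * 10 + 0.4 * students$study_hours +
  1.5 * students$sleep_hours - 1.2 * students$anxiety_score + rnorm(n, 0, 8)
)))

predictors <- c("study_hours", "sleep_hours", "anxiety_score", "gpa")

# Recompute the key quantities from Tasks 3-5.
is_stem <- students$major == "STEM"
tt      <- t.test(students$exam_score[is_stem], students$exam_score[!is_stem])
cors    <- cor(students[, predictors], students$exam_score)[, 1]
top     <- names(cors)[which.max(abs(cors))]
model   <- lm(exam_score ~ study_hours + sleep_hours + anxiety_score + gpa,
              data = students)
adj_r2  <- summary(model)$adj.r.squared

# Build each sentence from the computed values.
report <- c(
  sprintf("We analyzed %d students; mean exam score was %.1f (SD %.1f).",
          nrow(students), mean(students$exam_score), sd(students$exam_score)),
  sprintf("STEM vs. non-STEM: t = %.2f, p = %.3f (%s at the 0.05 level).",
          tt$statistic, tt$p.value,
          if (tt$p.value < 0.05) "significant" else "not significant"),
  sprintf("Strongest correlate of exam score: %s (r = %.2f).", top, cors[top]),
  sprintf("The regression model explained %.1f%% of variance (adjusted R-squared).",
          adj_r2 * 100)
)
cat(report, sep = "\n")
```

Because every number comes from the fitted objects, rerunning the script after a change to the data or model updates the report automatically. The recommendation in section 6 still has to be written by hand, since it depends on judgment rather than on a single statistic.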

View Sample Solution: All Tasks Together

This shows one complete approach to all 6 tasks. Your analysis may differ — that’s expected! Statistics involves judgment.

# ============================================
# COMPLETE CAPSTONE SOLUTION
# ============================================
set.seed(2024); n <- 120
students <- data.frame(
  gpa=pmax(1.5,pmin(4.0,rnorm(n,3.1,0.5))),
  study_hours=round(runif(n,3,40),1),
  sleep_hours=pmax(4,pmin(10,rnorm(n,7,1.2))),
  anxiety_score=round(runif(n,1,10)),
  first_gen=sample(c(TRUE,FALSE),n,replace=TRUE,prob=c(0.35,0.65)),
  major=sample(c("STEM","Humanities","Business","Arts"),n,replace=TRUE),
  year=sample(1:4,n,replace=TRUE))
students$exam_score<-pmax(40,pmin(100,round(55+0.3*students$gpa*10+0.4*students$study_hours+1.5*students$sleep_hours-1.2*students$anxiety_score+rnorm(n,0,8))))

# Task 1: Descriptive stats
cat("=== Task 1: Descriptive Stats ===\n")
num_vars <- c("gpa","study_hours","sleep_hours","anxiety_score","exam_score")
print(summary(students[, num_vars]))

# Task 2: Normality
cat("\n=== Task 2: GPA Normality ===\n")
cat("GPA mean:", round(mean(students$gpa),3), "| SD:", round(sd(students$gpa),3),"\n")

# Task 3: T-test
cat("\n=== Task 3: STEM vs Non-STEM ===\n")
result <- t.test(exam_score ~ (major=="STEM"), data=students)
cat("p-value:", round(result$p.value, 4), "\n")

# Task 4: Correlations
cat("\n=== Task 4: Correlations with exam_score ===\n")
cors <- sort(cor(students[,num_vars])["exam_score",], decreasing=TRUE)
print(round(cors, 3))

# Task 5: Multiple regression
cat("\n=== Task 5: Multiple Regression ===\n")
model <- lm(exam_score ~ study_hours + sleep_hours + anxiety_score + gpa, data=students)
cat("Adj R²:", round(summary(model)$adj.r.squared, 3), "\n")
print(round(coef(model), 3))

Final Brain Break

You just did a complete statistical analysis — the same workflow used in real research, policy analysis, and data science.

Reflect: Which part of the analysis felt most uncertain? Statistical analysis always involves judgment calls — which variables to include, how to interpret results, what to recommend. The numbers only tell part of the story.

Key Takeaway

A complete statistical analysis follows a workflow: (1) explore and describe, (2) check assumptions, (3) test hypotheses, (4) examine relationships, (5) build a model, (6) communicate findings. The technical skills matter — but so does the plain-English interpretation that makes results actionable.

You’ve Completed Statistics with R!

You can now run complete statistical analyses in R — from raw data to written conclusions — using the tools that statisticians, researchers, and data scientists use every day.

You’ve mastered: descriptive statistics, probability distributions, the Central Limit Theorem, confidence intervals, hypothesis testing, correlation, simple and multiple regression, and full analysis workflows. All in R. All in the browser.
