Learn Without Walls
Module 8 of 8 — Statistics with R — CAPSTONE

Full Statistical Analysis in R

Everything together — a complete analysis workflow

📌 Before You Start

What you need: All 7 previous modules completed (or strong familiarity with descriptive stats, distributions, CLT, confidence intervals, hypothesis testing, and regression).

What you’ll do: A complete statistical analysis from raw data to written conclusion, applying every technique from this course in a realistic research scenario.

📖 Everything You’ve Built — Applied Here

This capstone integrates all 7 modules. Here’s what you’ll apply:

M1 Descriptive stats & spread
M2 Probability distributions
M3 Sampling & standard error
M4 Confidence intervals
M5 Hypothesis testing
M6 Correlation & simple regression
M7 Multiple regression & residuals

🏛️ Your Scenario: University Research Analyst

You are a research analyst at a university. A faculty committee wants to understand what factors predict student exam performance. They’ve provided data on 120 students including GPA, study habits, sleep, anxiety, major, year, and first-generation status.

Your job: produce a complete statistical report. The committee needs actionable insights, not just numbers. Run the setup (Task 0), work through Tasks 1–5 in order, then write your conclusion in Task 6.

Important: Run Task 0 (Setup) first. The dataset persists across tasks within the same browser session.

0

Setup — Create the Student Dataset

Run this first: it creates the students data frame (120 observations, 8 variables) that every later task depends on.
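The setup cell itself is not reproduced in this text. A minimal sketch of it, assuming it matches the data-generating lines at the top of the sample solution near the end of this module (same seed, same variable names):

```r
# Task 0 sketch: simulate the students data frame.
# Assumption: this mirrors the setup lines of the sample solution.
set.seed(2024)
n <- 120

students <- data.frame(
  gpa           = pmax(1.5, pmin(4.0, rnorm(n, 3.1, 0.5))),  # clipped to [1.5, 4.0]
  study_hours   = round(runif(n, 3, 40), 1),
  sleep_hours   = pmax(4, pmin(10, rnorm(n, 7, 1.2))),       # clipped to [4, 10]
  anxiety_score = round(runif(n, 1, 10)),
  first_gen     = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.35, 0.65)),
  major         = sample(c("STEM", "Humanities", "Business", "Arts"), n, replace = TRUE),
  year          = sample(1:4, n, replace = TRUE)
)

# Exam score is built from four predictors plus random noise,
# then rounded and clipped to the range 40-100.
students$exam_score <- pmax(40, pmin(100, round(
  55 + 0.3 * students$gpa * 10 + 0.4 * students$study_hours +
    1.5 * students$sleep_hours - 1.2 * students$anxiety_score +
    rnorm(n, 0, 8)
)))

str(students)  # structure check: number of rows and columns, variable types
```

Because exam_score is simulated from gpa, study_hours, sleep_hours, and anxiety_score only, those are the relationships the later tasks can be expected to recover, up to noise. The variables major, year, and first_gen do not enter the formula.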

1

Descriptive Statistics — Get to Know the Data

Compute summary statistics for all numeric variables. Which variables have the most variability (highest CV = SD/mean)? Are any variables notably skewed?
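One possible way to approach this task is sketched below. It assumes the variable names from the sample solution and recreates the dataset only if it is not already in the session; the column name mean_minus_median is just an illustrative label.

```r
# Recreate the Task 0 data if it is not already in the session
# (assumed to match the sample solution's setup lines).
if (!exists("students")) {
  set.seed(2024); n <- 120
  students <- data.frame(
    gpa = pmax(1.5, pmin(4.0, rnorm(n, 3.1, 0.5))),
    study_hours = round(runif(n, 3, 40), 1),
    sleep_hours = pmax(4, pmin(10, rnorm(n, 7, 1.2))),
    anxiety_score = round(runif(n, 1, 10)),
    first_gen = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.35, 0.65)),
    major = sample(c("STEM", "Humanities", "Business", "Arts"), n, replace = TRUE),
    year = sample(1:4, n, replace = TRUE))
  students$exam_score <- pmax(40, pmin(100, round(
    55 + 0.3 * students$gpa * 10 + 0.4 * students$study_hours +
      1.5 * students$sleep_hours - 1.2 * students$anxiety_score + rnorm(n, 0, 8))))
}

num_vars <- c("gpa", "study_hours", "sleep_hours", "anxiety_score", "exam_score")
num <- students[, num_vars]

# One row per variable: centre, spread, and two derived indicators.
desc <- data.frame(
  mean   = sapply(num, mean),
  median = sapply(num, median),
  sd     = sapply(num, sd),
  row.names = num_vars
)
desc$cv <- desc$sd / desc$mean                       # coefficient of variation
desc$mean_minus_median <- desc$mean - desc$median    # rough skew indicator

# Most variable (relative to its mean) first.
print(round(desc[order(-desc$cv), ], 3))
```

A positive mean_minus_median hints at right skew and a negative one at left skew; it is a quick screen, not a formal skewness statistic.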

💡 Look for: Which variable has the largest CV? That’s the most variable relative to its mean. Are mean and median close for each variable (symmetric) or far apart (skewed)?
2

Normality Check — Is GPA Normally Distributed?

Produce a histogram of GPA and a Q-Q plot. If GPA is approximately normal, the Q-Q plot points should lie close to the diagonal line.
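A minimal sketch of the two plots using base R graphics follows. The Shapiro-Wilk test at the end is an optional supplement that this task does not ask for; it is included here only as one possible numeric check.

```r
# Recreate the Task 0 data if it is not already in the session
# (assumed to match the sample solution's setup lines).
if (!exists("students")) {
  set.seed(2024); n <- 120
  students <- data.frame(
    gpa = pmax(1.5, pmin(4.0, rnorm(n, 3.1, 0.5))),
    study_hours = round(runif(n, 3, 40), 1),
    sleep_hours = pmax(4, pmin(10, rnorm(n, 7, 1.2))),
    anxiety_score = round(runif(n, 1, 10)),
    first_gen = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.35, 0.65)),
    major = sample(c("STEM", "Humanities", "Business", "Arts"), n, replace = TRUE),
    year = sample(1:4, n, replace = TRUE))
  students$exam_score <- pmax(40, pmin(100, round(
    55 + 0.3 * students$gpa * 10 + 0.4 * students$study_hours +
      1.5 * students$sleep_hours - 1.2 * students$anxiety_score + rnorm(n, 0, 8))))
}

# Side-by-side histogram and normal Q-Q plot.
op <- par(mfrow = c(1, 2))
hist(students$gpa, breaks = 15, main = "Histogram of GPA", xlab = "GPA")
qqnorm(students$gpa, main = "Normal Q-Q Plot of GPA")
qqline(students$gpa, col = "red")   # reference line through the quartiles
par(op)                             # restore the previous plot layout

# Optional supplement (not required by the task): Shapiro-Wilk normality test.
sw <- shapiro.test(students$gpa)
print(sw)
```

In the simulated data GPA is capped at 4.0, so a few points may pile up at the top of the Q-Q plot and bend away from the line even though the bulk of the distribution looks normal.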

3

Hypothesis Test — Do STEM Students Score Higher?

The committee hypothesizes that STEM students score higher on exams. Test this formally: H₀: STEM mean = non-STEM mean. Report t-statistic, p-value, and plain-English conclusion.
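The test could be run along these lines. This is a sketch, assuming a 0.05 significance level (the task does not fix one) and using R's default Welch t-test; the object names stem, non_stem, and tt are illustrative.

```r
# Recreate the Task 0 data if it is not already in the session
# (assumed to match the sample solution's setup lines).
if (!exists("students")) {
  set.seed(2024); n <- 120
  students <- data.frame(
    gpa = pmax(1.5, pmin(4.0, rnorm(n, 3.1, 0.5))),
    study_hours = round(runif(n, 3, 40), 1),
    sleep_hours = pmax(4, pmin(10, rnorm(n, 7, 1.2))),
    anxiety_score = round(runif(n, 1, 10)),
    first_gen = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.35, 0.65)),
    major = sample(c("STEM", "Humanities", "Business", "Arts"), n, replace = TRUE),
    year = sample(1:4, n, replace = TRUE))
  students$exam_score <- pmax(40, pmin(100, round(
    55 + 0.3 * students$gpa * 10 + 0.4 * students$study_hours +
      1.5 * students$sleep_hours - 1.2 * students$anxiety_score + rnorm(n, 0, 8))))
}

stem     <- students$exam_score[students$major == "STEM"]
non_stem <- students$exam_score[students$major != "STEM"]

# Welch two-sample t-test (R's default, var.equal = FALSE), two-sided,
# matching H0: STEM mean = non-STEM mean.
tt <- t.test(stem, non_stem)

cat("STEM mean:", round(mean(stem), 2),
    "| non-STEM mean:", round(mean(non_stem), 2), "\n")
cat("t =", round(tt$statistic, 3),
    "| df =", round(tt$parameter, 1),
    "| p =", round(tt$p.value, 4), "\n")

alpha <- 0.05  # assumed significance level
if (tt$p.value < alpha) {
  cat("Reject H0: mean exam scores differ between STEM and non-STEM students.\n")
} else {
  cat("Fail to reject H0: no evidence of a difference in mean exam scores.\n")
}

# Directional version of the committee's claim (STEM higher than non-STEM).
tt_greater <- t.test(stem, non_stem, alternative = "greater")
cat("One-sided p (STEM > non-STEM):", round(tt_greater$p.value, 4), "\n")
```

Passing the STEM scores first means a positive t-statistic corresponds to a higher STEM mean. In the sample solution's simulation, major does not enter the exam_score formula, so a non-significant result would be unsurprising.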

4

Correlation Matrix — Which Variables Relate to Exam Score?

Compute a correlation matrix for all numeric variables. Which predictors correlate most strongly with exam_score? Are any predictors correlated with each other (multicollinearity)?
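A sketch of one way to do this is shown below. The 0.7 cutoff for flagging correlated predictors is an assumed rule of thumb, not a value given in this module, and the object names are illustrative.

```r
# Recreate the Task 0 data if it is not already in the session
# (assumed to match the sample solution's setup lines).
if (!exists("students")) {
  set.seed(2024); n <- 120
  students <- data.frame(
    gpa = pmax(1.5, pmin(4.0, rnorm(n, 3.1, 0.5))),
    study_hours = round(runif(n, 3, 40), 1),
    sleep_hours = pmax(4, pmin(10, rnorm(n, 7, 1.2))),
    anxiety_score = round(runif(n, 1, 10)),
    first_gen = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.35, 0.65)),
    major = sample(c("STEM", "Humanities", "Business", "Arts"), n, replace = TRUE),
    year = sample(1:4, n, replace = TRUE))
  students$exam_score <- pmax(40, pmin(100, round(
    55 + 0.3 * students$gpa * 10 + 0.4 * students$study_hours +
      1.5 * students$sleep_hours - 1.2 * students$anxiety_score + rnorm(n, 0, 8))))
}

num_vars   <- c("gpa", "study_hours", "sleep_hours", "anxiety_score", "exam_score")
predictors <- setdiff(num_vars, "exam_score")

# Full Pearson correlation matrix.
cm <- cor(students[, num_vars])
print(round(cm, 2))

# Correlations with exam_score, ranked by absolute size so that strong
# negative relationships are not pushed to the bottom.
r_exam <- cm["exam_score", predictors]
print(round(r_exam[order(abs(r_exam), decreasing = TRUE)], 3))

# Flag predictor pairs whose |r| exceeds the assumed cutoff.
cutoff <- 0.7
pc <- cm[predictors, predictors]
pc[upper.tri(pc, diag = TRUE)] <- NA   # keep each pair once, drop the diagonal
idx <- which(abs(pc) > cutoff, arr.ind = TRUE)
if (nrow(idx) == 0) {
  cat("No predictor pair exceeds |r| =", cutoff, "\n")
} else {
  print(data.frame(var1 = rownames(pc)[idx[, "row"]],
                   var2 = colnames(pc)[idx[, "col"]],
                   r    = round(pc[idx], 3)))
}
```

Ranking by absolute value differs from a plain sort(), which would list a strongly negative correlate such as an anxiety measure last.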

💡 What to look for: The variables with the largest |r| against exam_score are the strongest candidate predictors. If two predictors correlate strongly with each other, including both in a regression can cause multicollinearity, which inflates the standard errors of their coefficients.
5

Multiple Regression — Predict Exam Score

Fit a multiple regression model: exam_score ~ study_hours + sleep_hours + anxiety_score + gpa. Interpret each coefficient. Check model fit and residuals.
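A minimal sketch of the fit and the two most common residual checks, using base R only. The plot titles and object names are illustrative choices.

```r
# Recreate the Task 0 data if it is not already in the session
# (assumed to match the sample solution's setup lines).
if (!exists("students")) {
  set.seed(2024); n <- 120
  students <- data.frame(
    gpa = pmax(1.5, pmin(4.0, rnorm(n, 3.1, 0.5))),
    study_hours = round(runif(n, 3, 40), 1),
    sleep_hours = pmax(4, pmin(10, rnorm(n, 7, 1.2))),
    anxiety_score = round(runif(n, 1, 10)),
    first_gen = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.35, 0.65)),
    major = sample(c("STEM", "Humanities", "Business", "Arts"), n, replace = TRUE),
    year = sample(1:4, n, replace = TRUE))
  students$exam_score <- pmax(40, pmin(100, round(
    55 + 0.3 * students$gpa * 10 + 0.4 * students$study_hours +
      1.5 * students$sleep_hours - 1.2 * students$anxiety_score + rnorm(n, 0, 8))))
}

model <- lm(exam_score ~ study_hours + sleep_hours + anxiety_score + gpa,
            data = students)
s <- summary(model)

# Coefficient table: estimate, standard error, t value, p-value.
print(round(s$coefficients, 3))

# Model fit.
cat("R-squared:", round(s$r.squared, 3),
    "| Adjusted R-squared:", round(s$adj.r.squared, 3), "\n")

# 95% confidence intervals for each coefficient.
print(round(confint(model), 3))

# Residual checks: constant spread around zero, and approximate normality.
op <- par(mfrow = c(1, 2))
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs Fitted")
abline(h = 0, lty = 2)
qqnorm(resid(model), main = "Q-Q Plot of Residuals")
qqline(resid(model), col = "red")
par(op)
```

Each slope is the expected change in exam_score for a one-unit increase in that predictor with the others held fixed. In the sample solution's simulation the underlying slopes are 0.4 (study_hours), 1.5 (sleep_hours), -1.2 (anxiety_score), and 3 (gpa, written as 0.3 × 10), so estimates near those values would be consistent with how the data were generated, allowing for noise, rounding, and clipping.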

6

Write Your Report — Plain-English Conclusions

Write a 6-sentence plain-English report of your findings using cat(). Structure: (1) Data summary. (2) Normality finding. (3) STEM vs. non-STEM result. (4) Top correlates with exam score. (5) Regression model summary. (6) One actionable recommendation for the committee.
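One way to keep the report consistent with your actual output is to build each sentence from computed values rather than typing numbers by hand. The sketch below assumes the variable names from the sample solution; the wording of each sentence is a placeholder to edit, and sentences 2 and 6 deliberately leave the judgment to you.

```r
# Recreate the Task 0 data if it is not already in the session
# (assumed to match the sample solution's setup lines).
if (!exists("students")) {
  set.seed(2024); n <- 120
  students <- data.frame(
    gpa = pmax(1.5, pmin(4.0, rnorm(n, 3.1, 0.5))),
    study_hours = round(runif(n, 3, 40), 1),
    sleep_hours = pmax(4, pmin(10, rnorm(n, 7, 1.2))),
    anxiety_score = round(runif(n, 1, 10)),
    first_gen = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.35, 0.65)),
    major = sample(c("STEM", "Humanities", "Business", "Arts"), n, replace = TRUE),
    year = sample(1:4, n, replace = TRUE))
  students$exam_score <- pmax(40, pmin(100, round(
    55 + 0.3 * students$gpa * 10 + 0.4 * students$study_hours +
      1.5 * students$sleep_hours - 1.2 * students$anxiety_score + rnorm(n, 0, 8))))
}

# Recompute the quantities the report quotes.
predictors <- c("gpa", "study_hours", "sleep_hours", "anxiety_score")
stem     <- students$exam_score[students$major == "STEM"]
non_stem <- students$exam_score[students$major != "STEM"]
tt     <- t.test(stem, non_stem)
r_exam <- cor(students[, c(predictors, "exam_score")])["exam_score", predictors]
top    <- names(r_exam)[which.max(abs(r_exam))]
model  <- lm(exam_score ~ study_hours + sleep_hours + anxiety_score + gpa,
             data = students)
adj_r2 <- summary(model)$adj.r.squared

report <- c(
  sprintf("(1) The data cover %d students; the mean exam score is %.1f (SD %.1f).",
          nrow(students), mean(students$exam_score), sd(students$exam_score)),
  sprintf("(2) GPA has mean %.2f and median %.2f; [add what your Q-Q plot showed].",
          mean(students$gpa), median(students$gpa)),
  sprintf("(3) STEM vs. non-STEM mean scores: %.1f vs. %.1f (t = %.2f, p = %.3f).",
          mean(stem), mean(non_stem), tt$statistic, tt$p.value),
  sprintf("(4) The strongest correlate of exam score is %s (r = %.2f).",
          top, r_exam[top]),
  sprintf("(5) The four-predictor regression has adjusted R-squared %.2f.", adj_r2),
  "(6) [Replace with one actionable recommendation based on your results.]"
)

cat(report, sep = "\n")
```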

💡 Make it yours: Edit the report based on what you actually found in Tasks 1–5. Did you see different results? Update the numbers and interpretations to reflect your actual output.
💡 Sample Solution: All Tasks Together

This shows one complete approach to the setup and Tasks 1–5 (the Task 6 report is left for you to write). Your analysis may differ, and that’s expected: statistics involves judgment.

# ============================================
# COMPLETE CAPSTONE SOLUTION
# ============================================
set.seed(2024); n <- 120
students <- data.frame(
  gpa = pmax(1.5, pmin(4.0, rnorm(n, 3.1, 0.5))),
  study_hours = round(runif(n, 3, 40), 1),
  sleep_hours = pmax(4, pmin(10, rnorm(n, 7, 1.2))),
  anxiety_score = round(runif(n, 1, 10)),
  first_gen = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.35, 0.65)),
  major = sample(c("STEM", "Humanities", "Business", "Arts"), n, replace = TRUE),
  year = sample(1:4, n, replace = TRUE))
students$exam_score <- pmax(40, pmin(100, round(
  55 + 0.3 * students$gpa * 10 + 0.4 * students$study_hours +
    1.5 * students$sleep_hours - 1.2 * students$anxiety_score + rnorm(n, 0, 8))))

# Task 1: Descriptive stats
cat("=== Task 1: Descriptive Stats ===\n")
num_vars <- c("gpa", "study_hours", "sleep_hours", "anxiety_score", "exam_score")
print(summary(students[, num_vars]))

# Task 2: Normality
cat("\n=== Task 2: GPA Normality ===\n")
cat("GPA mean:", round(mean(students$gpa), 3), "| SD:", round(sd(students$gpa), 3), "\n")

# Task 3: T-test
cat("\n=== Task 3: STEM vs Non-STEM ===\n")
result <- t.test(exam_score ~ (major == "STEM"), data = students)
cat("p-value:", round(result$p.value, 4), "\n")

# Task 4: Correlations
cat("\n=== Task 4: Correlations with exam_score ===\n")
cors <- sort(cor(students[, num_vars])["exam_score", ], decreasing = TRUE)
print(round(cors, 3))

# Task 5: Multiple regression
cat("\n=== Task 5: Multiple Regression ===\n")
model <- lm(exam_score ~ study_hours + sleep_hours + anxiety_score + gpa, data = students)
cat("Adj R²:", round(summary(model)$adj.r.squared, 3), "\n")
print(round(coef(model), 3))

🧠 Final Brain Break

You just did a complete statistical analysis — the same workflow used in real research, policy analysis, and data science.

Reflect: Which part of the analysis felt most uncertain? Statistical analysis always involves judgment calls — which variables to include, how to interpret results, what to recommend. The numbers only tell part of the story.

✅ Key Takeaway

A complete statistical analysis follows a workflow: (1) explore and describe, (2) check assumptions, (3) test hypotheses, (4) examine relationships, (5) build a model, (6) communicate findings. The technical skills matter — but so does the plain-English interpretation that makes results actionable.

🎓 You’ve Completed Statistics with R!

You can now run complete statistical analyses in R — from raw data to written conclusions — using the tools that statisticians, researchers, and data scientists use every day.

You’ve mastered: descriptive statistics, probability distributions, the Central Limit Theorem, confidence intervals, hypothesis testing, correlation, simple and multiple regression, and full analysis workflows. All in R. All in the browser.

🔬 R Practice Labs

10 labs focused on R programming skills — data frames, dplyr, ggplot2, and more. The perfect complement to this course.

📊 Data Analyst Course

Apply your statistical and R skills to real-world data analysis scenarios in the full Data Analyst learning path.

📖 Introduction to Statistics

Want to deepen the theory behind what you just practiced? The Introduction to Statistics course covers the conceptual foundations in depth.