Statistical Analysis

From data to insights — the core of R

← Lab 6: ggplot2 Lab 7 of 10 Lab 8: Tidyr →

Concept Recap

R was built for statistics. Its core functions cover everything from descriptive stats to hypothesis testing:

Descriptive: mean(), median(), sd(), var(), summary()
Correlation: cor(x, y) — Pearson r between −1 and +1
t-test: t.test(x, y) — compare two group means, check p-value
Linear regression: lm(y ~ x, data=df) then summary(model) for R², coefficients, and p-values
Simulation: rnorm(n, mean, sd), runif(n, min, max), set.seed() for reproducibility
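As a quick illustration of the descriptive and simulation functions listed above, here is a minimal sketch. The scores vector and the seed value are made up for illustration and are not part of the lab's datasets.

```r
# Hypothetical exam scores; the values are illustrative, not taken from the lab
scores <- c(62, 71, 75, 78, 80, 84, 91)

cat("Mean:    ", round(mean(scores), 2), "\n")
cat("Median:  ", median(scores), "\n")
cat("SD:      ", round(sd(scores), 2), "\n")
cat("Variance:", round(var(scores), 2), "\n")
print(summary(scores))  # min, quartiles, mean, max

# sd() is the square root of var(); both divide by n - 1
sd_matches_var <- isTRUE(all.equal(sd(scores)^2, var(scores)))
cat("sd^2 equals var:", sd_matches_var, "\n")

# Resetting the seed reproduces the same random draws (seed 10 is arbitrary)
set.seed(10)
draw_1 <- rnorm(3, mean = 0, sd = 1)
set.seed(10)
draw_2 <- rnorm(3, mean = 0, sd = 1)
cat("Same draws after re-seeding:", identical(draw_1, draw_2), "\n")
```

Calling set.seed() before each simulation is what makes the random data in the exercises repeatable from run to run.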

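A p-value below 0.05 is conventionally called “statistically significant”: data at least as extreme as what was observed would be unlikely if the null hypothesis were true. It is not the probability that the null hypothesis is true. R² is the proportion of the variance in y that the model explains; in a simple regression with one predictor x, it equals the squared Pearson correlation between x and y.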
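To make the link between correlation, R², and the p-value concrete, the sketch below fits a one-predictor regression on simulated data and compares the pieces. The seed, sample size, and coefficients are arbitrary assumptions chosen for illustration.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
x <- runif(40, 0, 10)
y <- 3 + 2 * x + rnorm(40, 0, 4)  # linear signal plus noise (made-up parameters)

model <- lm(y ~ x)
r     <- cor(x, y)
r2    <- summary(model)$r.squared

cat("Squared Pearson r:  ", round(r^2, 4), "\n")
cat("R-squared from lm():", round(r2, 4), "\n")

# With a single predictor, the slope's p-value matches the one from cor.test()
p_slope <- coef(summary(model))["x", "Pr(>|t|)"]
p_cor   <- cor.test(x, y)$p.value
cat("Slope p-value:      ", signif(p_slope, 3), "\n")
cat("cor.test() p-value: ", signif(p_cor, 3), "\n")
```

The equality between r² and R² holds only for simple regression; with several predictors, R² describes the whole model rather than any single pairwise correlation.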

Worked Example

Simulating data, computing correlation, and fitting a linear regression:

set.seed(42)
n <- 50
study_hours <- runif(n, 1, 8)
exam_score <- pmin(pmax(50 + 5 * study_hours + rnorm(n, 0, 8), 0), 100)
df <- data.frame(study_hours, exam_score)

cat("Correlation:", round(cor(study_hours, exam_score), 3), "\n")

model <- lm(exam_score ~ study_hours, data = df)
cat("Intercept:", round(coef(model)[1], 2), "\n")
cat("Slope:", round(coef(model)[2], 2), "\n")
cat("R-squared:", round(summary(model)$r.squared, 3), "\n")

Guided

Exercise 1 — Two-Group Comparison & t-Test

Run this complete group comparison. Study the t-test output and Cohen’s d calculation, then try changing the group means or SDs to see how the p-value responds.

set.seed(99)
group_a <- rnorm(30, mean = 75, sd = 10)  # Treatment group
group_b <- rnorm(30, mean = 70, sd = 10)  # Control group

cat("Group A: mean =", round(mean(group_a), 2), ", sd =", round(sd(group_a), 2), "\n")
cat("Group B: mean =", round(mean(group_b), 2), ", sd =", round(sd(group_b), 2), "\n")

# Independent samples t-test (t.test() defaults to Welch's test, which does not assume equal variances)
result <- t.test(group_a, group_b)
cat("\nt-test results:\n")
cat("  t =", round(result$statistic, 3), "\n")
cat("  df =", round(result$parameter, 1), "\n")
cat("  p-value =", round(result$p.value, 4), "\n")
cat("  Significant (p<0.05):", result$p.value < 0.05, "\n")

# Cohen's d effect size
pooled_sd <- sqrt((sd(group_a)^2 + sd(group_b)^2) / 2)
cohens_d  <- (mean(group_a) - mean(group_b)) / pooled_sd
cat("\nCohen's d =", round(cohens_d, 3), "\n")
cat("Effect size:", ifelse(abs(cohens_d) < 0.2, "negligible",
                   ifelse(abs(cohens_d) < 0.5, "small",
                   ifelse(abs(cohens_d) < 0.8, "medium", "large"))), "\n")

Hint: Cohen’s d measures practical significance (effect size), not just statistical significance. A small p-value with a tiny Cohen’s d means the difference is statistically detectable but may not matter in practice.
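The gap between statistical and practical significance is easiest to see with very large groups. The following sketch assumes two hypothetical groups of 50,000 observations whose true means differ by only half a point; these numbers are chosen for illustration and do not come from the exercise.

```r
set.seed(123)  # arbitrary seed
n_big <- 50000 # deliberately huge sample (an assumption for this illustration)
big_a <- rnorm(n_big, mean = 70.5, sd = 10)
big_b <- rnorm(n_big, mean = 70.0, sd = 10)

# Welch t-test: with this much data, even a tiny difference is detectable
p_big <- t.test(big_a, big_b)$p.value

# Cohen's d with the same equal-n pooled SD formula used in Exercise 1
pooled_sd_big <- sqrt((var(big_a) + var(big_b)) / 2)
d_big <- (mean(big_a) - mean(big_b)) / pooled_sd_big

cat("p-value:  ", signif(p_big, 3), "\n")
cat("Cohen's d:", round(d_big, 3), "\n")
cat("Significant but negligible effect:", p_big < 0.05 && abs(d_big) < 0.2, "\n")
```

The p-value shrinks as the sample grows, while Cohen's d stays near the true standardized difference, which is why both are worth reporting.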

Independent

Exercise 2 — Multiple Regression

Create a 100-student dataset with study_hours, sleep_hours, and exam_score. Compute a correlation matrix, fit a multiple regression, and interpret the coefficients in plain English comments.

set.seed(7)
n <- 100
study_hours <- runif(n, 2, 10)
sleep_hours <- runif(n, 4, 9)
exam_score  <- pmin(pmax(
  40 + 4 * study_hours + 2 * sleep_hours + rnorm(n, 0, 8),
  0), 100)
students <- data.frame(study_hours, sleep_hours, exam_score)

# 1. Correlation matrix
cat("Correlation matrix:\n")
print(round(cor(students), 3))

# 2. Multiple regression
model <- lm(exam_score ~ study_hours + sleep_hours, data = students)
cat("\nModel summary:\n")
print(summary(model))

# 3. Your plain-English interpretation (write comments):
# The study_hours coefficient is ___: holding sleep_hours constant, each additional study hour is associated with a ___-point change in score.
# The sleep_hours coefficient is ___: ...
# R-squared = ___: the model explains ___% of variance in exam scores.

Hint: cor(df) computes pairwise correlations between the columns of a data frame, all of which must be numeric. In summary(model), look at the Estimate column for coefficients and Pr(>|t|) for p-values. Stars (*) mark significance levels: more stars mean a smaller p-value.
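Rather than reading values off the printed summary, the coefficient table can also be pulled out as a matrix. This is a minimal sketch on freshly simulated data; unlike the exercise, the scores are not clamped to 0–100, and the seed and parameters are assumptions made for illustration.

```r
set.seed(11)  # arbitrary seed
n_obs <- 100
study <- runif(n_obs, 2, 10)
sleep <- runif(n_obs, 4, 9)
score <- 40 + 4 * study + 2 * sleep + rnorm(n_obs, 0, 8)  # made-up parameters

fit <- lm(score ~ study + sleep)

# coef(summary(fit)) is a matrix with one row per term
coef_table <- coef(summary(fit))
print(round(coef_table, 4))

# Index by row and column name to get a single number
study_est <- coef_table["study", "Estimate"]
study_p   <- coef_table["study", "Pr(>|t|)"]
cat("study estimate:", round(study_est, 2), " p-value:", signif(study_p, 3), "\n")

# 95% confidence intervals for every coefficient
print(round(confint(fit), 2))
```

Indexing by name keeps the code readable and avoids depending on the order of terms in the model formula.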

Challenge

Exercise 3 — Coin Flip Simulation & False Positives

Simulate coin flipping to understand false positive rates: (1) flip a fair coin 30 times using rbinom(30, 1, 0.5) and test for fairness with binom.test(), (2) repeat this 100 times, and (3) report the percentage of trials where p < 0.05 even though the coin is fair.

set.seed(42)

# Single trial: flip 30 times, test for fairness
flips <- rbinom(30, 1, 0.5)
result <- binom.test(sum(flips), 30, p = 0.5)
cat("Single trial: heads =", sum(flips), ", p-value =", round(result$p.value, 4), "\n")

# Simulation: repeat 100 times
n_trials <- 100
p_values <- numeric(n_trials)
for (i in 1:n_trials) {
  flips_i    <- rbinom(30, 1, 0.5)
  test_i     <- binom.test(sum(flips_i), 30, p = 0.5)
  p_values[i] <- test_i$p.value
}

false_positives <- sum(p_values < 0.05)
cat("\nOut of", n_trials, "trials with a FAIR coin:\n")
cat("  False positives (p<0.05):", false_positives, "\n")
cat("  False positive rate:", round(false_positives / n_trials * 100, 1), "%\n")
cat("\n(Expected: about 5% or slightly less at a 0.05 significance level)\n")

Hint: This demonstrates the Type I error rate. At a significance level of 0.05, about 5% of tests on truly null effects will appear “significant” by chance alone. This is why multiple testing correction matters!
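Two follow-ups to this hint can be sketched in a few lines: the exact false positive rate of binom.test() for 30 fair flips, and the effect of a multiple testing correction via p.adjust(). The seed and the choice of the Bonferroni and Benjamini-Hochberg methods are assumptions made for illustration.

```r
# Exact Type I error rate: sum the probability of every head count
# whose two-sided p-value falls below 0.05 when the coin is fair
heads      <- 0:30
p_by_count <- sapply(heads, function(k) binom.test(k, 30, p = 0.5)$p.value)
exact_rate <- sum(dbinom(heads, 30, 0.5)[p_by_count < 0.05])
cat("Exact false positive rate:", round(exact_rate, 4), "\n")

# Simulate 100 fair-coin tests, then correct the p-values
set.seed(5)  # arbitrary seed
p_values <- replicate(100, binom.test(sum(rbinom(30, 1, 0.5)), 30, p = 0.5)$p.value)

raw_hits  <- sum(p_values < 0.05)
bonf_hits <- sum(p.adjust(p_values, method = "bonferroni") < 0.05)
bh_hits   <- sum(p.adjust(p_values, method = "BH") < 0.05)

cat("Significant, uncorrected:", raw_hits, "\n")
cat("Significant, Bonferroni: ", bonf_hits, "\n")
cat("Significant, BH (FDR):   ", bh_hits, "\n")
```

Because head counts are discrete, the exact test cannot hit 5% precisely and ends up slightly conservative. Adjusted p-values are never smaller than the raw ones, so a correction can only reduce the number of "significant" results.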

Mini Project — Salary Analysis

Department Salary Study

Analyze a simulated salary dataset: compute overall and by-department stats, run a t-test between two departments, fit a regression of salary on years of experience, and write a plain-English summary using cat().

set.seed(2024)
n <- 60
salary_data <- data.frame(
  dept            = rep(c("Engineering","Marketing","HR"), each=20),
  years_exp       = round(runif(n, 1, 15)),
  salary          = c(
    round(rnorm(20, 95000, 12000)),
    round(rnorm(20, 72000, 9000)),
    round(rnorm(20, 65000, 8000))
  )
)

# 1. Overall summary stats
cat("=== Overall Salary Stats ===\n")
cat("Mean:", round(mean(salary_data$salary)), "\n")
cat("Median:", round(median(salary_data$salary)), "\n")
cat("SD:", round(sd(salary_data$salary)), "\n\n")

# 2. By-department breakdown
cat("=== By Department ===\n")
dept_stats <- tapply(salary_data$salary, salary_data$dept, function(x)
  c(mean=round(mean(x)), sd=round(sd(x)), n=length(x)))
for (dept in names(dept_stats)) {
  cat(dept, "- mean:", dept_stats[[dept]]["mean"],
      "  sd:", dept_stats[[dept]]["sd"], "\n")
}

# 3. t-test: Engineering vs Marketing
eng <- salary_data$salary[salary_data$dept == "Engineering"]
mkt <- salary_data$salary[salary_data$dept == "Marketing"]
ttest <- t.test(eng, mkt)
cat("\nt-test (Eng vs Marketing): p =", round(ttest$p.value, 4), "\n")

# 4. Regression: salary ~ years_exp (years_exp is simulated independently of salary here, so expect a weak fit)
model <- lm(salary ~ years_exp, data = salary_data)
cat("\nRegression salary ~ years_exp:\n")
cat("  Slope:", round(coef(model)[2]), "per year\n")
cat("  R-squared:", round(summary(model)$r.squared, 3), "\n")

# 5. Plain-English summary
cat("\n=== Summary ===\n")
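cat("The average salary across all departments is $",
    format(round(mean(salary_data$salary)), big.mark=","), ".\n", sep="")
# Report significance based on the actual p-value instead of assuming it
cat("The Engineering vs Marketing salary gap is ",
    ifelse(ttest$p.value < 0.05, "statistically significant",
           "not statistically significant"),
    " (p = ", format.pval(ttest$p.value, digits = 3), ").\n", sep="")
cat("Each additional year of experience is associated with a change of about $",
    round(coef(model)[2]), " in salary.\n", sep="")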

Hint: tapply(salary, dept, function(x) ...) applies a function to salary values grouped by department. The result is a list — access elements with dept_stats[["Engineering"]].
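The shape of tapply()'s result depends on what the function returns, which the sketch below illustrates on a small hypothetical data frame (the department labels and salaries are invented for this example). It also shows aggregate() as one alternative that returns a data frame.

```r
# Hypothetical toy data; values are illustrative, not the lab's simulated salaries
toy <- data.frame(
  dept   = c("Eng", "Eng", "HR", "HR", "Mkt", "Mkt"),
  salary = c(90000, 100000, 60000, 70000, 70000, 74000)
)

# A function returning one number per group gives a named numeric result
dept_means <- tapply(toy$salary, toy$dept, mean)
print(dept_means)

# A function returning a vector per group gives a list-like result,
# accessed with [[ ]] for the group and [ ] or [[ ]] for the statistic
dept_stats_toy <- tapply(toy$salary, toy$dept,
                         function(x) c(mean = mean(x), n = length(x)))
cat("HR mean:", dept_stats_toy[["HR"]][["mean"]],
    " n:", dept_stats_toy[["HR"]][["n"]], "\n")

# aggregate() returns the group means as a data frame instead
dept_agg <- aggregate(salary ~ dept, data = toy, FUN = mean)
print(dept_agg)
```

A data frame result is often easier to merge or plot later, while the tapply() form is compact for quick summaries.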

Lab 7 Complete!

You’ve run t-tests, fitted regression models, interpreted R² and p-values, and understood simulation-based false positive rates. These are the statistical foundations of evidence-based analysis.
