Hypothesis Testing

Testing claims with data — the core of inferential statistics

← Module 4: Confidence Intervals Module 5 of 8 Module 6: Correlation & Regression →


Before You Start

What you need: Modules 3 (CLT, standard error) and 4 (confidence intervals) completed. Understanding of p-values at a basic conceptual level is helpful.

What you’ll learn: The logic of null hypothesis significance testing (H₀ vs. Hₐ). What p-values actually mean. How to run one-sample and two-sample t-tests in R. Type I and Type II errors. Why statistical significance doesn’t equal practical significance.

The Concept: Hypothesis Testing

Hypothesis testing gives us a formal framework for asking: "Is this result surprising enough to be convincing evidence against the default assumption?"

H₀ (Null Hypothesis): the default claim. "No effect, no difference, status quo." We assume H₀ is true and see whether the data contradict it.
Hₐ (Alternative Hypothesis): what we’re testing for. "There IS an effect or difference."
p-value: P(observing results this extreme, or more extreme, IF H₀ is true). A small p-value means the result would be surprising under H₀.
α (significance level): our threshold, usually 0.05. If p < α, we reject H₀; otherwise we fail to reject it.

Decision	H₀ Is True	H₀ Is False
Reject H₀	Type I Error (α): false positive	Correct decision
Fail to Reject H₀	Correct decision	Type II Error (β): false negative

The t-statistic

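t = (x̄ − μ₀) / (s / √n)

x̄ = sample mean | μ₀ = hypothesized mean | s = sample SD | n = sample size

Reject H₀ if |t| > t* or, equivalently, if the p-value < α, where t* is the two-tailed critical value of the t distribution with n − 1 degrees of freedom.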

In R: t.test(data, mu = mu0) for a one-sample test, where mu0 holds the hypothesized mean μ₀; t.test(x, y) for a two-sample test.
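To see how the formula and the R function line up, here is a minimal sketch that computes the one-sample t-statistic, the critical value, and the two-tailed p-value by hand and compares them with t.test(). The data vector x and the null value mu0 are made up for illustration; they do not come from any exercise in this module.

```r
# Hypothetical sample and null value, chosen only for illustration
x   <- c(71.2, 68.5, 74.1, 77.3, 69.8, 72.6, 75.0, 70.4)
mu0 <- 70
alpha <- 0.05

n      <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))  # t = (x-bar - mu0) / (s / sqrt(n))
df     <- n - 1                                # degrees of freedom
t_crit <- qt(1 - alpha / 2, df)                # two-tailed critical value t*
p_val  <- 2 * pt(-abs(t_stat), df)             # two-tailed p-value

# The built-in test should agree with the hand calculation
builtin <- t.test(x, mu = mu0)

cat("By hand:  t =", round(t_stat, 4), " p =", round(p_val, 4), "\n")
cat("t.test(): t =", round(unname(builtin$statistic), 4),
    " p =", round(builtin$p.value, 4), "\n")
cat("Critical value t*:", round(t_crit, 4), "\n")
```

The two decision rules always agree: |t| exceeds t* exactly when the p-value falls below α.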

In R — Worked Example (read-only)

A one-sample t-test asking whether a population mean differs from 70. R reports the t-statistic, p-value, and confidence interval all at once.

# One-sample t-test
# H0: population mean = 70
# Ha: population mean ≠ 70 (two-tailed)
set.seed(42)
sample_data <- rnorm(30, mean = 73, sd = 10)
result <- t.test(sample_data, mu = 70)

cat("=== One-Sample t-test ===\n")
cat("t-statistic:", round(result$statistic, 3), "\n")
cat("p-value: ", round(result$p.value, 4), "\n")
cat("95% CI: (", round(result$conf.int[1], 2), ",", round(result$conf.int[2], 2), ")\n")
cat("\nConclusion:", ifelse(result$p.value < 0.05, "Reject H0 (p < 0.05)", "Fail to reject H0 (p >= 0.05)"), "\n")

Your Turn

Exercise 1 — One-Sample t-test: Coffee Shop

A coffee shop claims their drinks are 12 oz on average. A consumer group samples 20 drinks and finds mean = 11.6 oz, SD = 0.8 oz. Test H₀: μ = 12 at α = 0.05. Report your conclusion.

# Simulate data consistent with the sample statistics
# (mean ≈ 11.6, sd ≈ 0.8, n = 20)
set.seed(101)
drinks <- rnorm(20, mean = 11.6, sd = 0.8)

# One-sample t-test: H0: mu = 12
result <- t.test(drinks, mu = 12)

cat("=== Coffee Shop Drink Volume Test ===\n")
cat("H0: Average drink = 12 oz\n")
cat("Ha: Average drink ≠ 12 oz\n\n")
cat("Sample mean:", round(mean(drinks), 3), "oz\n")
cat("Sample SD:  ", round(sd(drinks), 3), "oz\n\n")
cat("t-statistic:", round(result$statistic, 3), "\n")
cat("df:         ", result$parameter, "\n")
cat("p-value:    ", round(result$p.value, 4), "\n")
cat("95% CI:     (", round(result$conf.int[1], 3),
    ",", round(result$conf.int[2], 3), ")\n\n")

alpha <- 0.05
if (result$p.value < alpha) {
  cat("Decision: REJECT H0 (p =", round(result$p.value, 4), "< 0.05)\n")
  cat("Conclusion: There IS statistically significant evidence that\n")
  cat("the drinks are NOT 12 oz on average.\n")
} else {
  cat("Decision: FAIL TO REJECT H0 (p =", round(result$p.value, 4), ">= 0.05)\n")
  cat("Conclusion: Insufficient evidence to challenge the 12 oz claim.\n")
}


Notice: Rejecting H₀ means the data would be surprising IF the claim were true. It doesn’t prove with certainty that the claim is false; it says the evidence is strong enough to be skeptical.

Exercise 2 — Two-Sample t-test: Teaching Methods

Two teaching methods are compared. Method A (n=25) averages 82 points. Method B (n=25) averages 78 points. Is the difference statistically significant at α = 0.05?

set.seed(55)
# Generate two groups with the given means
# (We'll use sd=10 for both — a reasonable assumption)
group_a <- rnorm(25, mean = 82, sd = 10)
group_b <- rnorm(25, mean = 78, sd = 10)

cat("=== Two Teaching Methods: Score Comparison ===\n\n")
cat("Method A: mean =", round(mean(group_a), 2),
    " | SD =", round(sd(group_a), 2), "\n")
cat("Method B: mean =", round(mean(group_b), 2),
    " | SD =", round(sd(group_b), 2), "\n\n")

# Two-sample t-test (Welch's by default — doesn't assume equal variances)
result <- t.test(group_a, group_b)

cat("H0: mu_A = mu_B (no difference between methods)\n")
cat("Ha: mu_A ≠ mu_B (methods differ)\n\n")
cat("t-statistic:", round(result$statistic, 3), "\n")
cat("p-value:    ", round(result$p.value, 4), "\n")
cat("95% CI for difference: (",
    round(result$conf.int[1], 2), ",",
    round(result$conf.int[2], 2), ")\n\n")

if (result$p.value < 0.05) {
  cat("Decision: REJECT H0 — Methods are significantly different.\n")
} else {
  cat("Decision: FAIL TO REJECT H0 — Insufficient evidence of a difference.\n")
}
cat("\nNote: With 4-point difference and SD~10, a sample of 25 may not\n")
cat("have enough power to detect this difference reliably.\n")


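Run it a few times: Delete set.seed(55) and rerun the code several times. The result may flip between significant and not significant. A small sample combined with a small effect gives inconsistent results, and that is what low statistical power looks like: power is the probability of rejecting H₀ when it is actually false (1 − β).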
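To put a rough number on the power of this design, base R provides power.t.test(). The sketch below assumes the same values used to simulate Exercise 2 (a true difference of 4 points, SD of 10 in both groups, 25 students per group). With real data the true difference and SD are unknown, so treat the output as a planning estimate. Note also that power.t.test() assumes equal variances, while t.test() defaults to Welch's test; with equal group sizes and equal SDs the two are very close.

```r
# Power of the Exercise 2 design, assuming the true difference really is
# 4 points and both groups have SD = 10 (the values used in the simulation)
design <- power.t.test(n = 25, delta = 4, sd = 10, sig.level = 0.05,
                       type = "two.sample", alternative = "two.sided")
cat("Power with n = 25 per group:", round(design$power, 2), "\n")

# Sample size per group needed to reach 80% power under the same assumptions
needed <- power.t.test(power = 0.80, delta = 4, sd = 10, sig.level = 0.05,
                       type = "two.sample", alternative = "two.sided")
cat("n per group for 80% power:  ", ceiling(needed$n), "\n")
```

Under these assumptions the power with 25 per group comes out well below the conventional 80% target, which is why reruns without a fixed seed disagree with each other so often.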

Exercise 3 — Type I Error Simulation

Run 200 t-tests where H₀ is TRUE (both groups sampled from the same distribution) and count how many return p < 0.05. Expect about 10 (5% of 200); every one of them is a false positive.

set.seed(2025)
n_tests <- 200
p_values <- numeric(n_tests)

for (i in 1:n_tests) {
  # Both groups from SAME distribution — H0 is TRUE
  group1 <- rnorm(30, mean = 50, sd = 10)
  group2 <- rnorm(30, mean = 50, sd = 10)
  p_values[i] <- t.test(group1, group2)$p.value
}

# Count false positives
false_positives <- sum(p_values < 0.05)
cat("=== Type I Error Simulation ===\n\n")
cat("Number of tests run:", n_tests, "\n")
cat("H0 is TRUE in all tests (same population)\n\n")
cat("False positives (p < 0.05):", false_positives, "\n")
cat("False positive rate:", round(false_positives / n_tests * 100, 1), "%\n")
cat("Expected rate: ~5% (that's what alpha = 0.05 means!)\n\n")

# Histogram of all p-values
hist(p_values,
     main = "Distribution of p-values when H0 is True",
     xlab = "p-value", col = "#B2DFDB", border = "white",
     breaks = 20, freq = TRUE)
abline(v = 0.05, col = "#C62828", lwd = 2, lty = 2)
text(0.05, par("usr")[4] * 0.9, " alpha = 0.05",
     col = "#C62828", font = 2, adj = 0)


Key insight: When H₀ is true and the test’s assumptions hold, p-values are approximately uniformly distributed between 0 and 1, so about 5% fall below 0.05 by pure chance. This is the Type I error rate: the price you pay for using α = 0.05.
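The simulated count will rarely be exactly 10. As a rough guide to how far it can drift, the sketch below treats the number of false positives as a binomial count, assuming the 200 tests are independent and each rejects with probability exactly α = 0.05. The variable names are illustrative.

```r
n_tests <- 200
alpha   <- 0.05

expected <- n_tests * alpha                      # average number of false positives
spread   <- sqrt(n_tests * alpha * (1 - alpha))  # binomial standard deviation
typical  <- qbinom(c(0.025, 0.975), size = n_tests, prob = alpha)

cat("Expected false positives:", expected, "\n")
cat("Standard deviation:      ", round(spread, 2), "\n")
cat("Central 95% range:       ", typical[1], "to", typical[2], "\n")
```

A count anywhere inside that range is consistent with a true 5% error rate, so a single run that gives 7 or 13 false positives is no cause for alarm.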

Brain Break

The p-value is one of the most misunderstood concepts in science. It does NOT measure the probability that H₀ is true.

Remember: p-value = P(data this extreme or more extreme | H₀ true). A small p-value means "if the null were true, this result would be surprising." That’s evidence against H₀, not proof.

Key Takeaway

p-value < 0.05 means a result this extreme is unlikely under H₀. It does not mean that H₀ is definitely false, and it does not mean that the effect is large or important. Statistical significance ≠ practical significance. Always report effect sizes and confidence intervals alongside p-values.
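A quick simulation can make the gap between statistical and practical significance concrete. The sketch below uses invented numbers (two groups of 100,000 with a true difference of 0.5 points on a scale with SD 10) and computes Cohen's d, one common effect size, as the mean difference divided by the pooled SD. The group names and values are assumptions for illustration only.

```r
set.seed(7)
n <- 100000

# Hypothetical groups: a true difference of only 0.5 points, SD = 10
control   <- rnorm(n, mean = 100.0, sd = 10)
treatment <- rnorm(n, mean = 100.5, sd = 10)

result <- t.test(treatment, control)

# Cohen's d: mean difference expressed in units of the pooled SD
pooled_sd <- sqrt((var(treatment) + var(control)) / 2)
cohens_d  <- (mean(treatment) - mean(control)) / pooled_sd

cat("p-value:  ", signif(result$p.value, 3), "\n")
cat("95% CI:    (", round(result$conf.int[1], 2), ",",
    round(result$conf.int[2], 2), ")\n")
cat("Cohen's d:", round(cohens_d, 3), "\n")
```

With a sample this large the p-value is tiny even though the effect is a small fraction of one standard deviation. The confidence interval and effect size show how small the difference is; the p-value alone does not.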

Module 5 Complete!

You now understand hypothesis testing — the foundation of scientific inference. You can run t-tests in R, interpret p-values correctly, and understand what false positives mean. Next: relationships between variables.

Continue to Module 6: Correlation & Simple Regression →
