Sampling & The Central Limit Theorem

Why statistics works — the most important theorem

← Module 2: Probability Distributions Module 3 of 8 Module 4: Confidence Intervals →

Loading R... (first load takes ~15 seconds)

Before You Start

What you need: Modules 1 and 2 completed, or familiarity with mean, standard deviation, and the normal distribution.

What you’ll learn: The difference between a population and a sample. How to simulate sampling in R. The Central Limit Theorem — why it matters, and watching it happen live. Standard error and what it measures.

The Concept: Populations, Samples, and the CLT

Population vs. Sample:A population is every case you care about. A sample is the subset you actually measure. The sample mean (x̄) estimates the population mean (μ).

The Central Limit Theorem (CLT) says: regardless of the population’s shape, the distribution of sample means approaches a normal distribution as sample size n increases. This is why the normal distribution shows up everywhere in inferential statistics — it describes the behavior of sample means, even when the original data is not normal.

CLT in plain English: Take many samples from any population. Calculate the mean of each sample. Plot all those sample means. That plot will look normal — even if the original population is skewed, bimodal, or uniform.

Standard Error (SE) measures how much the sample mean varies from sample to sample: SE = σ / √n. Larger n = smaller SE = more precise estimates.

Key Formula: Standard Error

SE = σ / √n

σ = population standard deviation | n = sample size

x̄ ~ N(μ, SE²) as n → ∞

The distribution of sample means is normal with mean μ and SE as its standard deviation

In R — Worked Example (read-only)

We start with a skewed population (exponential distribution), take 1000 samples of n=30, and watch the sample means become normally distributed.

# Demonstrating the Central Limit Theorem set.seed(123) # Start with a SKEWED population (exponential distribution) population <- rexp(10000, rate=0.5) cat("Population mean:", round(mean(population), 3), "\n") cat("Population shape: skewed right\n\n") # Take 1000 samples of size n=30, record each sample mean sample_means_30 <- replicate(1000, mean(sample(population, 30))) cat("Distribution of sample means (n=30):\n") cat("Mean of sample means:", round(mean(sample_means_30), 3), "\n") cat("SE of sample means: ", round(sd(sample_means_30), 3), "\n") cat("Theoretical SE: ", round(sd(population)/sqrt(30), 3), "\n")

Your Turn

Exercise 1 — CLT Simulation: Different Sample Sizes

Repeat the CLT simulation with n = 5, n = 30, and n = 100. For each, collect 1000 sample means. Notice how the distribution becomes more normal and tighter as n increases.

set.seed(42)
# Skewed population
population <- rexp(10000, rate = 0.5)
true_mean <- mean(population)
true_se_fn <- function(n) sd(population) / sqrt(n)

# Collect 1000 sample means for each sample size
means_n5   <- replicate(1000, mean(sample(population, 5)))
means_n30  <- replicate(1000, mean(sample(population, 30)))
means_n100 <- replicate(1000, mean(sample(population, 100)))

cat("True population mean:", round(true_mean, 3), "\n\n")

cat("n = 5:\n")
cat("  Mean of sample means:", round(mean(means_n5), 3), "\n")
cat("  SD of sample means:  ", round(sd(means_n5), 3),
    " (theoretical SE:", round(true_se_fn(5), 3), ")\n\n")

cat("n = 30:\n")
cat("  Mean of sample means:", round(mean(means_n30), 3), "\n")
cat("  SD of sample means:  ", round(sd(means_n30), 3),
    " (theoretical SE:", round(true_se_fn(30), 3), ")\n\n")

cat("n = 100:\n")
cat("  Mean of sample means:", round(mean(means_n100), 3), "\n")
cat("  SD of sample means:  ", round(sd(means_n100), 3),
    " (theoretical SE:", round(true_se_fn(100), 3), ")\n\n")

cat("Key result: As n increases, SE decreases and distributions become more normal!")

Output will appear here...

What to notice: All three "mean of sample means" are close to the true population mean. But the SD decreases as n grows — larger samples give more precise estimates.

Exercise 2 — Visualize the CLT

Plot three histograms of sample means for n = 5, 30, 100. Watch the distribution transform from skewed to approximately normal as n increases. This is the CLT in action.

set.seed(42)
population <- rexp(10000, rate = 0.5)
true_mean  <- mean(population)

means_n5   <- replicate(1000, mean(sample(population, 5)))
means_n30  <- replicate(1000, mean(sample(population, 30)))
means_n100 <- replicate(1000, mean(sample(population, 100)))

# Plot all three side by side
par(mfrow = c(1, 3))

hist(means_n5,
     main = "Sample Means\nn = 5",
     xlab = "Sample Mean", col = "#FFCDD2", border = "white",
     breaks = 30, freq = FALSE)
abline(v = true_mean, col = "#C62828", lwd = 2)

hist(means_n30,
     main = "Sample Means\nn = 30",
     xlab = "Sample Mean", col = "#B2EBF2", border = "white",
     breaks = 30, freq = FALSE)
abline(v = true_mean, col = "#00695C", lwd = 2)

hist(means_n100,
     main = "Sample Means\nn = 100",
     xlab = "Sample Mean", col = "#B2DFDB", border = "white",
     breaks = 30, freq = FALSE)
abline(v = true_mean, col = "#004D40", lwd = 2)

par(mfrow = c(1, 1))

Output will appear here...

What to see: The n=5 histogram is right-skewed (reflecting the skewed population). The n=30 and n=100 histograms look increasingly bell-shaped — the CLT at work!

Exercise 3 — Standard Error in Practice

A school claims their students’ average test score is 75. You survey a random sample of 36 students and get a mean of 71. The population SD is 12. How unusual is this result if the school’s claim is true?

# Known values
mu_claimed <- 75     # school's claimed population mean
n <- 36              # sample size
x_bar <- 71          # our sample mean
sigma <- 12          # known population SD

# 1. Standard error of the mean
SE <- sigma / sqrt(n)
cat("Standard Error:", round(SE, 3), "\n")
cat("(How much sample means vary from sample to sample)\n\n")

# 2. Z-score: how many SEs is our sample mean from the claimed mean?
z <- (x_bar - mu_claimed) / SE
cat("Z-score:", round(z, 3), "\n")
cat("Our sample mean is", round(abs(z), 2), "standard errors below the claim.\n\n")

# 3. P-value: how likely is a sample mean this low if the claim is true?
p_value <- pnorm(z)  # P(X_bar <= 71) when true mean = 75
cat("P(sample mean <= 71 | true mean = 75):", round(p_value, 4), "\n\n")

# Interpret
if (p_value < 0.05) {
  cat("This result is statistically unusual (p < 0.05).\n")
  cat("It casts doubt on the school's claim.\n")
} else {
  cat("This result could reasonably occur by chance.\n")
  cat("Not enough evidence to challenge the school's claim.\n")
}

Output will appear here...

This is a preview of hypothesis testing (Module 5). The z-score and p-value are your tools for deciding when a result is "surprising enough" to be statistically meaningful.

Brain Break

The CLT is the mathematical reason statistical inference works at all. It lets us use the normal distribution even when our data isn’t normal.

Think about it: If you quadruple your sample size (n → 4n), what happens to the standard error? It gets cut in half! (SE = σ/√n, so √4n = 2√n.) More data = more precision.

Key Takeaway

The CLT guarantees that sample means are approximately normally distributed for large enough n, regardless of the population shape. This is the foundation of all hypothesis testing and confidence intervals. Standard error SE = σ/√n tells you the precision of your sample mean.

Module 3 Complete!

You’ve witnessed the most important theorem in statistics — live in R. Now we’ll use the CLT to build confidence intervals and quantify our uncertainty.

Continue to Module 4: Confidence Intervals →

← Module 2: Probability Distributions Module 3 of 8 Module 4: Confidence Intervals →