Sampling & The Central Limit Theorem
Why statistics works — the most important theorem
📌 Before You Start
What you need: Modules 1 and 2 completed, or familiarity with mean, standard deviation, and the normal distribution.
What you’ll learn: The difference between a population and a sample. How to simulate sampling in R. The Central Limit Theorem: why it matters, and how to watch it happen live. Standard error and what it measures.
📖 The Concept: Populations, Samples, and the CLT
Population vs. Sample: A population is every case you care about. A sample is the subset you actually measure. The sample mean (x̄) estimates the population mean (μ).
The Central Limit Theorem (CLT) says: regardless of the population’s shape, the distribution of sample means approaches a normal distribution as sample size n increases. This is why the normal distribution shows up everywhere in inferential statistics — it describes the behavior of sample means, even when the original data is not normal.
CLT in plain English: Take many samples from any population. Calculate the mean of each sample. Plot all those sample means. That plot will look normal — even if the original population is skewed, bimodal, or uniform.
Standard Error (SE) measures how much the sample mean varies from sample to sample: SE = σ / √n. Larger n = smaller SE = more precise estimates.
🔢 Key Formula: Standard Error
SE = σ / √n
σ = population standard deviation | n = sample size
The distribution of sample means is approximately normal, with mean μ and SE as its standard deviation.
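To make the formula concrete, here is a quick check in R (the values σ = 12 and n = 36 are illustrative, chosen to match Exercise 3 below):

```r
# Standard error: SE = sigma / sqrt(n)
sigma <- 12               # population standard deviation (illustrative)
n     <- 36               # sample size (illustrative)
se    <- sigma / sqrt(n)  # 12 / 6 = 2
se
```

Note that SE shrinks with √n, not n: to halve the SE you need four times the data.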
💻 In R — Worked Example (read-only)
We start with a skewed population (exponential distribution), take 1000 samples of n=30, and watch the sample means become normally distributed.
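The interactive code block is not reproduced here; below is a minimal base-R sketch of the simulation just described. The rate parameter (1) and the seed are illustrative choices, not part of the original example.

```r
set.seed(42)  # illustrative seed, for reproducibility

# Skewed population: exponential with rate 1 (mean = 1, sd = 1)
# Take 1000 samples of n = 30 and record each sample's mean
sample_means <- replicate(1000, mean(rexp(30, rate = 1)))

# The raw data is strongly right-skewed...
hist(rexp(1000, rate = 1), main = "Skewed population (exponential)")
# ...but the sample means look approximately normal
hist(sample_means, main = "1000 sample means (n = 30)")

mean(sample_means)  # close to the population mean, 1
sd(sample_means)    # close to the theoretical SE, 1 / sqrt(30)
</code>
```

The observed standard deviation of the sample means should land near the theoretical SE = σ/√n ≈ 0.183, which is the CLT at work.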
🖐️ Your Turn
Exercise 1 — CLT Simulation: Different Sample Sizes
Repeat the CLT simulation with n = 5, n = 30, and n = 100. For each, collect 1000 sample means. Notice how the distribution becomes more normal and tighter as n increases.
Exercise 2 — Visualize the CLT
Plot three histograms of sample means for n = 5, 30, 100. Watch the distribution transform from skewed to approximately normal as n increases. This is the CLT in action.
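If you get stuck on Exercises 1 and 2, here is one possible starting sketch (the exponential population and the loop structure are suggestions, not the only valid approach):

```r
set.seed(1)            # illustrative seed
par(mfrow = c(1, 3))   # three histograms side by side
for (n in c(5, 30, 100)) {
  # 1000 sample means from a skewed (exponential) population
  means <- replicate(1000, mean(rexp(n, rate = 1)))
  hist(means, main = paste("n =", n), xlab = "Sample mean")
}
```

As n grows, each histogram should look both more symmetric and narrower, since SE = σ/√n shrinks.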
Exercise 3 — Standard Error in Practice
A school claims their students’ average test score is 75. You survey a random sample of 36 students and get a mean of 71. The population SD is 12. How unusual is this result if the school’s claim is true?
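One way to check your answer once you have worked it out by hand — the calculation follows directly from the SE formula:

```r
# If the claim (mu = 75) is true, means of samples of n = 36 have
# SE = sigma / sqrt(n) = 12 / 6 = 2
mu <- 75; sigma <- 12; n <- 36; xbar <- 71
se <- sigma / sqrt(n)
z  <- (xbar - mu) / se   # (71 - 75) / 2 = -2
pnorm(z)                 # P(sample mean <= 71 | claim true), about 0.023
```

A sample mean two standard errors below the claimed value would be quite unusual if the claim were true.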
🧠 Brain Break
The CLT is the mathematical reason statistical inference works at all. It lets us use the normal distribution even when our data isn’t normal.
Think about it: If you quadruple your sample size (n → 4n), what happens to the standard error? It gets cut in half! (SE = σ/√n, so √4n = 2√n.) More data = more precision.
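A quick numeric check of the quadrupling claim (σ = 10 and n = 25 are arbitrary illustrative values):

```r
sigma <- 10; n <- 25
se_n  <- sigma / sqrt(n)      # 10 / 5  = 2
se_4n <- sigma / sqrt(4 * n)  # 10 / 10 = 1
se_4n / se_n                  # 0.5: quadrupling n halves the SE
```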
✅ Key Takeaway
The CLT guarantees that sample means are approximately normally distributed for large enough n, regardless of the population shape. This is the foundation of all hypothesis testing and confidence intervals. Standard error SE = σ/√n tells you the precision of your sample mean.
🏆 Module 3 Complete!
You’ve witnessed the most important theorem in statistics — live in R. Now we’ll use the CLT to build confidence intervals and quantify our uncertainty.