Descriptive Statistics in R
Summarizing data — the foundation of all analysis
📌 Before You Start
What you need: No prior R experience required. Familiarity with the idea of a "mean" or "average" is helpful but not required.
What you’ll learn: How to compute and interpret measures of center (mean, median, mode) and measures of spread (range, variance, standard deviation, IQR) using R. You’ll also practice choosing the right measure for skewed vs. symmetric data.
📖 The Concept: Descriptive Statistics
Descriptive statistics summarize and describe a dataset without making inferences beyond the data you have. They are typically the first step in an analysis.
Measures of center tell you where the "middle" of the data is:
- Mean — the arithmetic average. Sensitive to outliers.
- Median — the middle value when sorted. Robust to outliers.
- Mode — the most frequent value. Most useful for categorical data.
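As a quick illustration of the three measures of center, here is a minimal sketch in R. The scores are made up for this example, and stat_mode is a hypothetical helper rather than a built-in: base R's mode() reports a value's storage type, not the most frequent value.

```r
# Hypothetical scores, made up for illustration
scores <- c(70, 85, 85, 90, 100)

mean(scores)    # arithmetic average: 86
median(scores)  # middle value of the sorted data: 85

# stat_mode is a hypothetical helper: it tabulates the values and
# returns the most frequent one(s) as character labels.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

stat_mode(scores)                   # "85"
stat_mode(c("red", "blue", "red"))  # "red" (works for categorical data too)
```

If several values tie for most frequent, this helper returns all of them.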
Measures of spread tell you how variable the data is:
- Range — max minus min. Very sensitive to outliers.
- Variance (s²) — the average squared deviation from the mean (the sample version divides by n−1 rather than n).
- Standard deviation (s) — square root of variance. In the same units as your data.
- IQR — interquartile range (Q3 − Q1). Robust to outliers.
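The four measures of spread each map to a short base R call. This is a minimal sketch on made-up scores. Note that range() returns the minimum and maximum as a pair, so diff() is needed to get a single number.

```r
# Hypothetical scores, made up for illustration
scores <- c(70, 85, 85, 90, 100)

range(scores)        # returns min and max: 70 100
diff(range(scores))  # range as a single number: 30
var(scores)          # sample variance (divides by n - 1): 117.5
sd(scores)           # standard deviation, about 10.84
IQR(scores)          # Q3 - Q1 with R's default quantile method: 5
```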
Key insight: When data is skewed (asymmetric), the median and IQR are more representative than the mean and SD. When data is symmetric, the mean and SD work well.
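To see why robustness matters, the sketch below adds one extreme value to a small made-up dataset and compares the summaries before and after. The numbers are hypothetical and chosen only to make the effect obvious.

```r
# Hypothetical data: five typical values, then the same values plus one extreme
typical      <- c(70, 85, 85, 90, 100)
with_outlier <- c(typical, 400)

# Center: the mean jumps, the median barely moves
c(mean = mean(typical),      median = median(typical))       # 86 and 85
c(mean = mean(with_outlier), median = median(with_outlier))  # about 138.3 and 87.5

# Spread: the SD explodes, the IQR changes far less
c(sd = sd(typical),      IQR = IQR(typical))       # about 10.8 and 5
c(sd = sd(with_outlier), IQR = IQR(with_outlier))  # about 128.6 and 12.5
```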
🔢 The Formulas
Mean: sum all values, divide by count
Sample variance: sum of squared distances from the mean, divided by n−1 (not n)
Standard deviation: square root of variance (brings units back)
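Written out in symbols, the three descriptions above correspond to the standard textbook definitions below. The notation is an assumption for this sketch: x_1, …, x_n are the observations, n is the count, and x̄ is the sample mean.

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\qquad
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2
\qquad
s = \sqrt{s^2}
```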
💻 In R — Worked Example
Study this code. It shows how to compute every major descriptive statistic in R. Notice how little code it takes.
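The lesson's original code cell is not reproduced here, so the following is a minimal sketch of what such a worked example might look like. The exam vector is hypothetical data invented for illustration; every function used is part of base R.

```r
# Hypothetical exam scores, not the lesson's original data
exam <- c(72, 85, 90, 68, 77, 95, 88, 81, 79, 85)

# Measures of center
mean(exam)                     # 82
median(exam)                   # 83
names(which.max(table(exam)))  # most frequent value (first one if tied): "85"

# Measures of spread
diff(range(exam))  # max minus min: 27
var(exam)          # sample variance (n - 1 in the denominator)
sd(exam)           # standard deviation, same units as the scores
IQR(exam)          # interquartile range, Q3 - Q1

# Min, quartiles, median, mean, and max in one call
summary(exam)
```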
🖐️ Your Turn
Exercise 1 — Which measure fits best?
You have 8 test scores. One student scored 55 — much lower than the rest. Calculate all the descriptive stats and decide: is the mean or median a better measure of center here, and why?
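One possible way to set this up is sketched below. The exercise's actual scores are not shown here, so the vector is hypothetical: seven made-up scores between 85 and 94, plus the low score of 55.

```r
# Hypothetical scores: only the 55 comes from the exercise text
scores <- c(88, 92, 85, 90, 55, 91, 87, 94)

mean(scores)    # 85.25, pulled down by the single low score
median(scores)  # 89, close to what most students scored
sd(scores)      # inflated by the low score
IQR(scores)     # less affected by it

# How much does the one low score move the mean?
mean(scores[scores != 55])  # about 89.57
```

With these numbers the mean sits below seven of the eight scores, which is a hint about which measure describes a "typical" student better.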
Exercise 2 — Visualize with a histogram
Create a histogram of the exam scores and add vertical lines marking the mean (blue) and median (red). When the lines are far apart, data is skewed. When they are close, data is roughly symmetric.
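A minimal base R sketch of this plot follows. The scores are the same hypothetical values as before, and the title, fill color, and line styles are arbitrary choices.

```r
# Hypothetical exam scores, made up for illustration
scores <- c(88, 92, 85, 90, 55, 91, 87, 94)

# Draw the histogram; hist() also returns the bin counts invisibly
h <- hist(scores, main = "Exam scores", xlab = "Score", col = "lightgray")

# Vertical reference lines: mean in blue, median in red (dashed)
abline(v = mean(scores),   col = "blue", lwd = 2)
abline(v = median(scores), col = "red",  lwd = 2, lty = 2)

legend("topleft", legend = c("Mean", "Median"),
       col = c("blue", "red"), lwd = 2, lty = c(1, 2))
```

Here the blue line falls to the left of the red one, because the single low score drags the mean down.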
Exercise 3 — Symmetric vs. Skewed data
Simulate two datasets: one symmetric (normal distribution) and one right-skewed (exponential distribution). Compare mean vs. median for each. Observe how the gap between them signals skewness.
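One way to run this comparison is sketched below. The sample size, seed, and distribution parameters are arbitrary assumptions, and gap is a hypothetical helper defined only for this example.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible

# Symmetric: normal distribution centered at 50
symmetric <- rnorm(1000, mean = 50, sd = 10)

# Right-skewed: exponential distribution with mean 50 (rate = 1 / mean)
skewed <- rexp(1000, rate = 1 / 50)

# Hypothetical helper: distance between mean and median
gap <- function(x) mean(x) - median(x)

c(mean = mean(symmetric), median = median(symmetric), gap = gap(symmetric))
c(mean = mean(skewed),    median = median(skewed),    gap = gap(skewed))
```

For the normal sample the gap should be close to zero. For the exponential sample the mean should land well above the median, since the long right tail pulls it up.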
🧠 Brain Break
You’ve just computed the building blocks of all statistical analysis. Take a moment.
Quick check: If a dataset has mean = 80 and median = 65, which direction is it skewed? (Answer: right-skewed. The mean is pulled above the median by a long tail of high values.)
✅ Key Takeaway
Mean describes center for symmetric data. Median is more robust when data is skewed or has outliers. Standard deviation measures spread for symmetric data; use IQR for skewed data.
🏆 Module 1 Complete!
You can now summarize a dataset in R: compute the main descriptive statistics and choose the right ones for the data’s shape. This foundation supports everything that follows.