Descriptive Statistics in R

Summarizing data — the foundation of all analysis

← Course Home Module 1 of 8 Module 2: Probability Distributions →


Before You Start

What you need: No prior R experience required. Familiarity with the idea of a "mean" or "average" is helpful but optional.

What you’ll learn: How to compute and interpret measures of center (mean, median, mode) and measures of spread (range, variance, standard deviation, IQR) using R. You’ll also practice choosing the right measure for skewed vs. symmetric data.

The Concept: Descriptive Statistics

Descriptive statistics summarize and describe a dataset without making inferences beyond the data you have. They are the usual first step in any analysis.

Measures of center tell you where the "middle" of the data is:

Mean — the arithmetic average. Sensitive to outliers.
Median — the middle value when sorted. Robust to outliers.
Mode — the most frequent value. Most useful for categorical data.
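
Base R's mode() reports an object's storage type rather than its most frequent value, so the statistical mode has to be built from a frequency table. The following is a minimal sketch, assuming a hypothetical helper named stat_mode and made-up shirt-size data; neither comes from the course material.

```r
# Hypothetical helper (not part of base R): returns every value tied for most frequent.
stat_mode <- function(x) {
  counts <- table(x)                      # frequency of each distinct value
  names(counts)[counts == max(counts)]    # all values sharing the top count
}

# Made-up categorical data for illustration
sizes <- c("M", "L", "S", "M", "XL", "M", "L")
stat_mode(sizes)

# Ties are reported together instead of silently keeping only one value
stat_mode(c("a", "b", "a", "b", "c"))
```

Returning all tied values is a design choice: it makes it visible when a dataset has no single mode. Note that the result is a character vector, so numeric input would need as.numeric() applied afterwards.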

Measures of spread tell you how variable the data is:

Range — max minus min. Very sensitive to outliers.
Variance (s²) — average squared deviation from the mean.
Standard deviation (s) — square root of variance. In the same units as your data.
IQR — interquartile range (Q3 − Q1). Robust to outliers.
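
To see how differently these measures react to an extreme value, the sketch below computes all four on a small made-up vector, with and without one added outlier. The data and the helper name spread are illustrative assumptions.

```r
# Made-up data: the same values with and without one extreme observation
clean   <- c(48, 50, 51, 52, 53, 55, 57)
extreme <- c(clean, 150)

# Hypothetical helper: collect the four spread measures in one named vector
spread <- function(x) {
  c(range = diff(range(x)),   # max minus min
    variance = var(x),        # sample variance (n - 1 denominator)
    sd = sd(x),               # square root of the variance
    iqr = IQR(x))             # Q3 - Q1
}

# One row per dataset, one column per measure
round(rbind(clean = spread(clean), extreme = spread(extreme)), 2)
```

In the resulting table, the range, variance, and standard deviation grow sharply once 150 is added, while the IQR changes only slightly.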

Key insight: When data is skewed (asymmetric), the median and IQR are more representative than the mean and SD. When data is symmetric, the mean and SD work well.

The Formulas

x̄ = Σx / n

Mean: sum all values, divide by count

s² = Σ(x − x̄)² / (n−1)

Sample variance: average squared distance from the mean (note: n−1, not n)

s = √s²

Standard deviation: square root of variance (brings units back)
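
The three formulas can be checked by hand against R's built-in functions. This is a small sketch on made-up values; the variable names x_bar, s2, and s are chosen here to mirror the symbols above.

```r
x <- c(4, 8, 6, 5, 7)   # made-up values

n     <- length(x)
x_bar <- sum(x) / n                      # mean: sum of values divided by count
s2    <- sum((x - x_bar)^2) / (n - 1)    # sample variance: note the n - 1 denominator
s     <- sqrt(s2)                        # standard deviation: square root of variance

cat("Mean:    ", x_bar, "\n")
cat("Variance:", s2, "\n")
cat("SD:      ", round(s, 3), "\n")

# The built-ins use the same n - 1 definition, so these should all be TRUE
all.equal(x_bar, mean(x))
all.equal(s2, var(x))
all.equal(s, sd(x))

# Dividing by n instead gives a smaller value than var(x)
sum((x - x_bar)^2) / n
```

For these five values the squared deviations sum to 10, so dividing by n − 1 = 4 gives 2.5, while dividing by n = 5 would give 2.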

In R — Worked Example (read-only)

Study this code. It shows how to compute every major descriptive statistic in R. Notice how little code it takes.

# Descriptive statistics in R
scores <- c(72, 85, 91, 68, 77, 84, 90, 73, 88, 95, 62, 79)

cat("=== Measures of Center ===\n")
cat("Mean:  ", round(mean(scores), 2), "\n")
cat("Median:", median(scores), "\n")

# Mode: R has no built-in function for the statistical mode
# (mode() reports the storage type), so take the top entry of a frequency table.
# When several values tie for most frequent (here every score appears once),
# this returns just one of them.
mode_val <- as.numeric(names(sort(table(scores), decreasing = TRUE)[1]))
cat("Mode:  ", mode_val, "\n")

cat("\n=== Measures of Spread ===\n")
cat("Range:", range(scores)[1], "to", range(scores)[2], "\n")
cat("Variance:", round(var(scores), 2), "\n")
cat("Std Dev: ", round(sd(scores), 2), "\n")
cat("IQR:     ", IQR(scores), "\n")

cat("\n=== Quick Summary ===\n")
print(summary(scores))
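
The IQR is defined as Q3 − Q1, and that relationship can be made explicit with quantile(). The sketch below reuses the scores vector from the worked example; the name iqr_by_hand is an illustrative assumption.

```r
scores <- c(72, 85, 91, 68, 77, 84, 90, 73, 88, 95, 62, 79)

# Q1 and Q3 using R's default quantile method (type 7)
q <- quantile(scores, probs = c(0.25, 0.75))
print(q)

# IQR is the distance between the two quartiles
iqr_by_hand <- unname(q[2] - q[1])
cat("Q3 - Q1:    ", iqr_by_hand, "\n")
cat("IQR(scores):", IQR(scores), "\n")   # same value as the line above
```

Other software may use a different quartile rule, so small differences from R's quartiles on short vectors are not necessarily errors.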

Your Turn

Exercise 1 — Which measure fits best?

You have 8 test scores. One student scored 55 — much lower than the rest. Calculate all the descriptive stats and decide: is the mean or median a better measure of center here, and why?

test_scores <- c(88, 92, 75, 95, 84, 91, 78, 55)

# Calculate mean, median, variance, SD, IQR
cat("Mean:   ", round(mean(test_scores), 2), "\n")
cat("Median: ", median(test_scores), "\n")
cat("Variance:", round(var(test_scores), 2), "\n")
cat("SD:     ", round(sd(test_scores), 2), "\n")
cat("IQR:    ", IQR(test_scores), "\n")

# Remove the outlier (55) and see how the mean changes
no_outlier <- test_scores[test_scores != 55]
cat("\n--- Without the outlier ---\n")
cat("Mean:   ", round(mean(no_outlier), 2), "\n")
cat("Median: ", median(no_outlier), "\n")

# Which measure changed more? Which is more robust?
cat("\nMean changed by:", round(mean(no_outlier) - mean(test_scores), 2), "points\n")
cat("Median changed by:", median(no_outlier) - median(test_scores), "points\n")


What to notice: The mean is pulled toward the outlier (55): removing it raises the mean by about 3.9 points (82.25 to 86.14), while the median moves by only 2 (86 to 88). This is why the median is called "robust": it resists the influence of extreme values.

Exercise 2 — Visualize with a histogram

Create a histogram of the exam scores and add vertical lines marking the mean (blue, dashed) and median (red, solid). When the lines are far apart, the data is likely skewed. When they are close, the data is roughly symmetric.

scores <- c(72, 85, 91, 68, 77, 84, 90, 73, 88, 95, 62, 79)

# Create histogram
hist(scores,
     main = "Exam Score Distribution",
     xlab = "Score",
     col = "#B2DFDB",
     border = "white",
     breaks = 8)

# Add vertical lines for mean and median
abline(v = mean(scores), col = "blue", lwd = 2, lty = 2)
abline(v = median(scores), col = "red", lwd = 2, lty = 1)

# Add a legend
legend("topright",
       legend = c(paste("Mean =", round(mean(scores), 1)),
                  paste("Median =", median(scores))),
       col = c("blue", "red"),
       lwd = 2, lty = c(2, 1))


Try it: Change the scores vector to make the data more skewed — add a very low score like 20. Watch how the mean moves toward the outlier but the median stays more central.
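
One possible way to run this experiment numerically, before redrawing the histogram, is sketched below. The name skewed_scores is an illustrative assumption; 20 is the low score suggested above.

```r
scores <- c(72, 85, 91, 68, 77, 84, 90, 73, 88, 95, 62, 79)
skewed_scores <- c(scores, 20)   # append one very low score

# Compare each measure before and after adding the low score
cat("Mean:  ", round(mean(scores), 2), "->", round(mean(skewed_scores), 2), "\n")   # 80.33 -> 75.69
cat("Median:", median(scores), "->", median(skewed_scores), "\n")                   # 81.5 -> 79
```

The mean drops by more than 4.5 points while the median drops by 2.5, so the two vertical lines in the histogram end up further apart. Passing skewed_scores to the hist() and abline() calls above shows the same shift visually.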

Exercise 3 — Symmetric vs. Skewed data

Simulate two datasets: one symmetric (normal distribution) and one right-skewed (exponential distribution). Compare mean vs. median for each. Observe how the gap between them signals skewness.

set.seed(42)

# Symmetric data (normal distribution)
symmetric <- rnorm(200, mean = 70, sd = 10)

# Skewed data (exponential, then shifted)
skewed <- rexp(200, rate = 0.1) + 40

cat("=== Symmetric Data ===\n")
cat("Mean:   ", round(mean(symmetric), 2), "\n")
cat("Median: ", round(median(symmetric), 2), "\n")
cat("Gap:    ", round(mean(symmetric) - median(symmetric), 2), "\n")

cat("\n=== Skewed Data ===\n")
cat("Mean:   ", round(mean(skewed), 2), "\n")
cat("Median: ", round(median(skewed), 2), "\n")
cat("Gap:    ", round(mean(skewed) - median(skewed), 2), "\n")
cat("\nLarger gap = more skewed.\n")
cat("For skewed data, the median is usually a better summary.\n")

# Plot both side by side
par(mfrow = c(1, 2))
hist(symmetric, main = "Symmetric", xlab = "Value",
     col = "#B2DFDB", border = "white")
abline(v = mean(symmetric), col = "blue", lwd = 2)
abline(v = median(symmetric), col = "red", lwd = 2)

hist(skewed, main = "Right-Skewed", xlab = "Value",
     col = "#F8BBD9", border = "white")
abline(v = mean(skewed), col = "blue", lwd = 2)
abline(v = median(skewed), col = "red", lwd = 2)
par(mfrow = c(1, 1))
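
The raw gap between mean and median depends on the units and spread of the data, which makes it hard to compare across datasets. One option, offered here as an assumption rather than as part of the exercise, is to divide the gap by the standard deviation (a quantity sometimes called the nonparametric skew). The helper name std_gap is hypothetical.

```r
set.seed(42)
symmetric <- rnorm(200, mean = 70, sd = 10)
skewed    <- rexp(200, rate = 0.1) + 40

# Hypothetical helper: mean-median gap expressed in standard deviations
std_gap <- function(x) (mean(x) - median(x)) / sd(x)

cat("Symmetric:", round(std_gap(symmetric), 3), "\n")
cat("Skewed:   ", round(std_gap(skewed), 3), "\n")
```

Values near 0 suggest rough symmetry, positive values suggest right skew, and negative values suggest left skew. The ratio always lies between −1 and 1.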


Brain Break

You’ve just computed the building blocks of all statistical analysis. Take a moment.

Quick check: If a dataset has mean = 80 and median = 65, which direction is it skewed? (Answer: right-skewed — the mean is pulled up by high outliers.)
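
A small made-up dataset reproduces the numbers in the quick check. The values below are illustrative only, chosen so that one high value drags the mean upward.

```r
x <- c(60, 62, 65, 65, 68, 160)   # made-up values with one high outlier

mean(x)     # 80
median(x)   # 65

# Mean above median is consistent with right skew
mean(x) > median(x)   # TRUE
```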

Key Takeaway

Mean describes center for symmetric data. Median is more robust when data is skewed or has outliers. Standard deviation measures spread for symmetric data; use IQR for skewed data.

Module 1 Complete!

You can now summarize any dataset in R — computing every descriptive statistic and choosing the right one for the data’s shape. This foundation supports everything that follows.

Continue to Module 2: Probability Distributions →
