Module 2 Study Guide

Descriptive Statistics

Free Statistics Learning Platform • Safaa Dabagh

1. Measures of Center

Measures of Center: Single values that represent the "typical" or "central" value in a dataset. They help us summarize an entire dataset with one number.

The Three Main Measures

Mean (Average)

Mean = (Sum of all values) ÷ (Number of values)

x̄ = Σx / n

Where: x̄ = mean, Σx = sum of all values, n = number of values

Example: Test scores: 85, 90, 78, 92, 88
Mean = (85 + 90 + 78 + 92 + 88) ÷ 5 = 433 ÷ 5 = 86.6
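The same calculation can be sketched in Python. This is a minimal illustration using the test scores above; the variable names are chosen here for illustration and are not part of the original guide.

```python
from statistics import mean

scores = [85, 90, 78, 92, 88]

total = sum(scores)      # sum of all values: 433
n = len(scores)          # number of values: 5
x_bar = total / n        # mean: 433 / 5 = 86.6

# The standard library's mean() should agree with the manual formula.
library_mean = mean(scores)
```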

Median (Middle Value)

Median: The middle value when data is arranged in order
How to Find the Median:
  1. Arrange data in order from smallest to largest
  2. If odd number of values: Median is the middle value
  3. If even number of values: Median is the average of the two middle values
Example (odd): 2, 5, 7, 9, 12 → Median = 7 (middle value)
Example (even): 3, 5, 8, 10 → Median = (5 + 8) ÷ 2 = 6.5
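The three steps above can be sketched as a small Python function. The name find_median is a hypothetical helper for illustration; Python's built-in statistics.median follows the same rule.

```python
from statistics import median

def find_median(values):
    """Return the median using the sort-then-pick-the-middle steps."""
    ordered = sorted(values)          # step 1: arrange in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # step 2: odd count, take the middle value
        return ordered[mid]
    # step 3: even count, average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_result = find_median([2, 5, 7, 9, 12])    # 7
even_result = find_median([3, 5, 8, 10])      # (5 + 8) / 2 = 6.5
```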

Mode (Most Frequent)

Mode: The value that appears most often in the dataset
Example: 3, 7, 7, 9, 12, 7, 15 → Mode = 7 (appears 3 times)
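One way to find the mode in Python is to count how often each value appears. The sketch below returns a list, because a dataset can have more than one mode; the second dataset is made up here to illustrate a tie.

```python
from collections import Counter

def find_modes(values):
    """Return every value that appears the maximum number of times."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

modes = find_modes([3, 7, 7, 9, 12, 7, 15])   # [7], since 7 appears 3 times
tied = find_modes([1, 1, 2, 2])               # [1, 2], the mode is not unique
```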

When to Use Each Measure

Measure | Best Used When... | Advantages | Disadvantages
Mean | Data is fairly symmetric with no extreme outliers | Uses all data points; familiar to everyone | Very sensitive to outliers
Median | Data has outliers or is skewed | Not affected by outliers; better for skewed data | Doesn't use all information
Mode | Categorical data or finding most common value | Easy to identify; works with any data type | May not exist or may not be unique
CRITICAL: For salary data, home prices, and other datasets with extreme values, the median is usually more representative than the mean!
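To see why, consider a small made-up salary list (these figures are hypothetical, chosen only to illustrate the point). One very large salary pulls the mean far above what most people earn, while the median stays with the typical value.

```python
from statistics import mean, median

# Hypothetical salaries: four typical values and one extreme value.
salaries = [40_000, 45_000, 50_000, 55_000, 1_000_000]

salary_mean = mean(salaries)      # 1,190,000 / 5 = 238,000
salary_median = median(salaries)  # middle value of the sorted list: 50,000
```

Four of the five salaries are below 56,000, so the median of 50,000 describes the group far better than the mean of 238,000.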

2. Measures of Spread (Variability)

Measures of Spread: Values that describe how spread out or dispersed the data is. They tell us how much variability exists in the dataset.

Range

Range = Maximum − Minimum

Example: Data: 5, 12, 8, 20, 15
Range = 20 − 5 = 15

Limitation: Only uses two values (ignores everything in between); very sensitive to outliers
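In Python, the range is a one-line calculation with the built-in max and min functions, shown here on the example data:

```python
data = [5, 12, 8, 20, 15]

data_range = max(data) - min(data)   # 20 - 5 = 15
```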

Interquartile Range (IQR)

IQR = Q3 − Q1

Where: Q1 = first quartile (25th percentile), Q3 = third quartile (75th percentile)

IQR: The range of the middle 50% of the data. It measures spread while being resistant to outliers.
How to Find IQR:
  1. Arrange data in order
  2. Find the median (Q2)
  3. Q1 = median of the lower half (below Q2)
  4. Q3 = median of the upper half (above Q2)
  5. IQR = Q3 − Q1
Example: 3, 5, 7, 9, 11, 13, 15, 17, 19
Median (Q2) = 11
Lower half: 3, 5, 7, 9 → Q1 = (5 + 7) ÷ 2 = 6
Upper half: 13, 15, 17, 19 → Q3 = (15 + 17) ÷ 2 = 16
IQR = 16 − 6 = 10
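The steps above can be sketched in Python as follows. The helper name quartiles_and_iqr is hypothetical, and the code assumes the "median of each half" method described in this guide (with the middle value left out of both halves when the count is odd). Library routines such as statistics.quantiles use interpolation conventions that may give slightly different quartiles.

```python
from statistics import median

def quartiles_and_iqr(values):
    """Return (Q1, Q3, IQR) using the median-of-halves method."""
    ordered = sorted(values)       # step 1: arrange in order
    half = len(ordered) // 2
    lower = ordered[:half]         # lower half (middle value excluded if n is odd)
    upper = ordered[-half:]        # upper half (middle value excluded if n is odd)
    q1 = median(lower)
    q3 = median(upper)
    return q1, q3, q3 - q1

q1, q3, iqr = quartiles_and_iqr([3, 5, 7, 9, 11, 13, 15, 17, 19])
# q1 is 6, q3 is 16, and iqr is 10, matching the worked example
```

The function needs at least two values, since each half must contain at least one value.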

Variance

s² = Σ(x − x̄)² / (n − 1)

Where: s² = variance, x = each value, x̄ = mean, n = sample size

Variance: Approximately the average of the squared deviations from the mean (the sample formula divides by n − 1 rather than n). It measures how spread out the values are around the mean, in squared units.
Step-by-Step Calculation:
  1. Calculate the mean (x̄)
  2. For each value, find (x − x̄)
  3. Square each deviation: (x − x̄)²
  4. Sum all squared deviations: Σ(x − x̄)²
  5. Divide by (n − 1)
Example: Data: 4, 6, 8, 10
Mean = (4 + 6 + 8 + 10) ÷ 4 = 7
x | x − x̄ | (x − x̄)²
4 | −3 | 9
6 | −1 | 1
8 | 1 | 1
10 | 3 | 9
Sum: | | 20

s² = 20 ÷ (4 − 1) = 20 ÷ 3 = 6.67
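The five calculation steps can be followed line by line in Python. This is a sketch using the example data; the variable names are illustrative, and statistics.variance is used only as a cross-check because it also divides by n − 1.

```python
from statistics import variance

data = [4, 6, 8, 10]
n = len(data)

x_bar = sum(data) / n                          # step 1: mean = 7.0
deviations = [x - x_bar for x in data]         # step 2: [-3.0, -1.0, 1.0, 3.0]
squared = [d ** 2 for d in deviations]         # step 3: [9.0, 1.0, 1.0, 9.0]
sum_of_squares = sum(squared)                  # step 4: 20.0
s_squared = sum_of_squares / (n - 1)           # step 5: 20 / 3, about 6.67

library_variance = variance(data)              # sample variance from the standard library
```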

Standard Deviation

s = √[Σ(x − x̄)² / (n − 1)]

Simplified: s = √(variance)

Standard Deviation (SD): The square root of the variance. It measures the typical distance of values from the mean, in the same units as the data.
Example (continued from variance):
s = √6.67 = 2.58
Interpretation: A typical value lies roughly 2.58 units from the mean.
KEY DIFFERENCE: Variance is in squared units; standard deviation is in original units (easier to interpret!)
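A short Python sketch of the standard deviation for the same data follows. It also shows statistics.pstdev, which divides by n instead of n − 1 (the population formula), as a contrast; that comparison is an addition for illustration and is not covered elsewhere in this guide.

```python
import math
from statistics import stdev, pstdev

data = [4, 6, 8, 10]
x_bar = sum(data) / len(data)

# Sample variance (divide by n - 1), then take the square root.
s_squared = sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)
s = math.sqrt(s_squared)          # about 2.58, in the same units as the data

sample_sd = stdev(data)           # standard library version of the same formula
population_sd = pstdev(data)      # divides by n: sqrt(20 / 4), about 2.24
```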

The Empirical Rule (68-95-99.7 Rule)

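For bell-shaped (approximately normal) distributions:
• About 68% of values fall within 1 standard deviation of the mean (x̄ ± 1s)
• About 95% of values fall within 2 standard deviations of the mean (x̄ ± 2s)
• About 99.7% of values fall within 3 standard deviations of the mean (x̄ ± 3s)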

Example: SAT scores have mean = 1050 and SD = 100
• About 68% of students score between 950 and 1150 (1050 ± 100)
• About 95% score between 850 and 1250 (1050 ± 200)
• About 99.7% score between 750 and 1350 (1050 ± 300)
IMPORTANT: The Empirical Rule only works for bell-shaped (approximately normal) distributions!
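The intervals in the SAT example can be computed with a few lines of Python. The helper name empirical_intervals is hypothetical. As a check, the sketch also uses statistics.NormalDist (Python 3.8+) to compute the share of an exactly normal distribution inside each interval; this assumes the scores follow a normal model, which is a simplification.

```python
from statistics import NormalDist

def empirical_intervals(mean, sd):
    """Return the ranges within 1, 2, and 3 SDs of the mean, keyed by percentage."""
    return {pct: (mean - k * sd, mean + k * sd)
            for k, pct in ((1, 68), (2, 95), (3, 99.7))}

intervals = empirical_intervals(1050, 100)
# intervals[68] is (950, 1150), intervals[95] is (850, 1250),
# intervals[99.7] is (750, 1350)

# Share of an exactly normal distribution that falls inside each interval.
model = NormalDist(mu=1050, sigma=100)
shares = {pct: model.cdf(high) - model.cdf(low)
          for pct, (low, high) in intervals.items()}
```

The computed shares come out close to 0.68, 0.95, and 0.997, which is where the rule's name comes from.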

3. Distribution Shapes and Skewness

Distribution Shape: The overall pattern of how data values are spread out. Understanding shape helps us choose appropriate summary statistics.

Three Main Shapes

Symmetric Distribution

Characteristics:
• Left and right sides are roughly mirror images
• Mean ≈ Median ≈ Mode

Examples: Heights of adults, IQ scores, measurement errors, test scores (well-designed tests)

Right-Skewed (Positively Skewed)

Characteristics:
• Long tail extends to the right (toward high values)
• Mode < Median < Mean

Examples: Income, home prices, age at first marriage, company revenues

Left-Skewed (Negatively Skewed)

Characteristics:
• Long tail extends to the left (toward low values)
• Mean < Median < Mode

Examples: Age at death, exam scores (very easy test)

Mean vs. Median Relationship by Shape

Distribution Shape | Relationship | Best Measure of Center
Symmetric | Mean ≈ Median ≈ Mode | Mean (uses all data)
Right-Skewed | Mode < Median < Mean | Median (not affected by high outliers)
Left-Skewed | Mean < Median < Mode | Median (not affected by low outliers)
QUICK TIP: The mean is typically pulled in the direction of the skew (toward the tail)!

Impact of Outliers

Outlier: A data value that is unusually far from the rest of the data
Statistic | Affected by Outliers? | Called...
Mean | YES - very sensitive | Not resistant
Median | NO - stays stable | Resistant
Mode | NO - unaffected | Resistant
Range | YES - very sensitive | Not resistant
IQR | NO - stays stable | Resistant
Standard Deviation | YES - very sensitive | Not resistant
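The difference between resistant and non-resistant measures of spread can be checked with a quick experiment. The two datasets below are made up for illustration: the second is identical to the first except that its largest value is replaced by an extreme one. The iqr helper is hypothetical and uses the median-of-halves method from this guide.

```python
from statistics import median, stdev

def iqr(values):
    """IQR via the median-of-halves method (middle value excluded if n is odd)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[-half:]) - median(ordered[:half])

def spread(values):
    """Return the range, IQR, and sample standard deviation of the values."""
    return {"range": max(values) - min(values),
            "iqr": iqr(values),
            "sd": stdev(values)}

original = [10, 12, 14, 16, 18, 20, 22]       # hypothetical data
with_outlier = [10, 12, 14, 16, 18, 20, 100]  # largest value replaced by 100

before = spread(original)        # range 12, IQR 8
after = spread(with_outlier)     # range 90, IQR still 8, SD much larger
```

The IQR is identical in both cases, while the range and standard deviation grow sharply once the outlier is present.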

4. Five-Number Summary & Boxplots

Five-Number Summary

Five-Number Summary: A set of five values that summarize the center, spread, and overall range of a dataset

The Five Numbers:

  1. Minimum (Min): Smallest value
  2. First Quartile (Q1): 25th percentile
  3. Median (Q2): 50th percentile (middle value)
  4. Third Quartile (Q3): 75th percentile
  5. Maximum (Max): Largest value
Example: Data: 2, 4, 6, 8, 10, 12, 14, 16, 18
• Min = 2
• Q1 = 5 (median of 2, 4, 6, 8)
• Median (Q2) = 10
• Q3 = 15 (median of 12, 14, 16, 18)
• Max = 18
Five-Number Summary: {2, 5, 10, 15, 18}
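A five-number summary can be sketched in Python as shown below. The function name five_number_summary is hypothetical, and the quartiles follow the median-of-halves method used in this guide; statistical software may use other quartile conventions and report slightly different Q1 and Q3 values.

```python
from statistics import median

def five_number_summary(values):
    """Return (Min, Q1, Median, Q3, Max) using the median-of-halves method."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]     # lower half (middle value excluded if n is odd)
    upper = ordered[-half:]    # upper half (middle value excluded if n is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

summary = five_number_summary([2, 4, 6, 8, 10, 12, 14, 16, 18])
# Min 2, Q1 5, Median 10, Q3 15, Max 18
```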

Boxplots (Box-and-Whisker Plots)

Boxplot: A visual display of the five-number summary that shows the distribution shape and identifies potential outliers
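Boxplot Components:
• Box: Spans from Q1 to Q3, so its length is the IQR
• Median line: A line inside the box at Q2
• Whiskers: Extend from the box to the smallest and largest values that are not outliers
• Outliers: Plotted as individual points beyond the whiskers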

Outlier Detection Rule (1.5×IQR Rule)

Lower Fence = Q1 − (1.5 × IQR)

Upper Fence = Q3 + (1.5 × IQR)

Any value below the lower fence or above the upper fence is considered an outlier

Example: Q1 = 30, Q3 = 50, IQR = 20
Lower Fence = 30 − (1.5 × 20) = 30 − 30 = 0
Upper Fence = 50 + (1.5 × 20) = 50 + 30 = 80
Outliers: Any value < 0 or > 80
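The 1.5×IQR rule translates directly into code. In this sketch, the function name fences and the sample data list are hypothetical; the quartiles are the ones from the example above.

```python
def fences(q1, q3):
    """Return the (lower, upper) fences from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower, upper = fences(30, 50)     # lower fence 0, upper fence 80

# Hypothetical data to check against the fences.
data = [-5, 12, 35, 48, 79, 80, 95]
outliers = [x for x in data if x < lower or x > upper]   # [-5, 95]
```

Note that 80 sits exactly on the upper fence, so it is not flagged; only values strictly beyond a fence count as outliers.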

Reading Boxplots

What Boxplots Tell Us:
  1. Center: Location of the median line
  2. Spread: Width of the box and length of whiskers
  3. Shape:
    • Symmetric: Median centered in box, equal whiskers
    • Right-skewed: Median left of center, longer right whisker
    • Left-skewed: Median right of center, longer left whisker
  4. Outliers: Individual points beyond whiskers

Comparing Distributions with Boxplots

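Side-by-side boxplots let us compare:
• Centers: Which group has the higher median
• Spreads: Which group has the wider box (IQR) or longer whiskers
• Shapes: Whether each group is symmetric or skewed
• Outliers: Which groups have unusual values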

Quick Reference: All Formulas

Measures of Center

Mean: x̄ = Σx / n

Median: Middle value when ordered

Mode: Most frequent value

Measures of Spread

Range = Max − Min

IQR = Q3 − Q1

Variance: s² = Σ(x − x̄)² / (n − 1)

Standard Deviation: s = √s²

Empirical Rule (Bell-Shaped Distributions)

About 68% within x̄ ± 1s

About 95% within x̄ ± 2s

About 99.7% within x̄ ± 3s

Outlier Detection

Lower Fence = Q1 − 1.5×IQR

Upper Fence = Q3 + 1.5×IQR

Distribution Shapes

Symmetric: Mean ≈ Median ≈ Mode

Right-Skewed: Mode < Median < Mean

Left-Skewed: Mean < Median < Mode

Important Reminders

1. Use median for skewed data or data with outliers
2. Use mean for symmetric data without outliers
3. IQR is resistant to outliers; range and SD are not
4. The Empirical Rule only works for bell-shaped distributions
5. The mean is typically pulled in the direction of the skew
6. A boxplot shows the five-number summary visually
7. Standard deviation has the same units as the data

Free Statistics Learning Platform • Safaa Dabagh • sdabagh.github.io

© 2025 • Part of UCLA Dissertation Research