Tidyr & Data Reshaping

Tidy data is happy data

← Lab 7: Statistics Lab 8 of 10 Lab 9: Reporting →


Concept Recap

Tidy data follows three rules: each variable is a column, each observation is a row, and each value is a single cell. Real-world data is often “wide” (one row per subject, one column per time point), but ggplot2 and dplyr work most naturally with the tidy “long” format (one row per subject per time point).

pivot_longer(): wide → long; stacks multiple columns into name-value pairs
pivot_wider(): long → wide; spreads a name-value pair into columns
separate(): splits one column into two or more (e.g. “2024-01” → year, month)
unite(): combines two or more columns into one
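As a quick illustration of the last two functions, here is a minimal sketch that splits a "2024-01"-style column and then joins it back. The periods data frame and its values are made up for this example.

```r
library(tidyr)

# Hypothetical data: one column packing year and month together
periods <- data.frame(period = c("2024-01", "2024-02", "2025-11"))

# separate(): split on the hyphen into two character columns
parts <- periods |>
  separate(period, into = c("year", "month"), sep = "-")
print(parts)

# unite(): paste the two columns back together with the same separator
rejoined <- parts |>
  unite("period", year, month, sep = "-")
print(rejoined)
```

Both new columns stay as character here, which keeps the leading zero in "01" and lets unite() rebuild the original strings.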

Use library(tidyr) together with library(dplyr). Both load without installation in WebR.

Worked Example

Pivoting exam scores from wide format to long, then summarizing per student:

library(tidyr); library(dplyr)
wide <- data.frame(
  student = c("Alice","Bob","Carol"),
  exam_1  = c(88, 75, 92),
  exam_2  = c(91, 80, 88),
  exam_3  = c(85, 78, 95)
)
cat("Wide format:\n"); print(wide)

long <- wide |>
  pivot_longer(cols = starts_with("exam"),
               names_to  = "exam",
               values_to = "score")
cat("\nLong format:\n"); print(long)

long |> group_by(student) |>
  summarize(avg = mean(score), best = max(score)) |>
  print()

Guided

Exercise 1 — Monthly Sales Pivot

Pivot the wide monthly sales table to long format so each row represents one rep–month combination.

library(tidyr); library(dplyr)
monthly_sales <- data.frame(
  rep = c("Alice","Bob","Carol","David"),
  Jan = c(18500, 12300, 22100, 15600),
  Feb = c(19800, 16900, 25400, 13200),
  Mar = c(20500, 14200, 26800, 15100)
)
cat("Wide:\n"); print(monthly_sales)

# Pivot to long format
long_sales <- monthly_sales |>
  pivot_longer(cols = c(Jan, Feb, Mar),
               names_to  = "month",
               values_to = "revenue")
cat("\nLong:\n"); print(long_sales)

# Summary: total revenue by rep
cat("\nTotal by rep:\n")
long_sales |> group_by(rep) |>
  summarize(total = sum(revenue), avg = round(mean(revenue), 0)) |>
  arrange(desc(total)) |> print()


Hint: pivot_longer(cols = c(Jan, Feb, Mar), ...) stacks the three month columns. Alternatively use cols = -rep to pivot all columns except rep. The names_to argument names the new key column.
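To see that the two ways of choosing columns are interchangeable, the sketch below pivots a smaller table both ways and compares the results. The sales data frame is a made-up, cut-down stand-in for monthly_sales.

```r
library(tidyr)

# Hypothetical two-rep, two-month table with the same shape as monthly_sales
sales <- data.frame(
  rep = c("Alice", "Bob"),
  Jan = c(100, 200),
  Feb = c(150, 250)
)

# Select the month columns by name
by_name <- sales |>
  pivot_longer(cols = c(Jan, Feb), names_to = "month", values_to = "revenue")

# Select everything except the identifier column
by_exclude <- sales |>
  pivot_longer(cols = -rep, names_to = "month", values_to = "revenue")

print(by_exclude)
print(all.equal(by_name, by_exclude))
```

The exclusion form tends to be handier when months keep being added, since the call does not need updating for each new column.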

Independent

Exercise 2 — City Population Growth

Start with a wide city population table (2020–2023). Pivot to long, calculate year-over-year growth rates, and find the fastest-growing city each year.

library(tidyr); library(dplyr)
cities <- data.frame(
  city     = c("Springfield","Riverside","Lakewood","Hillcrest"),
  pop_2020 = c(142000, 98000, 215000, 67000),
  pop_2021 = c(145500, 101200, 219800, 69400),
  pop_2022 = c(148100, 105600, 222300, 72100),
  pop_2023 = c(153000, 112000, 224800, 76500)
)

# 1. Pivot to long format (city, year, population)
long_cities <- cities |>
  pivot_longer(cols = starts_with("pop_"),
               names_to  = "year",
               values_to = "population") |>
  mutate(year = as.integer(sub("pop_", "", year)))

cat("Long format (first 8 rows):\n")
print(head(long_cities, 8))

# 2. Calculate year-over-year growth rate
growth <- long_cities |>
  group_by(city) |>
  arrange(city, year) |>
  mutate(growth_pct = round((population / lag(population) - 1) * 100, 2)) |>
  filter(!is.na(growth_pct))

cat("\nGrowth rates:\n"); print(growth)

# 3. Fastest growing city each year
cat("\nFastest growing city per year:\n")
growth |> group_by(year) |> slice_max(growth_pct, n=1) |>
  select(year, city, growth_pct) |> print()


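Hint: lag(population) returns the previous row’s value within each city group, so the rows must be sorted by year first (hence the arrange() call). The first year of each city has no predecessor and yields NA, which the filter() step drops. sub("pop_", "", year) strips the prefix, leaving just the year as text for as.integer() to convert.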
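The sketch below isolates the grouped lag() step on a tiny table so the NA in each group's first row is easy to spot. The pops data frame, its city labels, and the prev column are invented for illustration.

```r
library(dplyr)

# Hypothetical populations for two cities over three years
pops <- data.frame(
  city       = c("A", "A", "A", "B", "B", "B"),
  year       = c(2020, 2021, 2022, 2020, 2021, 2022),
  population = c(100, 110, 121, 200, 210, 231)
)

growth_demo <- pops |>
  group_by(city) |>
  arrange(year, .by_group = TRUE) |>      # sort by year inside each city
  mutate(prev       = lag(population),    # previous year's value, NA for the first year
         growth_pct = round((population / prev - 1) * 100, 2)) |>
  ungroup()

print(growth_demo)
```

Because lag() runs inside group_by(city), city B's first row does not borrow city A's last value.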

Challenge

Exercise 3 — separate() and unite()

Use separate() to split full_name into first and last columns, and split date_range into start and end dates. Then use unite() to create a name_id column by combining last name and ID.

library(tidyr); library(dplyr)
records <- data.frame(
  id         = 1:5,
  full_name  = c("Smith, John","Lee, Sarah","Brown, Carlos","Kim, Diana","Wang, Ethan"),
  date_range = c("2024-01 to 2024-06","2024-02 to 2024-08",
                 "2024-03 to 2024-09","2024-01 to 2024-12","2024-04 to 2024-10"),
  score      = c(88, 92, 76, 95, 83)
)

# Split full_name into last and first
split1 <- records |>
  separate(full_name, into = c("last", "first"), sep = ", ")

# Split date_range into start_date and end_date
split2 <- split1 |>
  separate(date_range, into = c("start_date", "end_date"), sep = " to ")

cat("After separating:\n"); print(split2)

# Unite last name and id into name_id (e.g. "Smith_1")
final <- split2 |>
  unite("name_id", last, id, sep = "_")
cat("\nAfter uniting name_id:\n"); print(final)


Hint: separate(col, into = c("a","b"), sep = "pattern") splits on a regex pattern. unite("new_col", col1, col2, sep = "_") combines two columns with a separator. Both return a new data frame rather than changing the original, and both drop their input columns by default (pass remove = FALSE to keep them).
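Here is a small sketch of two options worth knowing: a regex separator that tolerates inconsistent spacing, and remove = FALSE to keep the input columns. The rec data frame is a made-up two-row variant of records.

```r
library(tidyr)

# Hypothetical records; note the missing space after the comma in row 2
rec <- data.frame(
  id        = 1:2,
  full_name = c("Smith, John", "Lee,Sarah")
)

out <- rec |>
  # ",\\s*" is a regex: a comma followed by zero or more whitespace characters
  separate(full_name, into = c("last", "first"), sep = ",\\s*", remove = FALSE) |>
  # remove = FALSE keeps last and id next to the new name_id column
  unite("name_id", last, id, sep = "_", remove = FALSE)

print(out)
```

Keeping the inputs can be useful while checking a pipeline, at the cost of a wider table.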

Mini Project — Gradebook Transformation

Wide → Long → Letter Grades → Wide Again

Transform a wide gradebook through a complete pipeline: pivot to long, add a letter-grade column, pivot back to wide with letter grades, and generate a summary table.

library(tidyr); library(dplyr)
gradebook <- data.frame(
  student = c("Alice","Bob","Carol","Diana","Ethan","Fiona"),
  hw1     = c(92, 78, 88, 95, 71, 85),
  hw2     = c(87, 82, 91, 90, 68, 88),
  midterm = c(85, 76, 93, 88, 72, 90),
  hw3     = c(90, 80, 87, 92, 75, 83),
  stringsAsFactors = FALSE
)

cat("Original (wide):\n"); print(gradebook)

# Step 1: Pivot to long
long_gb <- gradebook |>
  pivot_longer(cols = -student, names_to = "assignment", values_to = "score")

# Step 2: Add letter grade
long_gb <- long_gb |>
  mutate(grade = case_when(
    score >= 90 ~ "A",
    score >= 80 ~ "B",
    score >= 70 ~ "C",
    score >= 60 ~ "D",
    TRUE        ~ "F"
  ))

cat("\nLong format with grades (first 8 rows):\n")
print(head(long_gb, 8))

# Step 3: Pivot back wide with letter grades
wide_grades <- long_gb |>
  select(student, assignment, grade) |>
  pivot_wider(names_from = assignment, values_from = grade)

cat("\nWide format (letter grades):\n"); print(wide_grades)

# Step 4: Summary table
cat("\nStudent summary:\n")
long_gb |> group_by(student) |>
  summarize(avg_score = round(mean(score), 1),
            # Same cutoffs as the per-assignment scale in Step 2
            final_grade = case_when(
              mean(score) >= 90 ~ "A",
              mean(score) >= 80 ~ "B",
              mean(score) >= 70 ~ "C",
              mean(score) >= 60 ~ "D",
              TRUE ~ "F")) |>
  arrange(desc(avg_score)) |> print()


Hint: When pivoting back wide, use pivot_wider(names_from = assignment, values_from = grade). The names_from column provides the new column names and values_from provides the cell values.
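A minimal round-trip sketch shows that pivot_wider() undoes pivot_longer() when the same name and value columns are used. The two-student wide table here is invented for the example.

```r
library(tidyr)

# Hypothetical two-student gradebook
wide <- data.frame(
  student = c("Alice", "Bob"),
  hw1     = c(92, 78),
  hw2     = c(87, 82)
)

# Wide -> long: one row per student-assignment pair
long <- wide |>
  pivot_longer(cols = -student, names_to = "assignment", values_to = "score")

# Long -> wide: assignment names become columns again, scores fill the cells
back <- long |>
  pivot_wider(names_from = assignment, values_from = score)

print(back)
```

The result comes back as a tibble rather than a base data.frame, but the columns and values match the starting table.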

Lab 8 Complete!

You can now reshape datasets between wide and long formats, split and combine columns, and chain these steps into complete data pipelines. Tidy data is the shape that the visualization and analysis techniques from earlier labs expect.

Continue to Lab 9: Professional Output & Reporting →
