Tidyr & Data Reshaping
Tidy data is happy data
📖 Concept Recap
Tidy data has one rule: each variable is a column, each observation is a row. Real-world data is often “wide” (one row per subject, one column per time point), but tidy “long” format is needed for ggplot2 and dplyr.
- pivot_longer() — wide → long: stack multiple columns into key-value pairs
- pivot_wider() — long → wide: spread a key-value pair into columns
- separate() — split one column into two (e.g. “2024-01” → year, month)
- unite() — combine two columns into one
Use library(tidyr) together with library(dplyr). Both load without installation in WebR.
👀 Worked Example
Pivoting exam scores from wide format to long, then summarizing per student:
Exercise 1 — Monthly Sales Pivot
Pivot the wide monthly sales table to long format so each row represents one rep–month combination.
pivot_longer(cols = c(Jan, Feb, Mar), ...) stacks the three month columns. Alternatively use cols = -rep to pivot all columns except rep. The names_to argument names the new key column.Exercise 2 — City Population Growth
Start with a wide city population table (2020–2023). Pivot to long, calculate year-over-year growth rates, and find the fastest-growing city each year.
lag(population) gets the previous row’s value within each city group — perfect for year-over-year growth. sub("pop_", "", year) strips the prefix to get just the year number.Exercise 3 — separate() and unite()
Use separate() to split full_name into first and last columns, and split date_range into start and end dates. Then use unite() to create a name_id column by combining last name and ID.
separate(col, into=c("a","b"), sep="pattern") splits on a regex pattern. unite("new_col", col1, col2, sep="_") combines two columns with a separator. Both functions modify the data frame in a pipeline.Wide → Long → Letter Grades → Wide Again
Transform a wide gradebook through a complete pipeline: pivot to long, add letter grade column, then pivot back wide with letter grades, and generate a summary table.
pivot_wider(names_from = assignment, values_from = grade). The names_from column provides the new column names and values_from provides the cell values.✅ Lab 8 Complete!
You can now reshape any dataset between wide and long formats, split and combine columns, and build complete data pipelines. Tidy data enables all the visualization and analysis techniques from earlier labs.
Continue to Lab 9: Professional Output & Reporting →