dplyr: Data Wrangling
The grammar of data manipulation
📖 Concept Recap
dplyr provides 5 main verbs that cover nearly all data manipulation needs:
- filter() — keep rows matching a condition
- select() — choose specific columns
- mutate() — add or transform columns
- arrange() — sort rows (use desc() to reverse)
- summarize() + group_by() — aggregate by group
The pipe |> chains operations: the result of the expression on the left becomes the first argument of the call on the right. In WebR there is nothing to install; just call library(dplyr). Use n() inside summarize() to count the rows in each group.
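As a minimal sketch of what the pipe does, the snippet compares a nested call with its piped equivalent. The scores data frame and its columns are hypothetical, made up only for illustration.

```r
library(dplyr)

# Hypothetical toy data, not part of the lab.
scores <- data.frame(
  group = c("a", "a", "b"),
  value = c(1, 2, 3)
)

# Nested form: read from the inside out.
nested <- summarize(group_by(scores, group), rows = n())

# Piped form: read left to right; each result feeds the next call
# as its first argument.
piped <- scores |>
  group_by(group) |>
  summarize(rows = n())

# Compare the two results.
print(identical(nested, piped))
print(piped)
```

The piped version does the same work as the nested one; it only changes the order in which the steps are written.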
👀 Worked Example
Study this complete dplyr pipeline before starting the exercises:
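The lab's original example is not reproduced here. As a stand-in, the sketch below runs one pipeline on the built-in mtcars data set (an assumption, not the lab's own data) and touches each of the five verbs once. The column names kpl, n_cars, and avg_kpl are illustrative choices.

```r
library(dplyr)

# mtcars ships with base R; it stands in for the lab's data here.
efficiency <- mtcars |>
  select(mpg, cyl, hp) |>                  # keep only the columns we need
  filter(hp > 100) |>                      # cars with more than 100 horsepower
  mutate(kpl = mpg * 0.425) |>             # approximate km per litre
  group_by(cyl) |>                         # one group per cylinder count
  summarize(n_cars = n(), avg_kpl = mean(kpl)) |>
  arrange(desc(avg_kpl))                   # most efficient group first

print(efficiency)
```

Reading the pipeline top to bottom gives the order in which the steps run, which is the main readability benefit of the pipe.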
Exercise 1 — Student Wrangling
The blanks have been filled in for you — run the code and study the output, then experiment by changing the filter threshold or sort order.
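The exercise's code cell is not reproduced here. The following is a sketch of the kind of pipelines the exercise describes, assuming a hypothetical students data frame with columns name, major, year, and gpa; the rows are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the lab's students data frame.
students <- data.frame(
  name  = c("Ana", "Ben", "Cai", "Dee", "Eli", "Fay", "Gus", "Hana"),
  major = c("Math", "Math", "Bio", "Bio", "CS", "CS", "CS", "Math"),
  year  = c(1, 2, 3, 2, 4, 1, 4, 2),
  gpa   = c(3.9, 3.4, 3.6, 3.2, 3.8, 3.5, 3.1, 3.7)
)

# Rows with a GPA strictly above 3.5, highest first.
high_gpa <- students |>
  filter(gpa > 3.5) |>
  arrange(desc(gpa))

# Per-major row count and mean GPA.
by_major <- students |>
  group_by(major) |>
  summarize(n_students = n(), avg_gpa = mean(gpa))

# rank(-gpa) assigns rank 1 to the highest GPA.
ranked <- students |>
  mutate(gpa_rank = rank(-gpa)) |>
  arrange(gpa_rank)

print(high_gpa)
print(by_major)
print(ranked)
```

Changing 3.5 to another threshold, or dropping desc(), is a quick way to see how each verb affects the result.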
filter(gpa > 3.5) keeps rows where GPA exceeds 3.5. group_by(major) splits the data before summarize() computes per-group stats. rank(-gpa) ranks from highest to lowest.
Exercise 2 — Advanced Student Summaries
Using the same students data frame, write three separate pipelines:
- Find the top student per major using group_by(major) |> slice_max(gpa, n=1)
- Count students per year+major combination using group_by(year, major) |> summarize(count = n())
- Create a summary by major showing count, average GPA, and max GPA, arranged by avg_gpa descending
slice_max(gpa, n=1) picks the row with the highest GPA in each group. Pass several columns to group_by() for crossed summaries. Use summarize(count=n(), avg_gpa=mean(gpa), max_gpa=max(gpa)) followed by arrange(desc(avg_gpa)).
Exercise 3 — Performance Classification Pipeline
Build a single pipeline on students that: (1) filters to year ≥ 2, (2) adds a performance column using case_when() — “Excellent” if gpa ≥ 3.7, “Good” if gpa ≥ 3.3, “Fair” otherwise, (3) groups by major and performance, (4) counts, (5) arranges by major then count descending.
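The five steps can be sketched as a single pipeline. The students data frame here is a hypothetical stand-in with invented rows; only the column names year, gpa, and major come from the exercise text.

```r
library(dplyr)

# Hypothetical stand-in for the lab's students data frame.
students <- data.frame(
  name  = c("Ana", "Ben", "Cai", "Dee", "Eli", "Fay", "Gus", "Hana"),
  major = c("Math", "Math", "Bio", "Bio", "CS", "CS", "CS", "Math"),
  year  = c(1, 2, 3, 2, 4, 1, 4, 2),
  gpa   = c(3.9, 3.4, 3.6, 3.2, 3.8, 3.5, 3.1, 3.7)
)

performance_counts <- students |>
  filter(year >= 2) |>                       # (1) second year and above
  mutate(performance = case_when(            # (2) first matching test wins
    gpa >= 3.7 ~ "Excellent",
    gpa >= 3.3 ~ "Good",
    TRUE       ~ "Fair"
  )) |>
  group_by(major, performance) |>            # (3) group by both columns
  summarize(count = n(), .groups = "drop") |>  # (4) count rows per group
  arrange(major, desc(count))                # (5) major A-Z, then largest count

print(performance_counts)
```

Because case_when() evaluates its conditions in order, the gpa >= 3.3 test only ever sees rows that failed gpa >= 3.7, so no upper bound is needed.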
case_when() works like a chain of if/else tests: each condition uses ~ to separate the test from the result, and the first condition that matches wins. TRUE ~ "Fair" is the catch-all default. Pass .groups = "drop" inside summarize() to silence the grouping message and return an ungrouped result.
Full Sales Pipeline Analysis
Using the sales data frame below, use dplyr pipelines to answer all 5 questions. Use |> throughout.
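The lab's sales data frame and its five questions are not reproduced here. As a rough sketch of two techniques that fit this kind of analysis, distinct() and pull(), the snippet uses a hypothetical sales data frame with invented rows and assumed columns rep and revenue. Reading "percentage above" as the share of sales rows above the team average is also an assumption.

```r
library(dplyr)

# Hypothetical stand-in for the lab's sales data frame.
sales <- data.frame(
  rep     = c("Kim", "Kim", "Lee", "Lee", "Mo", "Mo"),
  revenue = c(18000, 12000, 16000, 9000, 14000, 21000)
)

# Reps with at least one sale above 15000, each listed once.
big_sellers <- sales |>
  filter(revenue > 15000) |>
  distinct(rep)

# pull() turns a one-column result into a plain vector,
# here a single number that can be reused in later steps.
team_avg <- sales |>
  summarize(avg = mean(revenue)) |>
  pull(avg)

# Share of sales rows above the team average, as a percentage.
pct_above <- sales |>
  summarize(pct = mean(revenue > team_avg) * 100) |>
  pull(pct)

print(big_sellers)
print(team_avg)
print(pct_above)
```

Storing team_avg as a plain number first keeps the second pipeline simple; the same value could also be computed inline with mean(revenue) inside summarize().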
filter(revenue > 15000) |> distinct(rep). Q5: compute team average with mean(revenue) then calculate percentage above. Use pull() to extract a single value from a pipeline result.
✅ Lab 3 Complete!
You’ve mastered the core of dplyr, one of the most widely used R packages for data wrangling. The five verbs plus pipes let you express complex transformations in clean, readable code.