dplyr: Data Wrangling

The grammar of data manipulation

← Lab 2: Data Frames Lab 3 of 10 Lab 4: Functions →


Concept Recap

dplyr provides five core verbs that cover most everyday data manipulation tasks:

filter() — keep rows matching a condition
select() — choose specific columns
mutate() — add or transform columns
arrange() — sort rows (use desc() to reverse)
summarize() + group_by() — aggregate by group

The pipe |> chains operations: the result on the left becomes the first argument of the function on the right. There is no need to install dplyr in WebR; just call library(dplyr). Use n() inside summarize() to count rows per group.
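As a minimal sketch of what the pipe does, the nested and piped calls below should produce the same result. The scores data frame is a made-up example, not part of the lab data, and the native pipe |> assumes R 4.1 or later.

```r
library(dplyr)

# Hypothetical toy data, used only to illustrate the pipe
scores <- data.frame(
  team   = c("A", "A", "B"),
  points = c(10, 20, 5)
)

# Nested form: has to be read from the inside out
nested <- arrange(filter(scores, points > 5), desc(points))

# Piped form: each result becomes the first argument of the next call
piped <- scores |>
  filter(points > 5) |>
  arrange(desc(points))

print(identical(nested, piped))

# n() counts the rows in each group
counts <- scores |>
  group_by(team) |>
  summarize(n_rows = n())
print(counts)
```

The piped form reads in the order the steps happen, which is why the rest of this lab uses it throughout.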

Worked Example

Study this complete dplyr pipeline before starting the exercises:

library(dplyr)

sales <- data.frame(
  rep     = c("Alice","Bob","Carol","Alice","David","Bob","Carol","David"),
  region  = c("West","East","West","West","South","East","West","South"),
  month   = c("Jan","Jan","Jan","Feb","Jan","Feb","Feb","Feb"),
  revenue = c(18500,12300,22100,19800,15600,16900,25400,13200),
  stringsAsFactors = FALSE
)

# Total, average, and number of sales per rep, highest total first
result <- sales |>
  group_by(rep) |>
  summarize(total = sum(revenue), avg = mean(revenue), n = n()) |>
  arrange(desc(total))
print(result)

# West-region sales, flagged by whether revenue exceeded 15,000
west <- sales |>
  filter(region == "West") |>
  mutate(quota_met = revenue > 15000) |>
  select(rep, month, revenue, quota_met)
print(west)

Guided

Exercise 1 — Student Wrangling

The blanks have been filled in for you — run the code and study the output, then experiment by changing the filter threshold or sort order.

library(dplyr)
students <- data.frame(
  name  = c("Alex","Beth","Carlos","Diana","Ethan","Fiona","George","Hana"),
  major = c("CS","Math","CS","English","CS","Math","English","CS"),
  gpa   = c(3.8, 3.2, 3.9, 2.9, 3.5, 3.7, 3.1, 3.6),
  year  = c(3, 2, 4, 1, 2, 3, 4, 1),
  stringsAsFactors = FALSE)

# Filter GPA > 3.5, arrange descending
top <- students |> filter(gpa > 3.5) |> arrange(desc(gpa))
print(top)

# Average GPA by major
by_major <- students |> group_by(major) |>
  summarize(avg_gpa = mean(gpa), count = n())
print(by_major)

# Add rank column
ranked <- students |> mutate(rank = rank(-gpa)) |>
  select(name, gpa, rank) |> arrange(rank)
print(ranked)


Hint: filter(gpa > 3.5) keeps rows where GPA exceeds 3.5. group_by(major) splits the data before summarize() computes per-group stats. rank(-gpa) ranks from highest to lowest; tied values would share an averaged rank.
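The students data has no tied GPAs, so the tie behavior of rank() never shows up in the exercise. A small sketch with a hypothetical vector that does contain a tie shows the difference between base R's rank() and dplyr's min_rank():

```r
library(dplyr)

# Hypothetical GPA values with one tie (not the lab data)
gpa <- c(3.9, 3.5, 3.5, 3.1)

# Base R default: tied values share the average of their positions
avg_rank <- rank(-gpa)
print(avg_rank)   # 1.0 2.5 2.5 4.0

# dplyr alternative: tied values share the lowest position
low_rank <- min_rank(desc(gpa))
print(low_rank)   # 1 2 2 4
```

Which one to use depends on how ties should be reported; min_rank() gives the whole-number ranks most people expect in a leaderboard.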

Independent

Exercise 2 — Advanced Student Summaries

Using the same students data frame, write three separate pipelines:

Find the top student per major using group_by(major) |> slice_max(gpa, n=1)
Count students per year+major combination using group_by(year, major) |> summarize(count = n())
Create a summary by major showing count, average GPA, and max GPA, arranged by avg_gpa descending

library(dplyr)
students <- data.frame(
  name  = c("Alex","Beth","Carlos","Diana","Ethan","Fiona","George","Hana"),
  major = c("CS","Math","CS","English","CS","Math","English","CS"),
  gpa   = c(3.8, 3.2, 3.9, 2.9, 3.5, 3.7, 3.1, 3.6),
  year  = c(3, 2, 4, 1, 2, 3, 4, 1),
  stringsAsFactors = FALSE)

# 1. Top student per major

# 2. Count by year + major combo

# 3. Summary by major (count, avg_gpa, max_gpa) arranged by avg_gpa desc


Hint: slice_max(gpa, n=1) picks the row with the highest GPA per group (tied rows are all kept by default). Pass several columns to group_by() for crossed summaries. Use summarize(count = n(), avg_gpa = mean(gpa), max_gpa = max(gpa)) followed by arrange(desc(avg_gpa)).
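To see how slice_max() treats ties without giving away the exercise, here is a sketch on a hypothetical orders data frame (the column names and values are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data: store N has two rows tied for the top amount
orders <- data.frame(
  store  = c("N", "N", "S", "S", "S"),
  amount = c(40, 40, 10, 30, 20)
)

# Default (with_ties = TRUE): every row tied for the maximum is kept
top_default <- orders |>
  group_by(store) |>
  slice_max(amount, n = 1)
print(top_default)

# with_ties = FALSE: exactly one row per group
top_single <- orders |>
  group_by(store) |>
  slice_max(amount, n = 1, with_ties = FALSE)
print(top_single)
```

The result is still grouped by store; add ungroup() at the end if later steps should treat it as a plain table.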

Challenge

Exercise 3 — Performance Classification Pipeline

Build a single pipeline on students that: (1) filters to year ≥ 2, (2) adds a performance column using case_when() — “Excellent” if gpa ≥ 3.7, “Good” if gpa ≥ 3.3, “Fair” otherwise, (3) groups by major and performance, (4) counts, (5) arranges by major then count descending.

library(dplyr)
students <- data.frame(
  name  = c("Alex","Beth","Carlos","Diana","Ethan","Fiona","George","Hana"),
  major = c("CS","Math","CS","English","CS","Math","English","CS"),
  gpa   = c(3.8, 3.2, 3.9, 2.9, 3.5, 3.7, 3.1, 3.6),
  year  = c(3, 2, 4, 1, 2, 3, 4, 1),
  stringsAsFactors = FALSE)

result <- students |>
  filter(year >= 2) |>
  mutate(performance = case_when(
    gpa >= 3.7 ~ "Excellent",
    gpa >= 3.3 ~ "Good",
    TRUE        ~ "Fair"
  )) |>
  group_by(major, performance) |>
  summarize(count = n(), .groups = "drop") |>
  arrange(major, desc(count))
print(result)


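Hint: case_when() works like a chain of if / else if tests: each condition uses ~ to separate the test from the result, and the first condition that is TRUE wins, so order the tests from most to least specific. TRUE ~ "Fair" is the catch-all default. Pass .groups = "drop" inside summarize() to return an ungrouped result and suppress the message about the remaining grouping.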
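Because case_when() stops at the first condition that is TRUE, the order of the conditions matters. The sketch below uses a hypothetical three-value vector to show what happens when the broader test is listed first:

```r
library(dplyr)

# Hypothetical GPA values, one per performance band
gpa <- c(3.9, 3.4, 2.8)

# Most specific test first: each value lands in the intended band
correct <- case_when(
  gpa >= 3.7 ~ "Excellent",
  gpa >= 3.3 ~ "Good",
  TRUE       ~ "Fair"
)
print(correct)     # "Excellent" "Good" "Fair"

# Broad test first: 3.9 already satisfies gpa >= 3.3,
# so the "Excellent" branch is never reached
misordered <- case_when(
  gpa >= 3.3 ~ "Good",
  gpa >= 3.7 ~ "Excellent",
  TRUE       ~ "Fair"
)
print(misordered)  # "Good" "Good" "Fair"
```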

Mini Project — Sales Analysis

Full Sales Pipeline Analysis

Using the sales data frame below, use dplyr pipelines to answer all 5 questions. Use |> throughout.

library(dplyr)
sales <- data.frame(
  rep    = c("Alice","Bob","Carol","Alice","David","Bob","Carol","David"),
  region = c("West","East","West","West","South","East","West","South"),
  month  = c("Jan","Jan","Jan","Feb","Jan","Feb","Feb","Feb"),
  revenue= c(18500,12300,22100,19800,15600,16900,25400,13200),
  stringsAsFactors = FALSE
)

# Q1: Total and average revenue by rep (arrange by total desc)

# Q2: Which region had the highest total revenue?

# Q3: Compare Jan vs Feb total revenue

# Q4: Which reps exceeded a $15,000 quota in ANY month? (distinct reps)

# Q5: Top performer — their name, total, and % above team average


Hint: Q4: filter(revenue > 15000) |> distinct(rep). Q5: compute team average with mean(revenue) then calculate percentage above. Use pull() to extract a single value from a pipeline result.
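The distinct() and pull() pattern from the hint can be sketched on unrelated data, so the project questions stay open. The visits data frame below is hypothetical:

```r
library(dplyr)

# Hypothetical toy data (not the sales data)
visits <- data.frame(
  user  = c("u1", "u2", "u1", "u3"),
  pages = c(12, 3, 8, 15)
)

# distinct() collapses repeats: u1 qualifies twice but is listed once
heavy_users <- visits |>
  filter(pages > 5) |>
  distinct(user)
print(heavy_users)

# summarize() returns a one-row data frame;
# pull() extracts the column as a plain numeric value
avg_pages <- visits |>
  summarize(avg = mean(pages)) |>
  pull(avg)
print(avg_pages)

# A plain number can be used directly in ordinary arithmetic,
# here the percentage by which the largest value exceeds the average
pct_above <- (max(visits$pages) - avg_pages) / avg_pages * 100
print(round(pct_above, 1))
```

Without pull(), avg_pages would be a data frame rather than a number, which makes follow-up calculations more awkward.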

Lab 3 Complete!

You’ve worked through the core of dplyr, one of the most widely used R packages for data wrangling. The five verbs plus pipes let you express complex transformations in clean, readable code.

Continue to Lab 4: Functions & Control Flow →
