String Manipulation
Cleaning and transforming text in R
📖 Concept Recap
R has two worlds of string tools — base R and the stringr package (part of the tidyverse):
- Base R: paste(), paste0(), nchar(), toupper(), tolower(), trimws(), substr(), gsub(), grepl(), strsplit()
- stringr: str_trim(), str_detect(), str_replace(), str_replace_all(), str_split(), str_extract(), str_length(), str_count(), str_to_title(), str_remove()
stringr functions are consistent: they always take the string as the first argument and share the same pattern syntax. Load the package with library(stringr); no install is needed in WebR.
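To make the argument-order point concrete, here is a small side-by-side sketch. The sample vector x is made up for illustration; the functions are the ones listed above.

```r
library(stringr)

x <- c("  Hello World  ", "data science")

# Base R: the string is not always the first argument
base_trim    <- trimws(x)            # string first
base_replace <- gsub("o", "0", x)    # pattern, replacement, then string
base_detect  <- grepl("data", x)     # pattern, then string

# stringr: the string always comes first
str_trim_out    <- str_trim(x)
str_replace_out <- str_replace_all(x, "o", "0")
str_detect_out  <- str_detect(x, "data")

# Each pair gives the same result on this input
identical(base_trim, str_trim_out)
identical(base_replace, str_replace_out)
identical(base_detect, str_detect_out)
```

Keeping the string first is what lets stringr calls chain cleanly in a pipe.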
👀 Worked Example
A name-cleaning function handling multiple messy formats:
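The lab's original code is not reproduced here, so the following is only a minimal sketch of what such a function might look like. The name clean_name(), the sample inputs, and the choice of formats handled (stray whitespace, inconsistent case, "Last, First" order) are all assumptions.

```r
library(stringr)

# clean_name() is a hypothetical helper, not the lab's original code
clean_name <- function(x) {
  x <- str_trim(x)                        # drop leading/trailing whitespace
  x <- str_replace_all(x, "\\s+", " ")    # collapse runs of whitespace
  # Flip "Last, First" into "First Last"; names without a comma are unchanged
  x <- str_replace(x, "^([^,]+),\\s*(.+)$", "\\2 \\1")
  str_to_title(x)                         # capitalize each word
}

# Made-up messy inputs
messy <- c("  john SMITH ", "DOE,   jane", "bob    jones")
cleaned <- clean_name(messy)
print(cleaned)
```

Each step handles one kind of mess, which makes the function easy to extend when a new format shows up.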
Exercise 1 — Text Analyzer
Run this text analysis code and study what each stringr function returns. Experiment by changing my_text.
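If the interactive cell is not available, a sketch along these lines can be run instead. The value of my_text and the particular set of calls are assumptions, not the lab's exact code.

```r
library(stringr)

# Made-up sample text; change it and rerun to see how the results move
my_text <- "The quick brown fox jumps over the lazy dog near the riverbank"

n_chars    <- str_length(my_text)                  # number of characters
n_words    <- str_count(my_text, "\\w+")           # number of words
has_fox    <- str_detect(my_text, "fox")           # TRUE if "fox" appears
first_word <- str_extract(my_text, "\\w+")         # first match only
long_words <- str_extract_all(my_text, "\\b\\w{5,}\\b")[[1]]  # all words with 5+ characters
shouted    <- str_to_title(my_text)                # capitalize each word

print(n_chars)
print(n_words)
print(has_fox)
print(first_word)
print(long_words)
print(shouted)
```

Note that str_extract() returns only the first match per string, while str_extract_all() returns a list with every match.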
\\b in a regex pattern means “word boundary”. \\w+ matches one or more word characters (letters, digits, or underscore). \\w{5,} matches five or more word characters, so \\b\\w{5,}\\b matches whole words of at least 5 characters.
Exercise 2 — Phone Number Formatter
Write a format_phone(raw) function that takes messy phone number strings and returns them in standard 213-555-0123 format. Test on all 4 formats below.
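One possible approach is sketched below. The four input strings are assumptions, since the lab's test formats are not shown here. Dropping a leading country code 1 and returning NA for anything that does not reduce to 10 digits are also assumptions.

```r
library(stringr)

format_phone <- function(raw) {
  digits <- str_replace_all(raw, "[^0-9]", "")   # keep digits only
  # Assumption: an 11-digit number starting with 1 carries a US country code
  digits <- ifelse(nchar(digits) == 11 & substr(digits, 1, 1) == "1",
                   substr(digits, 2, 11), digits)
  # Anything that is not exactly 10 digits is treated as invalid
  ifelse(
    nchar(digits) == 10,
    paste(substr(digits, 1, 3), substr(digits, 4, 6), substr(digits, 7, 10),
          sep = "-"),
    NA_character_
  )
}

# Hypothetical messy inputs
raw_phones <- c("(213) 555-0123", "213.555.0123", "2135550123", "+1 213 555 0123")
formatted <- format_phone(raw_phones)
print(formatted)
```

Returning NA instead of a partly formatted string makes bad rows easy to find later with is.na().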
str_replace_all(x, "[^0-9]", "") removes everything that is NOT a digit. substr(digits, 1, 3) extracts the area code. nchar() gives the string length.
Exercise 3 — Email List Parser
Write parse_email_list(emails) that takes a character vector of "Name <email@domain.com>" strings and returns a data frame with columns: display_name, username, domain.
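A minimal sketch of one way to do this follows. The sample addresses are made up, and splitting the address at the @ sign is an assumption about what username and domain mean.

```r
library(stringr)

parse_email_list <- function(emails) {
  address      <- str_extract(emails, "(?<=<)[^>]+")     # text between < and >
  display_name <- str_trim(str_remove(emails, "<.*$"))   # text before <
  data.frame(
    display_name = display_name,
    username     = str_extract(address, "^[^@]+"),       # part before @
    domain       = str_extract(address, "(?<=@).+$"),    # part after @
    stringsAsFactors = FALSE
  )
}

# Hypothetical inputs
emails <- c("Ada Lovelace <ada@example.com>",
            "Grace Hopper <grace.hopper@navy.example.org>")
parsed <- parse_email_list(emails)
print(parsed)
```

An entry with no angle brackets yields NA in the username and domain columns, so malformed rows stand out in the result.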
(?<=<)[^>]+ is a lookbehind regex: match characters after < but before >. str_extract() returns NA if no match is found, which is helpful for error detection.
Full Paragraph Analysis
Perform a complete text analysis of the paragraph below. Compute all 6 metrics and print a formatted report.
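The paragraph and the exact six metrics are not shown here, so the sketch below uses a made-up paragraph and an assumed set of metrics: word count, sentence count, unique words, average word length, top 5 words, and share of long words.

```r
library(stringr)

# Made-up paragraph for illustration
para <- "Data cleaning takes time. Clean data makes analysis easier, and careful analysis makes results more reliable."

# Lowercase first so "Data" and "data" count as the same word
words <- str_extract_all(str_to_lower(para), "\\w+")[[1]]

n_words     <- length(words)
n_sentences <- str_count(para, "[.!?]")
n_unique    <- length(unique(words))
avg_length  <- mean(nchar(words))
freq        <- table(words)
top5        <- sort(freq, decreasing = TRUE)[1:5]
prop_long   <- mean(nchar(words) > 6)   # share of words longer than 6 characters

cat("Words:          ", n_words, "\n")
cat("Sentences:      ", n_sentences, "\n")
cat("Unique words:   ", n_unique, "\n")
cat("Avg word length:", round(avg_length, 2), "\n")
cat("Long-word share:", round(prop_long, 3), "\n")
cat("Top 5 words:    ", paste(names(top5), collapse = ", "), "\n")
```

Counting sentence-ending punctuation is a rough estimate of sentence count; abbreviations such as "e.g." would inflate it.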
table(words) counts word frequencies. sort(..., decreasing=TRUE)[1:5] gets the top 5. nchar() works on vectors, so mean(nchar(words) > 6) gives the proportion of long words.
✅ Lab 5 Complete!
You can now clean messy text, extract patterns with regex, and transform strings for real-world data cleaning tasks. String manipulation is essential for working with survey data, web scraping, and log files.