String Manipulation

Cleaning and transforming text in R

Lab 5 of 10


Concept Recap

R has two worlds of string tools — base R and the stringr package (part of the tidyverse):

Base R: paste(), paste0(), nchar(), toupper(), tolower(), trimws(), substr(), gsub(), grepl(), strsplit()
stringr: str_trim(), str_detect(), str_replace(), str_replace_all(), str_split(), str_extract(), str_extract_all(), str_length(), str_count(), str_to_title(), str_remove()

stringr functions share a consistent interface: each takes the string as its first argument, and those that match patterns take the pattern second and use the same regular expression syntax. Load the package with library(stringr); no install is needed in WebR.
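As a rough illustration of that consistency, here is a small side-by-side sketch. The vector x is made up for this example; the point is only that the base R and stringr calls give the same results while the argument order differs.

```r
library(stringr)

# Hypothetical sample data for illustration
x <- c("  apple pie ", "banana")

# Trim whitespace: both take the string first
base_trim    <- trimws(x)
stringr_trim <- str_trim(x)

# Detect a pattern: note the different argument order
base_has_an    <- grepl("an", x)        # base R: pattern first, string second
stringr_has_an <- str_detect(x, "an")   # stringr: string first, pattern second

# Replace every match
base_swap    <- gsub("a", "A", x)
stringr_swap <- str_replace_all(x, "a", "A")

print(identical(base_trim, stringr_trim))
print(identical(base_has_an, stringr_has_an))
print(identical(base_swap, stringr_swap))
```

Because the string always comes first in stringr, these calls also chain naturally with a pipe, which is one reason many people prefer them for multi-step cleaning.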

Worked Example

A name-cleaning function handling multiple messy formats:

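library(stringr)

raw_names <- c(" john SMITH ", "Sarah_Lee", "DR. Chen, Wei", "mike.brown@email.com")

clean_name <- function(raw) {
  # Strip leading and trailing whitespace
  name <- str_trim(raw)
  # For email addresses, keep only the part before the @
  if (str_detect(name, "@")) name <- str_split(name, "@")[[1]][1]
  # Drop the "DR. " title first; once periods become spaces it would no longer match
  name <- str_remove(name, "DR\\. ")
  # Turn underscores and periods into spaces
  name <- str_replace_all(name, "[_.]", " ")
  # Reorder "Last, First" as "First Last"
  if (str_detect(name, ",")) {
    parts <- str_split(name, ", ")[[1]]
    name <- paste(parts[2], parts[1])
  }
  str_to_title(name)
}

cleaned <- sapply(raw_names, clean_name)
print(cleaned)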

Guided

Exercise 1 — Text Analyzer

Run this text analysis code and study what each stringr function returns. Experiment by changing my_text.

library(stringr)
my_text <- "The quick brown fox jumps over the lazy dog. The dog barked loudly."

cat("Characters:", str_length(my_text), "\n")
cat("Word count:", str_count(my_text, "\\b\\w+\\b"), "\n")
cat("Contains 'fox':", str_detect(my_text, "fox"), "\n")
cat("'the' occurrences:", str_count(tolower(my_text), "\\bthe\\b"), "\n")

# Split into sentences at each period followed by a space
# (the final sentence keeps its trailing period)
sentences <- str_split(my_text, "\\. ")[[1]]
cat("\nSentences:\n")
for (s in sentences) cat(" -", s, "\n")

# Replace 'dog' with 'cat'
cat("\nReplaced:", str_replace_all(my_text, "dog", "cat"), "\n")

# Extract all words longer than 4 chars
long_words <- str_extract_all(my_text, "\\b\\w{5,}\\b")[[1]]
cat("\nLong words:", paste(long_words, collapse=", "), "\n")


Hint: \\b in a regex pattern means “word boundary”. \\w+ matches one or more word characters (letters, digits, or underscores). \\w{5,} matches 5 or more word characters, so \\b\\w{5,}\\b matches whole words of at least 5 characters.
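To see what the word boundary actually changes, here is a minimal sketch on a made-up string (demo is not part of the exercise). It compares a plain pattern with the same pattern wrapped in \\b.

```r
library(stringr)

# Hypothetical test string chosen so "cat" appears inside longer words
demo <- "a cat scattered catnip"

# Without boundaries, "cat" also matches inside "scattered" and "catnip"
n_plain <- str_count(demo, "cat")
print(n_plain)      # 3

# With \\b on both sides, only the standalone word matches
n_bounded <- str_count(demo, "\\bcat\\b")
print(n_bounded)    # 1

# \\w{5,} between boundaries keeps only whole words of 5+ characters
long_words <- str_extract_all(demo, "\\b\\w{5,}\\b")[[1]]
print(long_words)   # "scattered" "catnip"
```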

Independent

Exercise 2 — Phone Number Formatter

Write a format_phone(raw) function that takes messy phone number strings and returns them in standard 213-555-0123 format. Test on all 4 formats below.

library(stringr)

format_phone <- function(raw) {
  # Step 1: remove all non-digit characters
  digits <- str_replace_all(raw, "[^0-9]", "")
  # Step 2: handle 11-digit numbers starting with 1
  if (nchar(digits) == 11 && substr(digits,1,1) == "1") digits <- substr(digits, 2, 11)
  # Step 3: format as XXX-XXX-XXXX
  if (nchar(digits) == 10) {
    paste0(substr(digits,1,3), "-", substr(digits,4,6), "-", substr(digits,7,10))
  } else {
    paste0("INVALID: ", raw)
  }
}

messy_phones <- c("(213) 555-0123", "213.555.0123", "12135550123", "213 555 0123")
cleaned <- sapply(messy_phones, format_phone)
for (i in seq_along(messy_phones)) {
  cat(sprintf("%-20s -> %s\n", messy_phones[i], cleaned[i]))
}


Hint: str_replace_all(x, "[^0-9]", "") removes everything that is NOT a digit. substr(digits, 1, 3) extracts the area code. nchar() gives the string length.
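One possible alternative, sketched here under the same rules as the exercise, does the formatting with regex capture groups instead of substr(). The name format_phone_vec is hypothetical, and this version is vectorized, so it takes the whole vector at once rather than going through sapply().

```r
library(stringr)

# Hypothetical vectorized variant of format_phone()
format_phone_vec <- function(raw) {
  # Remove everything that is not a digit
  digits <- str_replace_all(raw, "[^0-9]", "")
  # Valid input: 10 digits, optionally preceded by a leading country code 1
  pattern <- "^1?(\\d{3})(\\d{3})(\\d{4})$"
  ifelse(
    str_detect(digits, pattern),
    # \\1, \\2, \\3 refer back to the three captured digit groups
    str_replace(digits, pattern, "\\1-\\2-\\3"),
    paste0("INVALID: ", raw)
  )
}

phones <- c("(213) 555-0123", "213.555.0123", "12135550123",
            "213 555 0123", "555-0123")
formatted <- format_phone_vec(phones)
print(formatted)
```

The last entry has only 7 digits, so it comes back flagged as INVALID, while the other four all become 213-555-0123.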

Challenge

Exercise 3 — Email List Parser

Write parse_email_list(emails) that takes a character vector of "Name <email@domain.com>" strings and returns a data frame with columns: display_name, username, domain.

library(stringr)

parse_email_list <- function(emails) {
  # Extract display name (everything before ' <')
  display_name <- str_trim(str_extract(emails, "^[^<]+"))
  # Extract full email (inside < >)
  full_email <- str_extract(emails, "(?<=<)[^>]+")
  # Split email into username and domain
  username <- str_extract(full_email, "^[^@]+")
  domain   <- str_extract(full_email, "(?<=@).+")
  data.frame(display_name, username, domain, stringsAsFactors=FALSE)
}

email_list <- c(
  "Alice Smith <alice.smith@university.edu>",
  "Bob Johnson <bjohnson@gmail.com>",
  "Dr. Carol Lee <carol.lee@research.org>",
  "David Kim <d.kim@company.co>"
)

result <- parse_email_list(email_list)
print(result)


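Hint: (?<=<)[^>]+ uses a lookbehind: (?<=<) requires a < immediately before the match without including it, and [^>]+ then matches everything up to, but not including, the closing >. str_extract() returns NA if no match is found, which is helpful for error detection.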
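The NA behaviour can be used to flag malformed entries before building the data frame. A minimal sketch, using one made-up entry that has no angle brackets:

```r
library(stringr)

entries <- c(
  "Alice Smith <alice.smith@university.edu>",
  "no address here"   # hypothetical malformed entry for illustration
)

# The lookbehind finds nothing in the second entry, so it yields NA
full_email <- str_extract(entries, "(?<=<)[^>]+")
print(full_email)

# Use the NAs to list the entries that need manual attention
bad_entries <- entries[is.na(full_email)]
print(bad_entries)
```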

Mini Project — Text Analysis

Full Paragraph Analysis

Perform a complete text analysis of the paragraph below. Compute all 6 metrics and print a formatted report.

library(stringr)
text <- "Data science combines statistics, programming, and domain expertise to extract meaningful insights from data. Effective data scientists must communicate their findings clearly to both technical and non-technical audiences. The field has grown enormously in recent years, driven by the explosion of available data and advances in computing power."

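# 1. Total word count
# Non-letters become spaces first, so "non-technical" counts as two words
words <- str_split(tolower(str_replace_all(text, "[^a-zA-Z ]", " ")), "\\s+")[[1]]
# Drop empty strings left over from the split
words <- words[words != ""]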
cat("Word count:", length(words), "\n")

# 2. Unique words
cat("Unique words:", length(unique(words)), "\n")

# 3. Top 5 words (excluding common stopwords)
stopwords <- c("the","a","an","and","or","but","in","on","at","to","for",
                "of","with","by","from","is","are","has","have","been","their","both")
content_words <- words[!words %in% stopwords]
top5 <- sort(table(content_words), decreasing=TRUE)[1:5]
cat("\nTop 5 words:\n"); print(top5)

# 4. Longest word
cat("\nLongest word:", words[which.max(nchar(words))], "\n")

# 5. Percentage of words longer than 6 characters
long_pct <- round(mean(nchar(words) > 6) * 100, 1)
cat("% words > 6 chars:", long_pct, "%\n")

# 6. Capitalized words (original text, starts with uppercase)
orig_words <- str_extract_all(text, "\\b[A-Z][a-z]+\\b")[[1]]
cat("Capitalized words:", paste(orig_words, collapse=", "), "\n")


Hint: table(words) counts word frequencies. sort(..., decreasing=TRUE)[1:5] gets the top 5. nchar() works on vectors — mean(nchar(words) > 6) gives the proportion of long words.
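These building blocks are easier to follow on a tiny input. The sketch below uses a short made-up vector (toy_words) rather than the paragraph from the project, and needs only base R.

```r
# Hypothetical small word vector for illustration
toy_words <- c("data", "science", "data", "insights",
               "statistics", "data", "science")

# table() counts each distinct word; sort() puts the most frequent first
freq <- sort(table(toy_words), decreasing = TRUE)
print(head(freq, 2))   # data: 3, science: 2

# nchar() is vectorized, so the comparison gives one TRUE/FALSE per word;
# mean() of a logical vector is the proportion of TRUE values
long_share <- mean(nchar(toy_words) > 6)
print(round(long_share * 100, 1))   # 57.1
```

Four of the seven words have more than 6 characters, which gives 4/7, or about 57.1%.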

Lab 5 Complete!

You can now clean messy text, extract patterns with regex, and transform strings for real-world data cleaning tasks. String manipulation is essential for working with survey data, web scraping, and log files.

