Learn Without Walls
← Module 06: Charts & Visualization 🏠 Course Home Module 08: Capstone Project →
Module 07 of 08  |  Excel & Google Sheets for Data

🧹 Data Cleaning

Real-world data is messy — inconsistent capitalization, extra spaces, blank cells, duplicates, and formatting errors. Before you can analyze anything, you need to clean it. This module gives you the tools.

⏰ ~40 minutes 🎓 Intermediate 📋 Requires: Modules 01–03
📌 Before You Start
Why data cleaning matters: Garbage In, Garbage Out

A classic saying in data work: GIGO — Garbage In, Garbage Out. If your source data has errors, every formula, chart, and pivot table built on it will reflect those errors. A COUNTIF looking for "Electronics" won't find " electronics " (extra space, wrong case). A VLOOKUP looking for "P001" won't find "P001 " (trailing space).

Studies estimate that data analysts spend 60–80% of their time cleaning data. Learning to do it efficiently is one of the highest-value spreadsheet skills you can develop.

📋 Sample Data — The Messy Dataset
Enter this intentionally dirty data into a new sheet

The red cells show problems. Your job is to clean them.

A — NameB — EmailC — CategoryD — Score
  alice johnson   alice@example.com ELECTRONICS 88
BOB MARTINEZ bob@example.com  books 74
Chloe Kim chloe@example.com Clothing   95
david okafor david@example.com Electronics (blank)
 EMMA PATEL  emma@example.com BOOKS 67
Felix Nguyen felix@example.com Electronics 90
grace LEE grace@example.com clothing 55
Hassan Ali hassan@example.com Books 78
Alice Johnson alice@example.com Electronics 88

Issues to fix: extra spaces (TRIM), inconsistent capitalization (PROPER/UPPER/LOWER), a blank Score cell (IFERROR/ISBLANK), and one duplicate row (Alice Johnson appears twice).

💡 The Concept
The Data Cleaning Toolkit — 8 Essential Functions
TRIM
Removes leading, trailing, and extra internal spaces. =TRIM(A2)
" alice johnson " → "alice johnson". This is the most commonly needed cleaning function.
CLEAN
Removes non-printable characters (control characters, line breaks embedded in cells). =CLEAN(A2)
Often needed after importing data from PDFs, databases, or old systems.
PROPER
Capitalizes the first letter of each word. =PROPER(A2)
"alice johnson" → "Alice Johnson". "BOB MARTINEZ" → "Bob Martinez".
UPPER
Converts all text to uppercase. =UPPER(A2)
"alice johnson" → "ALICE JOHNSON". Use when standards require all-caps (e.g., product codes).
LOWER
Converts all text to lowercase. =LOWER(A2)
"ALICE JOHNSON" → "alice johnson". Useful before comparing text in formulas (case-insensitive matching).
TEXT
Formats a number as text in a specific pattern. =TEXT(A2, "MM/DD/YYYY")
Converts dates, numbers to formatted strings. =TEXT(1234.5, "#,##0.00") → "1,234.50".
ISBLANK
Returns TRUE if a cell is empty. =ISBLANK(D2)
Use with IF: =IF(ISBLANK(D2), "Missing", D2) → replaces blanks with "Missing".
IFERROR
Catches formula errors and replaces them with a friendly message. =IFERROR(VLOOKUP(...), "Not Found")
Instead of #N/A or #DIV/0!, show something meaningful.
🔧 Step-by-Step
Combining Cleaning Functions

The real power comes from nesting cleaning functions together. You can fix multiple issues in one formula:

Fix leading/trailing spaces AND capitalize properly: =PROPER(TRIM(A2)) → " alice johnson " → "Alice Johnson" Fix spaces, capitalization, AND handle errors: =IFERROR(PROPER(TRIM(A2)), "Invalid") Fix a whole name column with spaces and weird capitalization: =TRIM(PROPER(A2))

Work from the inside out — the innermost function runs first. So =PROPER(TRIM(A2)) first trims A2, then applies PROPER to the trimmed result.

How to clean a column without overwriting the original data:

  1. Add a new column next to your dirty column — for example, if A has dirty names, use column E for clean names.
  2. In E2, type: =PROPER(TRIM(A2)). Press Enter.
  3. Copy E2 down for all rows.
  4. Once satisfied, select E2:E10, press Ctrl+C to copy.
  5. Click cell A2. Go to Edit → Paste Special → Values only (or Ctrl+Shift+V). This pastes the cleaned text, not the formulas.
  6. Now you can delete column E — the original column A now has clean values.
Why Paste Special → Values? If you just paste normally, the formulas in column E would paste as formulas referencing column A — creating a circular nightmare. Values-only paste converts them to plain text first.
🔧 Step-by-Step
Find & Replace: Fix Systematic Errors Instantly

If "Electronics" was consistently misspelled as "Elecronics" across 500 rows, Find & Replace fixes all 500 in one operation.

  1. Press Ctrl+H (Windows) or Cmd+H (Mac) to open Find & Replace.
  2. In the "Find" field, type: Elecronics.
  3. In the "Replace with" field, type: Electronics.
  4. Click "Replace all". A confirmation shows how many replacements were made.
  5. You can also use it to fix inconsistent spellings: "books" → "Books", "BOOKS" → "Books".
  6. Check "Match case" if you need case-sensitive replacement.
🔧 Step-by-Step
Removing Duplicate Rows

Duplicate rows appear in data exports, merged datasets, or when data is entered twice. Removing them is a built-in one-click operation.

  1. Click any cell inside your data table.
  2. Go to Data → Data cleanup → Remove duplicates.
  3. A dialog appears. Check "Data has header row" if your first row has headers.
  4. Select which columns to use for duplicate detection. Checking all columns means a row is only a duplicate if every column matches. Checking just "Name" means any row with the same name is considered a duplicate.
  5. Click Remove duplicates. A confirmation shows how many rows were removed.
Excel equivalent: Select your data → Data tab → Remove Duplicates → choose columns → OK. Same behavior, different location.
Before removing duplicates: Make a backup copy of your sheet first (right-click the sheet tab → Duplicate). Always confirm the right rows are being removed before deleting.
🔧 Step-by-Step
Handling Blank Cells

Blank cells cause issues in formulas and pivot tables. Here's how to handle them:

Find all blank cells:

  1. Select the range you want to check (e.g., D2:D10 for Score column).
  2. Press Ctrl+G (Go to). In Excel: Home → Find & Select → Go to Special → Blanks.
  3. In Google Sheets: use Ctrl+H (Find & Replace), leave Find empty, type "0" (or "N/A") in Replace with, check "Match entire cell contents".

Fill blanks with a formula:

Replace blank with "N/A": =IF(ISBLANK(D2), "N/A", D2) Replace blank with the column average: =IF(ISBLANK(D2), AVERAGE(D$2:D$10), D2) Handle division-by-zero errors: =IFERROR(B2/C2, 0) ← shows 0 if C2 is blank or zero
🔧 The Data Cleaning Workflow
A Repeatable Process for Any Dirty Dataset
Inspect first. Scroll through the data. Look for obvious issues: mixed capitalization, extra spaces, errors (#N/A, #VALUE!), blank rows/columns.
Make a backup copy. Right-click the sheet tab → Duplicate. Work on the copy, not the original.
Fix spaces with TRIM. Apply to all text columns, especially Name, Category, and any lookup keys.
Fix capitalization with PROPER/UPPER/LOWER. Standardize all category and name columns to consistent case.
Fix systematic errors with Find & Replace. Correct known misspellings and inconsistent labels.
Handle blanks. Decide: fill with "N/A", fill with averages, or flag for manual review.
Remove duplicates. Check for duplicate rows after confirming your criteria.
Validate. Use COUNTA, COUNT, and COUNTIF to confirm row counts before and after. Spot-check a few rows manually.
Convert formulas to values. Paste Special → Values Only to lock in the cleaned data without formulas.
🖐 Your Turn
Exercise: Clean the Messy Dataset

Using the dirty dataset from the beginning of this module:

  1. Duplicate the sheet before you start (right-click tab → Duplicate). Call it "Cleaned".
  2. In a new column E, write a formula that applies PROPER(TRIM(A2)) to fix names. Copy it down for all rows.
  3. In column F, apply PROPER(TRIM(C2)) to standardize Category values.
  4. In column G, use IF(ISBLANK(D2), "Missing", D2) to handle the blank Score.
  5. Paste Special → Values only for columns E, F, G back into columns A, C, D.
  6. Use Find & Replace to fix any remaining category inconsistencies.
  7. Use Data → Remove duplicates to remove the duplicate Alice Johnson row.
  8. After cleaning, run a COUNTIF to verify: how many "Electronics", "Books", and "Clothing" entries do you have?

Expected result after cleaning:

NameEmailCategoryScore
Alice Johnsonalice@example.comElectronics88
Bob Martinezbob@example.comBooks74
Chloe Kimchloe@example.comClothing95
David Okafordavid@example.comElectronicsMissing
Emma Patelemma@example.comBooks67
Felix Nguyenfelix@example.comElectronics90
Grace Leegrace@example.comClothing55
Hassan Alihassan@example.comBooks78
Hint: After removing the duplicate Alice row, you should have 8 rows of data (9 original − 1 duplicate = 8).
🧠 Brain Break
The Billion-Dollar Spreadsheet Error

In 2012, JPMorgan Chase lost more than $6 billion in trading losses — and a key contributing factor was a spreadsheet error. An analyst had copy-pasted a formula that divided instead of averaging due to a row insertion. The value-at-risk model produced numbers that were half of what they should have been, masking the true risk of a massive trading position.

Data cleaning and validation aren't just best practices — they're how you prevent catastrophic mistakes. Always validate your data, check your formulas, and never skip the "inspect first" step.

✅ Wrap Up
Module 07 Key Takeaways

Next: Module 08 — Capstone Project →