🧹 Data Cleaning
Real-world data is messy — inconsistent capitalization, extra spaces, blank cells, duplicates, and formatting errors. Before you can analyze anything, you need to clean it. This module gives you the tools.
A classic saying in data work: GIGO — Garbage In, Garbage Out. If your source data has errors, every formula, chart, and pivot table built on it will reflect those errors. A COUNTIF looking for "Electronics" won't find " electronics " (extra space, wrong case). A VLOOKUP looking for "P001" won't find "P001 " (trailing space).
Studies estimate that data analysts spend 60–80% of their time cleaning data. Learning to do it efficiently is one of the highest-value spreadsheet skills you can develop.
The red cells show problems. Your job is to clean them.
| A — Name | B — Email | C — Category | D — Score |
|---|---|---|---|
| alice johnson | alice@example.com | ELECTRONICS | 88 |
| BOB MARTINEZ | bob@example.com | books | 74 |
| Chloe Kim | chloe@example.com | Clothing | 95 |
| david okafor | david@example.com | Electronics | (blank) |
| EMMA PATEL | emma@example.com | BOOKS | 67 |
| Felix Nguyen | felix@example.com | Electronics | 90 |
| grace LEE | grace@example.com | clothing | 55 |
| Hassan Ali | hassan@example.com | Books | 78 |
| Alice Johnson | alice@example.com | Electronics | 88 |
Issues to fix: extra spaces (TRIM), inconsistent capitalization (PROPER/UPPER/LOWER), a blank Score cell (IFERROR/ISBLANK), and one duplicate row (Alice Johnson appears twice).
=TRIM(A2)" alice johnson " → "alice johnson". This is the most commonly needed cleaning function.
=CLEAN(A2)Often needed after importing data from PDFs, databases, or old systems.
=PROPER(A2)"alice johnson" → "Alice Johnson". "BOB MARTINEZ" → "Bob Martinez".
=UPPER(A2)"alice johnson" → "ALICE JOHNSON". Use when standards require all-caps (e.g., product codes).
=LOWER(A2)"ALICE JOHNSON" → "alice johnson". Useful before comparing text in formulas (case-insensitive matching).
=TEXT(A2, "MM/DD/YYYY")Converts dates, numbers to formatted strings.
=TEXT(1234.5, "#,##0.00") → "1,234.50".=ISBLANK(D2)Use with IF:
=IF(ISBLANK(D2), "Missing", D2) → replaces blanks with "Missing".=IFERROR(VLOOKUP(...), "Not Found")Instead of #N/A or #DIV/0!, show something meaningful.
The real power comes from nesting cleaning functions together. You can fix multiple issues in one formula:
Work from the inside out — the innermost function runs first. So =PROPER(TRIM(A2)) first trims A2, then applies PROPER to the trimmed result.
How to clean a column without overwriting the original data:
- Add a new column next to your dirty column — for example, if A has dirty names, use column E for clean names.
- In E2, type:
=PROPER(TRIM(A2)). Press Enter. - Copy E2 down for all rows.
- Once satisfied, select E2:E10, press Ctrl+C to copy.
- Click cell A2. Go to Edit → Paste Special → Values only (or Ctrl+Shift+V). This pastes the cleaned text, not the formulas.
- Now you can delete column E — the original column A now has clean values.
If "Electronics" was consistently misspelled as "Elecronics" across 500 rows, Find & Replace fixes all 500 in one operation.
- Press Ctrl+H (Windows) or Cmd+H (Mac) to open Find & Replace.
- In the "Find" field, type: Elecronics.
- In the "Replace with" field, type: Electronics.
- Click "Replace all". A confirmation shows how many replacements were made.
- You can also use it to fix inconsistent spellings: "books" → "Books", "BOOKS" → "Books".
- Check "Match case" if you need case-sensitive replacement.
Duplicate rows appear in data exports, merged datasets, or when data is entered twice. Removing them is a built-in one-click operation.
- Click any cell inside your data table.
- Go to Data → Data cleanup → Remove duplicates.
- A dialog appears. Check "Data has header row" if your first row has headers.
- Select which columns to use for duplicate detection. Checking all columns means a row is only a duplicate if every column matches. Checking just "Name" means any row with the same name is considered a duplicate.
- Click Remove duplicates. A confirmation shows how many rows were removed.
Blank cells cause issues in formulas and pivot tables. Here's how to handle them:
Find all blank cells:
- Select the range you want to check (e.g., D2:D10 for Score column).
- Press Ctrl+G (Go to). In Excel: Home → Find & Select → Go to Special → Blanks.
- In Google Sheets: use Ctrl+H (Find & Replace), leave Find empty, type "0" (or "N/A") in Replace with, check "Match entire cell contents".
Fill blanks with a formula:
Using the dirty dataset from the beginning of this module:
- Duplicate the sheet before you start (right-click tab → Duplicate). Call it "Cleaned".
- In a new column E, write a formula that applies
PROPER(TRIM(A2))to fix names. Copy it down for all rows. - In column F, apply
PROPER(TRIM(C2))to standardize Category values. - In column G, use
IF(ISBLANK(D2), "Missing", D2)to handle the blank Score. - Paste Special → Values only for columns E, F, G back into columns A, C, D.
- Use Find & Replace to fix any remaining category inconsistencies.
- Use Data → Remove duplicates to remove the duplicate Alice Johnson row.
- After cleaning, run a COUNTIF to verify: how many "Electronics", "Books", and "Clothing" entries do you have?
Expected result after cleaning:
| Name | Category | Score | |
|---|---|---|---|
| Alice Johnson | alice@example.com | Electronics | 88 |
| Bob Martinez | bob@example.com | Books | 74 |
| Chloe Kim | chloe@example.com | Clothing | 95 |
| David Okafor | david@example.com | Electronics | Missing |
| Emma Patel | emma@example.com | Books | 67 |
| Felix Nguyen | felix@example.com | Electronics | 90 |
| Grace Lee | grace@example.com | Clothing | 55 |
| Hassan Ali | hassan@example.com | Books | 78 |
In 2012, JPMorgan Chase lost more than $6 billion in trading losses — and a key contributing factor was a spreadsheet error. An analyst had copy-pasted a formula that divided instead of averaging due to a row insertion. The value-at-risk model produced numbers that were half of what they should have been, masking the true risk of a massive trading position.
Data cleaning and validation aren't just best practices — they're how you prevent catastrophic mistakes. Always validate your data, check your formulas, and never skip the "inspect first" step.
TRIM(A1)removes extra spaces. Use it on every imported text column, always.PROPER(A1)capitalizes first letter of each word.UPPERandLOWERconvert entirely.CLEAN(A1)removes non-printable characters — critical for PDF or database exports.TEXT(A1, "format")converts numbers and dates to formatted text.ISBLANK(A1)detects empty cells.IFERROR(formula, "fallback")catches errors.- Find & Replace (Ctrl+H) fixes systematic errors across entire columns instantly.
- Remove duplicates is under Data → Data cleanup in Google Sheets.
- Always make a backup copy before cleaning. Always paste values-only when replacing originals.