Data Cleaning — From Messy to Ready
Real data is always messy. Learning to find and fix the mess is most of the job.
Time: ~20 minutes
What you need: Google Sheets open at sheets.google.com, and your Module 2 SQL knowledge fresh — you will see SQL terms show up again in this module.
What you’ll do: You’ll manually enter a messy dataset (with intentional errors built in), identify every problem, and fix each one. This is exactly what you do in a real job — except the dataset is bigger.
Real data is messy. Always.
When data comes from the real world, it comes with problems:
- Duplicates — the same row entered twice (or three times)
- Missing values — blank cells where there should be data
- Inconsistent formatting — “south” and “South” and “SOUTH” all meaning the same thing
- Wrong data types — a number stored as text, so you can’t calculate with it (e.g., “750abc”)
- Date format chaos — Jan 17 2024, 2024-01-17, and 01/17/2024 all in the same column
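To see why these problems bite, here is a small sketch in plain Python (the values are made up for illustration): inconsistent casing makes one region look like three, and a number stored as text cannot be summed until it is cleaned.

```python
# Illustrative values only — this mirrors the problem types listed above.
regions = ["south", "South", "SOUTH"]

# To a computer these are three different strings, so a group-by
# would produce three groups instead of one.
print(len(set(regions)))                   # 3 distinct values before cleaning
print(len({r.title() for r in regions}))   # 1 distinct value after normalizing

# A number with text mixed in cannot be used in arithmetic as-is.
sales = "750abc"
digits = "".join(ch for ch in sales if ch.isdigit())
print(int(digits))                         # 750
```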
Data analysts spend 60–80% of their time cleaning data before they can analyze it. This is not a bug — it is the job. New analysts are often surprised by this. Now you know.
Key phrase: “Garbage in, garbage out.” If you analyze dirty data, your insights are wrong. A trusted analyst is one whose work can be relied on — and that starts with clean data.
Data cleaning is the difference between an analyst whose work gets used and one whose work gets questioned. Every decision a company makes based on your analysis is only as good as the data you cleaned.
In SQL (which you learned in Module 2), dirty data shows up as NULL values and requires DISTINCT to remove duplicates. In Python (Phase 3), pandas has methods specifically for this. In every tool, data cleaning is the first step. Always.
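To connect this back to Module 2, here is a sketch using Python's built-in sqlite3 module (chosen here only so the example runs anywhere; any SQL engine behaves the same way). It shows how a duplicate row is collapsed by DISTINCT and how a NULL silently drops out of an aggregate.

```python
import sqlite3

# In-memory table with a duplicate row and a NULL, mirroring the
# problem types described above. Table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("John Smith", 500), ("John Smith", 500), ("Sarah Lee", None)],
)

# DISTINCT collapses the duplicate row: 3 inserted, 2 come back.
rows = conn.execute("SELECT DISTINCT customer, amount FROM sales").fetchall()
print(len(rows))  # 2

# Aggregates skip NULLs — Sarah Lee's missing amount adds nothing.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)      # 1000
```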
You are going to enter a messy dataset on purpose — errors and all — and then fix it.
| ID | Customer | Sales | Date | Region |
|---|---|---|---|---|
| 1 | John Smith | 500 | 2024-01-15 | North |
| 2 | john smith | 500 | 2024-01-15 | North |
| 3 | Sarah Lee | (leave blank) | 2024-01-16 | south |
| 4 | Mike Brown | 750abc | Jan 17 2024 | SOUTH |
| 5 | Sarah Lee | 300 | 2024-01-18 | South |
- 🔴 Rows 1 and 2 are the same customer (John Smith / john smith) with the same sale on the same date — this is a duplicate.
- 🟡 Row 3 (Sarah Lee) has no Sales value — this is a missing value (NULL in SQL terms).
- 🔵 “south”, “SOUTH”, and “South” are all in the Region column — this is inconsistent capitalization.
- 🔵 “750abc” in Row 4 is a number with text mixed in — this is a wrong data type.
- 🔵 “Jan 17 2024” in Row 4 is a different date format than the rest — this is an inconsistent date format.
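On five rows you can spot all of this by eye. On fifty thousand you cannot, so here is a sketch of how the same checks can be automated in plain Python (the dataset is hard-coded and the field names are this sketch's own choice, not a required format):

```python
# The rows mirror the exercise table above.
rows = [
    {"id": 1, "customer": "John Smith", "sales": "500",    "date": "2024-01-15",  "region": "North"},
    {"id": 2, "customer": "john smith", "sales": "500",    "date": "2024-01-15",  "region": "North"},
    {"id": 3, "customer": "Sarah Lee",  "sales": "",       "date": "2024-01-16",  "region": "south"},
    {"id": 4, "customer": "Mike Brown", "sales": "750abc", "date": "Jan 17 2024", "region": "SOUTH"},
    {"id": 5, "customer": "Sarah Lee",  "sales": "300",    "date": "2024-01-18",  "region": "South"},
]

# Duplicates: same customer (ignoring case), same amount, same date.
seen, duplicates = set(), []
for r in rows:
    key = (r["customer"].lower(), r["sales"], r["date"])
    if key in seen:
        duplicates.append(r["id"])
    seen.add(key)

missing  = [r["id"] for r in rows if r["sales"] == ""]                     # blank cells
bad_type = [r["id"] for r in rows if r["sales"] and not r["sales"].isdigit()]  # text in a number
regions  = sorted({r["region"] for r in rows})                             # casing variants

print(duplicates)  # [2]
print(missing)     # [3]
print(bad_type)    # [4]
print(regions)     # ['North', 'SOUTH', 'South', 'south']
```

Each flagged ID corresponds to one of the problems listed above, which is exactly how automated checks and eyeball checks should agree.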
- Duplicate: Delete row 2 (the lowercase “john smith” row). Keep only one record.
- Wrong data type: Fix “750abc” → change to 750.
- Inconsistent Region: Change “south” and “SOUTH” to South. Every remaining row should now read “South” or “North” consistently.
- Date format: Change “Jan 17 2024” to 2024-01-17 to match the other dates.
- Missing value: Leave the blank Sales cell as-is for now. In a real job, you would go back to the source data or contact the person who entered it. Never guess at a missing value and silently fill it in.
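The five fixes above, applied in code, might look like the sketch below (plain Python, not the canonical method; the date parsing assumes the one “Mon DD YYYY” variant seen in this dataset, and the missing value is kept as None rather than guessed):

```python
from datetime import datetime

rows = [
    {"id": 1, "customer": "John Smith", "sales": "500",    "date": "2024-01-15",  "region": "North"},
    {"id": 2, "customer": "john smith", "sales": "500",    "date": "2024-01-15",  "region": "North"},
    {"id": 3, "customer": "Sarah Lee",  "sales": "",       "date": "2024-01-16",  "region": "south"},
    {"id": 4, "customer": "Mike Brown", "sales": "750abc", "date": "Jan 17 2024", "region": "SOUTH"},
    {"id": 5, "customer": "Sarah Lee",  "sales": "300",    "date": "2024-01-18",  "region": "South"},
]

# Fix 1 — duplicates: keep the first row per (customer, amount, date),
# comparing customer names case-insensitively.
seen, clean = set(), []
for r in rows:
    key = (r["customer"].lower(), r["sales"], r["date"])
    if key not in seen:
        seen.add(key)
        clean.append(dict(r))

for r in clean:
    # Fix 2 — wrong data type: strip the stray text, "750abc" -> 750.
    r["sales"] = int("".join(ch for ch in r["sales"] if ch.isdigit())) if r["sales"] else None
    # Fix 3 — inconsistent casing: "south"/"SOUTH" -> "South".
    r["region"] = r["region"].title()
    # Fix 4 — date format: normalize "Jan 17 2024" to ISO "2024-01-17".
    if not r["date"].startswith("20"):
        r["date"] = datetime.strptime(r["date"], "%b %d %Y").strftime("%Y-%m-%d")
    # Fix 5 — missing value: left as None, flagged rather than guessed.

print([r["id"] for r in clean])              # [1, 3, 4, 5]
print(clean[2]["date"], clean[2]["region"])  # 2024-01-17 South
```

Note the order: duplicates are removed first, so the later fixes only touch rows you are keeping.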
Data cleaning is detail-focused work. Your brain just shifted into a careful, methodical mode. That is valuable — and it is tiring. Close your eyes for 30 seconds. Notice the quiet.
The ONE thing to remember from this module: garbage in, garbage out. Clean your data before you trust any analysis built on it.
What comes next: Module 4 puts SQL and Sheets together in the complete analyst workflow: query → export → calculate → visualize. This is the daily flow of the job.