Data Cleaning — From Messy to Ready
Real data is always messy. Learning to find and fix the mess is most of the job.
Time: ~20 minutes
What you need: Google Sheets open at sheets.google.com, and your Module 2 SQL knowledge fresh — you will see SQL terms show up again in this module.
What you’ll do: You’ll manually enter a messy dataset (with intentional errors built in), identify every problem, and fix each one. This is exactly what you do in a real job — except the dataset is bigger.
Real data is messy. Always.
When data comes from the real world, it comes with problems:
- Duplicates — the same row entered twice (or three times)
- Missing values — blank cells where there should be data
- Inconsistent formatting — “south” and “South” and “SOUTH” all meaning the same thing
- Wrong data types — a number stored as text, so you can’t calculate with it (e.g., “750abc”)
- Date format chaos — Jan 17 2024, 2024-01-17, and 01/17/2024 all in the same column
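To see why these problems bite, here is a small sketch in plain Python (the values are made up for illustration): inconsistent casing makes one region look like three, and a number stored as text cannot be summed until it is cleaned.

```python
# Illustrative values only — this mirrors the problem types listed above.
regions = ["south", "South", "SOUTH"]

# To a computer these are three different strings, so a group-by
# would produce three groups instead of one.
print(len(set(regions)))                   # 3 distinct values before cleaning
print(len({r.title() for r in regions}))   # 1 distinct value after normalizing

# A number with text mixed in cannot be used in arithmetic as-is.
sales = "750abc"
digits = "".join(ch for ch in sales if ch.isdigit())
print(int(digits))                         # 750
```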
Data analysts spend 60–80% of their time cleaning data before they can analyze it. This is not a bug — it is the job. New analysts are often surprised by this. Now you know.
Key phrase: “Garbage in, garbage out.” If you analyze dirty data, your insights are wrong. A trusted analyst is one whose work can be relied on — and that starts with clean data.
Data cleaning is the difference between an analyst whose work gets used and one whose work gets questioned. Every decision a company makes based on your analysis is only as good as the data you cleaned.
In SQL (which you learned in Module 2), dirty data shows up as NULL values and requires DISTINCT to remove duplicates. In Python (Phase 3), pandas has methods specifically for this. In every tool, data cleaning is the first step. Always.
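To connect this back to Module 2, here is a sketch using Python's built-in sqlite3 module (chosen here only so the example runs anywhere; any SQL engine behaves the same way). It shows how a duplicate row is collapsed by DISTINCT and how a NULL silently drops out of an aggregate.

```python
import sqlite3

# In-memory table with a duplicate row and a NULL, mirroring the
# problem types described above. Table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("John Smith", 500), ("John Smith", 500), ("Sarah Lee", None)],
)

# DISTINCT collapses the duplicate row: 3 inserted, 2 come back.
rows = conn.execute("SELECT DISTINCT customer, amount FROM sales").fetchall()
print(len(rows))  # 2

# Aggregates skip NULLs — Sarah Lee's missing amount adds nothing.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)      # 1000
```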
You are going to enter a messy dataset on purpose — errors and all — and then fix it.
| ID | Customer | Sales | Date | Region |
|---|---|---|---|---|
| 1 | John Smith | 500 | 2024-01-15 | North |
| 2 | john smith | 500 | 2024-01-15 | North |
| 3 | Sarah Lee | (leave blank) | 2024-01-16 | south |
| 4 | Mike Brown | 750abc | Jan 17 2024 | SOUTH |
| 5 | Sarah Lee | 300 | 2024-01-18 | South |
- 🔴 Rows 1 and 2 are the same customer (John Smith / john smith) with the same sale on the same date — this is a duplicate.
- 🟡 Row 3 (Sarah Lee) has no Sales value — this is a missing value (NULL in SQL terms).
- 🔵 “south”, “SOUTH”, and “South” are all in the Region column — this is inconsistent capitalization.
- 🔵 “750abc” in Row 4 is a number with text mixed in — this is a wrong data type.
- 🔵 “Jan 17 2024” in Row 4 is a different date format than the rest — this is an inconsistent date format.
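On five rows you can spot all of this by eye. On fifty thousand you cannot, so here is a sketch of how the same checks can be automated in plain Python (the dataset is hard-coded and the field names are this sketch's own choice, not a required format):

```python
# The rows mirror the exercise table above.
rows = [
    {"id": 1, "customer": "John Smith", "sales": "500",    "date": "2024-01-15",  "region": "North"},
    {"id": 2, "customer": "john smith", "sales": "500",    "date": "2024-01-15",  "region": "North"},
    {"id": 3, "customer": "Sarah Lee",  "sales": "",       "date": "2024-01-16",  "region": "south"},
    {"id": 4, "customer": "Mike Brown", "sales": "750abc", "date": "Jan 17 2024", "region": "SOUTH"},
    {"id": 5, "customer": "Sarah Lee",  "sales": "300",    "date": "2024-01-18",  "region": "South"},
]

# Duplicates: same customer (ignoring case), same amount, same date.
seen, duplicates = set(), []
for r in rows:
    key = (r["customer"].lower(), r["sales"], r["date"])
    if key in seen:
        duplicates.append(r["id"])
    seen.add(key)

missing  = [r["id"] for r in rows if r["sales"] == ""]                     # blank cells
bad_type = [r["id"] for r in rows if r["sales"] and not r["sales"].isdigit()]  # text in a number
regions  = sorted({r["region"] for r in rows})                             # casing variants

print(duplicates)  # [2]
print(missing)     # [3]
print(bad_type)    # [4]
print(regions)     # ['North', 'SOUTH', 'South', 'south']
```

Each flagged ID corresponds to one of the problems listed above, which is exactly how automated checks and eyeball checks should agree.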
- Duplicate: Delete row 2 (the lowercase “john smith” row). Keep only one record.
- Wrong data type: Fix “750abc” → change to 750.
- Inconsistent Region: Change “south” and “SOUTH” to South. Every remaining row should now read “South” or “North” consistently.
- Date format: Change “Jan 17 2024” to 2024-01-17 to match the other dates.
- Missing value: Leave the blank Sales cell as-is for now. In a real job, you would go back to the source data or contact the person who entered it. Never guess at a missing value and silently fill it in.
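The five fixes above, applied in code, might look like the sketch below (plain Python, not the canonical method; the date parsing assumes the one “Mon DD YYYY” variant seen in this dataset, and the missing value is kept as None rather than guessed):

```python
from datetime import datetime

rows = [
    {"id": 1, "customer": "John Smith", "sales": "500",    "date": "2024-01-15",  "region": "North"},
    {"id": 2, "customer": "john smith", "sales": "500",    "date": "2024-01-15",  "region": "North"},
    {"id": 3, "customer": "Sarah Lee",  "sales": "",       "date": "2024-01-16",  "region": "south"},
    {"id": 4, "customer": "Mike Brown", "sales": "750abc", "date": "Jan 17 2024", "region": "SOUTH"},
    {"id": 5, "customer": "Sarah Lee",  "sales": "300",    "date": "2024-01-18",  "region": "South"},
]

# Fix 1 — duplicates: keep the first row per (customer, amount, date),
# comparing customer names case-insensitively.
seen, clean = set(), []
for r in rows:
    key = (r["customer"].lower(), r["sales"], r["date"])
    if key not in seen:
        seen.add(key)
        clean.append(dict(r))

for r in clean:
    # Fix 2 — wrong data type: strip the stray text, "750abc" -> 750.
    r["sales"] = int("".join(ch for ch in r["sales"] if ch.isdigit())) if r["sales"] else None
    # Fix 3 — inconsistent casing: "south"/"SOUTH" -> "South".
    r["region"] = r["region"].title()
    # Fix 4 — date format: normalize "Jan 17 2024" to ISO "2024-01-17".
    if not r["date"].startswith("20"):
        r["date"] = datetime.strptime(r["date"], "%b %d %Y").strftime("%Y-%m-%d")
    # Fix 5 — missing value: left as None, flagged rather than guessed.

print([r["id"] for r in clean])              # [1, 3, 4, 5]
print(clean[2]["date"], clean[2]["region"])  # 2024-01-17 South
```

Note the order: duplicates are removed first, so the later fixes only touch rows you are keeping.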
Data cleaning is detail-focused work. Your brain just shifted into a careful, methodical mode. That is valuable — and it is tiring. Close your eyes for 30 seconds. Notice the quiet.
The ONE thing to remember from this module: garbage in, garbage out. Clean your data before you trust any analysis built on it.
What comes next: Module 4 puts SQL and Sheets together in the complete analyst workflow: query → export → calculate → visualize. This is the daily flow of the job.