Preparing Data
Raw data is messy — ML models need clean, numeric data in the right format
📌 Before You Start
- Modules 1 and 2 completed
- Comfortable with Python lists, loops, and dictionaries
Estimated time: ~50 minutes
What you’ll learn: Missing Values · Label Encoding · Feature Scaling · Train/Test Split
💡 The Big Idea
Real-world data is rarely clean. It has missing values, text categories that models can’t read, and features on wildly different scales that confuse distance-based algorithms.
Data preparation (also called preprocessing) turns raw messy data into the structured numeric format that ML algorithms require. This step is not glamorous — but data scientists report spending 60-80% of their time here. Getting it right makes or breaks your model.
The four core preparation steps: handle missing values → encode categories → scale features → split into train and test.
🧠 How It Works
Step 1 — Handling Missing Values
Missing data appears as gaps, None, NaN, or empty cells. Two main strategies:
- Drop: Remove rows (or columns) with missing values. Safe when you have lots of data and few missing values.
- Impute (fill): Replace missing values with a calculated substitute. Common choices: mean (for numeric data), median (more robust to outliers), or most frequent value (for categories).
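Mean imputation can be done in a few lines of plain Python. A minimal sketch (the `ages` values are made up for illustration):

```python
# Mean imputation: fill None values in a numeric column with the
# mean of the observed (non-missing) values.
ages = [25, None, 31, 40, None, 28]

observed = [a for a in ages if a is not None]
mean_age = sum(observed) / len(observed)   # (25 + 31 + 40 + 28) / 4 = 31.0

filled = [a if a is not None else mean_age for a in ages]
print(filled)  # [25, 31.0, 31, 40, 31.0, 28]
```

Swapping `mean_age` for the median of `observed` gives the outlier-robust variant mentioned above.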
Step 2 — Encoding Categorical Features
ML algorithms work with numbers only. Categories like "sales", "tech", "hr" must be converted:
- Label Encoding: Assign each category a number (sales=0, tech=1, hr=2). Quick, but implies an ordering that may not exist.
- One-Hot Encoding: Create a new binary column for each category (is_sales, is_tech, is_hr). No ordering implied. The safer default for linear and distance-based models; tree-based methods can usually tolerate plain label encoding.
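Both encodings can be sketched in plain Python, using the department categories from above:

```python
# Label vs. one-hot encoding of a categorical column.
departments = ["sales", "tech", "hr", "tech", "sales"]

# Label encoding: map each unique category to an integer.
categories = sorted(set(departments))            # ['hr', 'sales', 'tech']
label_map = {cat: i for i, cat in enumerate(categories)}
labels = [label_map[d] for d in departments]
print(labels)        # [1, 2, 0, 2, 1]

# One-hot encoding: one binary column per category, per row.
one_hot = [[1 if d == cat else 0 for cat in categories] for d in departments]
print(one_hot[0])    # "sales" -> [0, 1, 0]  (columns: hr, sales, tech)
```

Note how label encoding silently implies hr < sales < tech, while the one-hot rows treat all three as equally distant.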
Step 3 — Feature Scaling
Consider a dataset with age (20–70) and salary (30,000–200,000). Distance-based algorithms like KNN would be dominated by salary just because the numbers are larger — even if age is equally important. Scaling fixes this:
- Min-Max Scaling (Normalization): Scale all features to [0, 1]. Formula: (x - min) / (max - min)
- Standardization (Z-score): Center at 0 with std = 1. Formula: (x - mean) / std. More robust than min-max when the data has outliers.
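Both formulas are one-liners in plain Python. A small sketch on a made-up salary column (population standard deviation used for simplicity):

```python
# Min-max scaling vs. standardization on a small salary column.
salaries = [30_000, 50_000, 80_000, 200_000]

# Min-max: squeeze every value into [0, 1].
lo, hi = min(salaries), max(salaries)
minmax = [(s - lo) / (hi - lo) for s in salaries]
print([round(v, 3) for v in minmax])   # [0.0, 0.118, 0.294, 1.0]

# Standardization: subtract the mean, divide by the standard deviation.
mean = sum(salaries) / len(salaries)
std = (sum((s - mean) ** 2 for s in salaries) / len(salaries)) ** 0.5
zscores = [(s - mean) / std for s in salaries]
```

After standardization the column has mean 0 and std 1, so a salary and an age column contribute on comparable scales to a distance computation.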
Step 4 — Train/Test Split
You need to know how your model performs on data it has never seen. That’s why you hold out a test set before training.
Important: Never use the test set for any decisions during development. If you do, you're leaking information and your results will be overly optimistic.
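A minimal split is just a shuffle and a slice. A sketch with 100 stand-in samples and an 80/20 ratio:

```python
# A simple 80/20 train/test split: shuffle once, then slice.
import random

data = list(range(100))          # stand-in for 100 samples
random.seed(42)                  # fixed seed so the split is reproducible
random.shuffle(data)             # shuffle first, or the split is ordered

split = int(len(data) * 0.8)     # index where the 80% training part ends
train, test = data[:split], data[split:]
print(len(train), len(test))     # 80 20
```

The seed matters: without it, every run produces a different split and your results are hard to compare.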
▶️ See It In Code
All 4 preprocessing steps in sequence on a simulated messy dataset.
This is a read-only example. The interactive exercise is below.
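A minimal end-to-end sketch of the four steps on a tiny made-up dataset (field names and values are invented for illustration):

```python
# All 4 preprocessing steps in sequence on a simulated messy dataset.
import random

rows = [
    {"age": 25,   "dept": "sales"},
    {"age": None, "dept": "tech"},
    {"age": 31,   "dept": "hr"},
    {"age": 40,   "dept": "tech"},
    {"age": 28,   "dept": "sales"},
]

# Step 1: impute missing ages with the mean of the observed ages.
observed = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(observed) / len(observed)
for r in rows:
    if r["age"] is None:
        r["age"] = mean_age

# Step 2: label-encode the department column.
label_map = {cat: i for i, cat in enumerate(sorted({r["dept"] for r in rows}))}
for r in rows:
    r["dept"] = label_map[r["dept"]]

# Step 3: min-max scale age into [0, 1].
lo = min(r["age"] for r in rows)
hi = max(r["age"] for r in rows)
for r in rows:
    r["age"] = (r["age"] - lo) / (hi - lo)

# Step 4: shuffle and split 80/20.
random.seed(0)
random.shuffle(rows)
split = int(len(rows) * 0.8)
train, test = rows[:split], rows[split:]
print(len(train), len(test))   # 4 1
```

With only 5 rows the 20% test set is a single sample; real datasets need far more held-out data for the score to mean anything.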
👋 Your Turn
The code above uses an 80/20 split. Change the split to 70/30 and print how many samples end up in each set. Also add a print statement showing the percentage in each set.
Hint: the training percentage is train_size / n * 100. Use an f-string like f"Training: {train_size / n * 100:.1f}%" to format it nicely.
☕ Brain Break — 2 Minutes
Imagine you’re studying for a final exam. You have 100 practice problems.
- You study 80 of them — that’s your training set.
- You set aside 20 problems you’ve never seen — that’s your test set.
- The exam score on those 20 unseen problems tells you your true knowledge.
Now think: what would happen if you studied the test problems too? You’d do great on those 20 — but you wouldn’t know how you’d do on a truly new problem. This is exactly what data leakage means in ML.
The test set must stay sealed until the very end.
✅ Key Takeaways
- The 4 core preprocessing steps: Handle missing values → Encode categories → Scale features → Split data.
- Fill missing values with the mean (or median for skewed data) to avoid losing samples.
- ML models can only work with numbers — always encode text categories before training.
- Feature scaling is critical for distance-based models (like KNN) but less important for tree-based models.
- The test set must be completely untouched during development — it measures real-world performance.
🎉 Module 3 Complete!
You can now clean and prepare a dataset. In Module 4, you’ll use your prepared data to train your very first classifier!