Preparing Data
Raw data is messy — ML models need clean, numeric data in the right format
📌 Before You Start
- Modules 1 and 2 completed
- Comfortable with Python lists, loops, and dictionaries
Estimated time: ~50 minutes
What you’ll learn: Missing Values · Label Encoding · Feature Scaling · Train/Test Split
💡 The Big Idea
Real-world data is rarely clean. It has missing values, text categories that models can’t read, and features on wildly different scales that confuse distance-based algorithms.
Data preparation (also called preprocessing) turns raw messy data into the structured numeric format that ML algorithms require. This step is not glamorous — but data scientists report spending 60-80% of their time here. Getting it right makes or breaks your model.
The four core preparation steps: handle missing values → encode categories → scale features → split into train and test.
🧠 How It Works
Step 1 — Handling Missing Values
Missing data appears as gaps, None, NaN, or empty cells. Two main strategies:
- Drop: Remove rows (or columns) with missing values. Safe when you have lots of data and few missing values.
- Impute (fill): Replace missing values with a calculated substitute. Common choices: mean (for numeric data), median (more robust to outliers), or most frequent value (for categories).
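Mean imputation can be done in a few lines of plain Python. A minimal sketch (the `ages` values are made up for illustration):

```python
# Mean imputation: fill None values in a numeric column with the
# mean of the observed (non-missing) values.
ages = [25, None, 31, 40, None, 28]

observed = [a for a in ages if a is not None]
mean_age = sum(observed) / len(observed)   # (25 + 31 + 40 + 28) / 4 = 31.0

filled = [a if a is not None else mean_age for a in ages]
print(filled)  # [25, 31.0, 31, 40, 31.0, 28]
```

Swapping `mean_age` for the median of `observed` gives the outlier-robust variant mentioned above.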
Step 2 — Encoding Categorical Features
ML algorithms work with numbers only. Categories like "sales", "tech", "hr" must be converted:
- Label Encoding: Assign each category a number (sales=0, tech=1, hr=2). Quick, but implies an ordering that may not exist.
- One-Hot Encoding: Create a new binary column for each category (is_sales, is_tech, is_hr). No ordering implied. The safer default for linear and distance-based models; tree-based methods can usually tolerate plain label encoding.
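Both encodings can be sketched in plain Python, using the department categories from above:

```python
# Label vs. one-hot encoding of a categorical column.
departments = ["sales", "tech", "hr", "tech", "sales"]

# Label encoding: map each unique category to an integer.
categories = sorted(set(departments))            # ['hr', 'sales', 'tech']
label_map = {cat: i for i, cat in enumerate(categories)}
labels = [label_map[d] for d in departments]
print(labels)        # [1, 2, 0, 2, 1]

# One-hot encoding: one binary column per category, per row.
one_hot = [[1 if d == cat else 0 for cat in categories] for d in departments]
print(one_hot[0])    # "sales" -> [0, 1, 0]  (columns: hr, sales, tech)
```

Note how label encoding silently implies hr < sales < tech, while the one-hot rows treat all three as equally distant.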
Step 3 — Feature Scaling
Consider a dataset with age (20–70) and salary (30,000–200,000). Distance-based algorithms like KNN would be dominated by salary just because the numbers are larger — even if age is equally important. Scaling fixes this:
- Min-Max Scaling (Normalization): Scale all features to [0, 1]. Formula: (x - min) / (max - min)
- Standardization (Z-score): Center at 0 with std = 1. Formula: (x - mean) / std. More robust than min-max when the data has outliers.
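Both formulas are one-liners in plain Python. A small sketch on a made-up salary column (population standard deviation used for simplicity):

```python
# Min-max scaling vs. standardization on a small salary column.
salaries = [30_000, 50_000, 80_000, 200_000]

# Min-max: squeeze every value into [0, 1].
lo, hi = min(salaries), max(salaries)
minmax = [(s - lo) / (hi - lo) for s in salaries]
print([round(v, 3) for v in minmax])   # [0.0, 0.118, 0.294, 1.0]

# Standardization: subtract the mean, divide by the standard deviation.
mean = sum(salaries) / len(salaries)
std = (sum((s - mean) ** 2 for s in salaries) / len(salaries)) ** 0.5
zscores = [(s - mean) / std for s in salaries]
```

After standardization the column has mean 0 and std 1, so a salary and an age column contribute on comparable scales to a distance computation.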
Step 4 — Train/Test Split
You need to know how your model performs on data it has never seen. That’s why you hold out a test set before training.
Important: Never use the test set for any decisions during development. If you do, you're leaking information and your results will be overly optimistic.
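A minimal split is just a shuffle and a slice. A sketch with 100 stand-in samples and an 80/20 ratio:

```python
# A simple 80/20 train/test split: shuffle once, then slice.
import random

data = list(range(100))          # stand-in for 100 samples
random.seed(42)                  # fixed seed so the split is reproducible
random.shuffle(data)             # shuffle first, or the split is ordered

split = int(len(data) * 0.8)     # index where the 80% training part ends
train, test = data[:split], data[split:]
print(len(train), len(test))     # 80 20
```

The seed matters: without it, every run produces a different split and your results are hard to compare.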
▶️ See It In Code
All 4 preprocessing steps in sequence on a simulated messy dataset.
This is a read-only example. The interactive exercise is below.
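A minimal end-to-end sketch of the four steps on a tiny made-up dataset (field names and values are invented for illustration):

```python
# All 4 preprocessing steps in sequence on a simulated messy dataset.
import random

rows = [
    {"age": 25,   "dept": "sales"},
    {"age": None, "dept": "tech"},
    {"age": 31,   "dept": "hr"},
    {"age": 40,   "dept": "tech"},
    {"age": 28,   "dept": "sales"},
]

# Step 1: impute missing ages with the mean of the observed ages.
observed = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(observed) / len(observed)
for r in rows:
    if r["age"] is None:
        r["age"] = mean_age

# Step 2: label-encode the department column.
label_map = {cat: i for i, cat in enumerate(sorted({r["dept"] for r in rows}))}
for r in rows:
    r["dept"] = label_map[r["dept"]]

# Step 3: min-max scale age into [0, 1].
lo = min(r["age"] for r in rows)
hi = max(r["age"] for r in rows)
for r in rows:
    r["age"] = (r["age"] - lo) / (hi - lo)

# Step 4: shuffle and split 80/20.
random.seed(0)
random.shuffle(rows)
split = int(len(rows) * 0.8)
train, test = rows[:split], rows[split:]
print(len(train), len(test))   # 4 1
```

With only 5 rows the 20% test set is a single sample; real datasets need far more held-out data for the score to mean anything.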
👋 Your Turn
The code above uses an 80/20 split. Change the split to 70/30 and print how many samples end up in each set. Also add a print statement showing the percentage in each set.
Hint: the training percentage is train_size / n * 100. Use an f-string like f"Training: {train_size / n * 100:.1f}%" to format it nicely.
☕ Brain Break — 2 Minutes
Imagine you’re studying for a final exam. You have 100 practice problems.
- You study 80 of them — that’s your training set.
- You set aside 20 problems you’ve never seen — that’s your test set.
- The exam score on those 20 unseen problems tells you your true knowledge.
Now think: what would happen if you studied the test problems too? You’d do great on those 20 — but you wouldn’t know how you’d do on a truly new problem. This is exactly what data leakage means in ML.
The test set must stay sealed until the very end.
✅ Key Takeaways
- The 4 core preprocessing steps: Handle missing values → Encode categories → Scale features → Split data.
- Fill missing values with the mean (or median for skewed data) to avoid losing samples.
- ML models can only work with numbers — always encode text categories before training.
- Feature scaling is critical for distance-based models (like KNN) but less important for tree-based models.
- The test set must be completely untouched during development — it measures real-world performance.
🎉 Module 3 Complete!
You can now clean and prepare a dataset. In Module 4, you’ll use your prepared data to train your very first classifier!