Learn Without Walls
Module 3 of 8 — Machine Learning Basics

Preparing Data

Raw data is messy — ML models need clean, numeric data in the right format


📌 Before You Start

Estimated time: ~50 minutes

What you’ll learn: Missing Values · Label Encoding · Feature Scaling · Train/Test Split

💡 The Big Idea

Real-world data is rarely clean. It has missing values, text categories that models can’t read, and features on wildly different scales that confuse distance-based algorithms.

Data preparation (also called preprocessing) turns raw messy data into the structured numeric format that ML algorithms require. This step is not glamorous — but data scientists report spending 60-80% of their time here. Getting it right makes or breaks your model.

The four core preparation steps: handle missing values → encode categories → scale features → split into train and test.

🧠 How It Works

Step 1 — Handling Missing Values

Missing data appears as gaps, None, NaN, or empty cells. Two main strategies: drop the rows (or columns) that contain missing values, or impute them — fill each gap with a substitute such as the column’s mean or median.
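Both strategies can be sketched in a few lines of plain Python (toy numbers here, not the module’s dataset):

```python
ages = [25, None, 35, 28, None]

# Strategy 1: drop every entry that is missing
dropped = [v for v in ages if v is not None]
print(dropped)  # [25, 35, 28]

# Strategy 2: impute -- fill each gap with the mean of the known values
mean_age = sum(dropped) / len(dropped)
imputed = [v if v is not None else mean_age for v in ages]
print(imputed)
```

Dropping is simple but throws away data; imputing keeps every row at the cost of inventing plausible values.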

Step 2 — Encoding Categorical Features

ML algorithms work with numbers only. Categories like "sales", "tech", "hr" must be converted — typically with label encoding (map each category to an integer) or one-hot encoding (one 0/1 column per category).
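A minimal sketch of both encodings, using a toy list of departments:

```python
departments = ['sales', 'tech', 'sales', 'hr']

# Label encoding: assign one integer per category
categories = sorted(set(departments))              # ['hr', 'sales', 'tech']
mapping = {cat: i for i, cat in enumerate(categories)}
encoded = [mapping[d] for d in departments]
print(encoded)  # [1, 2, 1, 0]

# One-hot encoding: one 0/1 column per category.
# Unlike label encoding, this avoids implying an order (hr < sales < tech).
one_hot = [[1 if d == cat else 0 for cat in categories] for d in departments]
print(one_hot)  # [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Label encoding is compact, but for categories with no natural order, one-hot is often the safer choice.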

Step 3 — Feature Scaling

Consider a dataset with age (20–70) and salary (30,000–200,000). Distance-based algorithms like KNN would be dominated by salary just because the numbers are larger — even if age is equally important. Scaling fixes this: min-max scaling squeezes each feature into [0, 1], while standardization (z-scores) shifts each feature to mean 0 and standard deviation 1.
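Both scaling methods, sketched on a toy salary column:

```python
salaries = [30000, 50000, 200000]

# Min-max scaling: (v - min) / (max - min) maps values into [0, 1]
mn, mx = min(salaries), max(salaries)
minmax = [(v - mn) / (mx - mn) for v in salaries]
print(minmax)

# Standardization (z-score): (v - mean) / std gives mean 0, std 1
mean = sum(salaries) / len(salaries)
std = (sum((v - mean) ** 2 for v in salaries) / len(salaries)) ** 0.5
zscores = [(v - mean) / std for v in salaries]
print([round(z, 2) for z in zscores])
```

Min-max is intuitive but sensitive to outliers (a single extreme value compresses everything else); standardization is the usual default when outliers are present.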

Step 4 — Train/Test Split

You need to know how your model performs on data it has never seen. That’s why you hold out a test set before training:

80% — Training set: the model learns from this. Larger = more learning material.
20% — Test set: held out completely until final evaluation and never shown to the model during training. This is the true performance measure.

Important: Never use the test set for any decisions during development. If you do, you're leaking information and your results will be overly optimistic.
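In practice the split is usually done on shuffled data, so that ordered datasets (say, sorted by date) don’t put all of one kind of sample into the test set. A minimal sketch with a hypothetical 10-sample dataset:

```python
import random

samples = list(range(10))   # toy dataset: 10 sample indices
random.seed(42)             # fixed seed so the split is reproducible
random.shuffle(samples)     # shuffle before splitting to avoid ordering bias

split = int(len(samples) * 0.8)
train, test = samples[:split], samples[split:]
print(len(train), len(test))  # 8 2
```

Every sample lands in exactly one set — the two never overlap, which is what keeps the test set a fair measure.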

▶️ See It In Code

All 4 preprocessing steps in sequence on a simulated messy dataset.

# ---- Simulated messy dataset ----
data = {
    'age': [25, None, 35, 28, 42, None, 31],
    'salary': [50000, 60000, None, 45000, 75000, 55000, 62000],
    'department': ['sales', 'tech', 'sales', 'hr', 'tech', 'hr', 'tech'],
    'promoted': [0, 1, 0, 0, 1, 0, 1]
}
n = len(data['promoted'])

# ---- STEP 1: Check and handle missing values ----
print("=== Step 1: Missing Values ===")
for col, vals in data.items():
    missing = sum(1 for v in vals if v is None)
    print(f"  {col}: {missing} missing")

ages = [v for v in data['age'] if v is not None]
mean_age = sum(ages) / len(ages)
data['age'] = [v if v is not None else mean_age for v in data['age']]

salaries = [v for v in data['salary'] if v is not None]
mean_salary = sum(salaries) / len(salaries)
data['salary'] = [v if v is not None else mean_salary for v in data['salary']]

print(f"\nFilled age with mean: {mean_age:.1f}")
print(f"Filled salary with mean: ${mean_salary:,.0f}")

# ---- STEP 2: Encode categorical feature ----
print("\n=== Step 2: Encoding Department ===")
dept_map = {'sales': 0, 'tech': 1, 'hr': 2}
data['dept_encoded'] = [dept_map[d] for d in data['department']]
print(f"  Mapping: {dept_map}")
print(f"  Encoded: {data['dept_encoded']}")

# ---- STEP 3: Scale features (min-max) ----
print("\n=== Step 3: Feature Scaling (Min-Max) ===")
for col in ['age', 'salary']:
    vals = data[col]
    mn, mx = min(vals), max(vals)
    scaled = [(v - mn) / (mx - mn) for v in vals]
    data[f'{col}_scaled'] = scaled
    print(f"  {col}: original range [{mn:.0f}, {mx:.0f}] → scaled range [0.0, 1.0]")
    print(f"  First 3 scaled: {[round(s, 3) for s in scaled[:3]]}")

# ---- STEP 4: Train/test split (80/20) ----
print("\n=== Step 4: Train/Test Split (80/20) ===")
split = int(n * 0.8)
print(f"  Total samples: {n}")
print(f"  Training: {split} samples (indices 0 to {split-1})")
print(f"  Test: {n - split} samples (indices {split} to {n-1})")

This is a read-only example. The interactive exercise is below.

👋 Your Turn

The code above uses an 80/20 split. Change the split to 70/30 and print how many samples end up in each set. Also add a print statement showing the percentage in each set.

💡 Hint: To get the percentage, use train_size / n * 100. Use an f-string like f"Training: {train_size / n * 100:.1f}%" to format it nicely.

☕ Brain Break — 2 Minutes

Imagine you’re studying for a final exam. You have 100 practice problems, so you study 80 of them and save the remaining 20 as a mock exam to check how you’d do on the real thing.

Now think: what would happen if you studied the test problems too? You’d do great on those 20 — but you wouldn’t know how you’d do on a truly new problem. This is exactly what data leakage means in ML.

The test set must stay sealed until the very end.

✅ Key Takeaways

Handle missing values by dropping incomplete rows or imputing them (for example, with the mean).
Encode text categories as numbers before training.
Scale features so no feature dominates just because its numbers are larger.
Split data into training and test sets — and never touch the test set until final evaluation.

🎉 Module 3 Complete!

You can now clean and prepare a dataset. In Module 4, you’ll use your prepared data to train your very first classifier!

Continue to Module 4: Your First Classifier →
