
Module 8 of 8 — Machine Learning Basics

Capstone Project

Build Your Own ML Pipeline — from raw data to interpreted results



Before You Start

All 7 previous modules completed
You are comfortable with KNN, Decision Trees, and classification metrics

Estimated time: ~60 minutes

What you’ll do: Complete a full ML pipeline independently, compare three models, justify your choice, and make final predictions. A collapsible sample solution is provided at the end.

The Project Goal

Use everything you’ve learned to build a complete, end-to-end machine learning pipeline on the Iris dataset. You will make real decisions: which k to use for KNN, which depth for your Decision Tree, and which model to select as your final answer.

This mirrors what ML practitioners do on real projects, usually with more data and more features: load, split, train, evaluate, select, predict. The steps stay the same even as the datasets grow.

Your deliverables: A working pipeline with three trained models, a comparison table of their results, a written justification for your model selection, and 3 new predictions.

The Pipeline Steps

▶ Step 1

Load & Explore the Data

Load the Iris dataset. Print the shape, class names, class distribution, and feature value ranges (min/max) for all 4 features. Confirm there are no missing values.
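
The checks in Step 1 can be sketched without any ML library. The example below is a minimal illustration on a tiny made-up dataset; the rows, labels, and the `summarize` helper are hypothetical and not part of the course scaffold. On Iris, the same checks would run on `iris.data` and `iris.target`.

```python
import math
from collections import Counter

def summarize(rows, labels):
    """Return shape, class counts, per-feature (min, max), and a missing-value flag."""
    columns = list(zip(*rows))  # transpose: one tuple per feature
    return {
        "shape": (len(rows), len(rows[0])),
        "class_counts": dict(Counter(labels)),
        "ranges": [(min(col), max(col)) for col in columns],
        "has_missing": any(math.isnan(v) for row in rows for v in row),
    }

# Made-up measurements: 4 samples, 2 features, 2 classes.
rows = [[5.1, 1.4], [4.9, 1.5], [6.3, 4.7], [6.5, 4.6]]
labels = ["a", "a", "b", "b"]

summary = summarize(rows, labels)
for key, value in summary.items():
    print(f"{key}: {value}")
```

A balanced class count and plausible ranges are the signals to look for here; a `True` missing-value flag would mean the data needs cleaning before any model is trained.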

▶ Step 2

Split Into Train and Test

Use an 80/20 train/test split with random_state=42. Print the number of samples in each set.
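
To see what an 80/20 split with a fixed seed does, here is a rough pure-Python sketch. It is not how `train_test_split` is implemented and will not select the same rows as `random_state=42` does in scikit-learn; the `split_indices` helper is a hypothetical name used only for illustration.

```python
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle 0..n_samples-1 with a fixed seed; the first n_test indices become the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed, same shuffle, same split
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(150)
print(f"Training: {len(train_idx)} samples")
print(f"Test: {len(test_idx)} samples")
```

Fixing the seed is what makes the comparison between models fair: every model is trained and tested on exactly the same rows each time the code runs.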

▶ Step 3

Train Three Models

Model A: KNN with k=5
Model B: Decision Tree with max_depth=3
Model C: A third model of your choice. Options: KNN with a different k, Decision Tree with a different depth, or try GaussianNB from sklearn.naive_bayes.
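
Choosing a different k for Model C is a real decision, because k changes the prediction itself. The sketch below is a simplified hand-written KNN on made-up points, not scikit-learn's `KNeighborsClassifier` (the `knn_predict` helper and the data are assumptions for illustration). It shows one noisy neighbor deciding the answer at k=1 and being outvoted at k=3.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k):
    """Label `query` by majority vote among its k nearest training rows (Euclidean distance)."""
    by_distance = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up 2-feature points: class "a" clusters near (1, 1), class "b" near (5, 5),
# plus one "b" outlier sitting inside the "a" cluster.
train_rows = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.2, 4.8], [1.1, 1.0]]
train_labels = ["a", "a", "a", "b", "b", "b"]

query = [1.1, 1.05]
pred_k1 = knn_predict(train_rows, train_labels, query, k=1)  # decided by the outlier alone
pred_k3 = knn_predict(train_rows, train_labels, query, k=3)  # outlier is outvoted 2 to 1
print(f"k=1 -> {pred_k1}, k=3 -> {pred_k3}")
```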

▶ Step 4

Evaluate All Three

For each model, print the accuracy and the classification report (precision, recall, and F1 per class). Then print a table comparing the three models.
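
The numbers in a classification report can be reproduced by hand. The following sketch computes accuracy plus per-class precision, recall, and F1 from two label lists. The labels are made up for illustration and are not results from any model in this lesson; `per_class_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up true and predicted labels for three classes.
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2, 2, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}")
for label in sorted(set(y_true)):
    p, r, f = per_class_metrics(y_true, y_pred, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy case the overall accuracy hides the fact that class 0 is classified perfectly while classes 1 and 2 are confused with each other, which is exactly what the per-class rows reveal.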

▶ Step 5

Select the Best Model & Explain Why

Look at your results. Which model would you deploy? Add a print statement explaining your reasoning. Consider: accuracy, consistency across classes, and simplicity.
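
One way to make the selection reasoning explicit is to write the rule down as code. The sketch below prefers the highest accuracy and breaks ties in favor of the simpler model. The scores are placeholders, not measured results, and the `complexity` rank is a hypothetical hand-assigned value (lower means easier to explain).

```python
# Placeholder scores for illustration only; substitute the accuracies you measured.
candidates = [
    {"name": "Model A", "accuracy": 0.93, "complexity": 2},
    {"name": "Model B", "accuracy": 0.93, "complexity": 1},
    {"name": "Model C", "accuracy": 0.90, "complexity": 2},
]

# Selection key: highest accuracy first, then lowest complexity among ties.
best = min(candidates, key=lambda m: (-m["accuracy"], m["complexity"]))
print(
    f"Best model: {best['name']} "
    f"(accuracy {best['accuracy']:.0%}, complexity rank {best['complexity']})"
)
```

Consistency across classes could be added to the key in the same way, for example by including the lowest per-class F1 score.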

▶ Step 6

Make 3 New Predictions

Using your chosen best model, predict the species for these 3 new flower measurements:

Flower 1: [5.0, 3.4, 1.5, 0.2]
Flower 2: [6.7, 3.0, 5.2, 2.3]
Flower 3: [5.9, 3.0, 4.2, 1.5]

Print which species each flower is predicted to be.
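
The mechanics of Step 6 are: pass the measurements as a 2D list (one row per flower), get back one integer label per row, and look each label up in the class names. The sketch below shows only that flow. `StubModel` is a stand-in with invented thresholds, not a trained classifier, and the measurements are made up, so its outputs say nothing about the three flowers above.

```python
class StubModel:
    """Stand-in for a trained classifier; the petal-length thresholds are invented."""

    def predict(self, rows):
        # Like scikit-learn estimators, accept one row per sample
        # and return one integer label per row.
        return [0 if row[2] < 2.5 else 1 if row[2] < 5.0 else 2 for row in rows]

target_names = ["setosa", "versicolor", "virginica"]  # label order used by load_iris

model = StubModel()
samples = [[5.0, 3.0, 1.0, 0.1], [6.0, 3.0, 6.0, 2.0]]  # made-up measurements

# Predict all rows in one call, then translate each integer label to a name.
predicted_names = [target_names[label] for label in model.predict(samples)]
for row, name in zip(samples, predicted_names):
    print(f"{row} -> {name}")
```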

Your Turn — Complete the Pipeline

The scaffold below marks every place you need to add code with a # YOUR CODE HERE comment. Work through the steps in order. You can run the code at any point to check your progress.

import micropip
await micropip.install(['scikit-learn'])
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score
import numpy as np

print("=" * 55)
print("   MACHINE LEARNING CAPSTONE PROJECT")
print("   Iris Species Classification Pipeline")
print("=" * 55)

# ─────────────────────────────────────────
# STEP 1: Load & Explore the Data
# ─────────────────────────────────────────
print("\n STEP 1: LOAD & EXPLORE\n")
iris = load_iris()
X, y = iris.data, iris.target

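# YOUR CODE HERE:
# Print the dataset shape, class names, class distribution,
# feature ranges (min/max), and confirm there are no missing values
# print(f"Shape: {X.shape}")
# print(f"Classes: {list(iris.target_names)}")
# ... then class counts, per-feature min/max, and a missing-value check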

# ─────────────────────────────────────────
# STEP 2: Train/Test Split (80/20)
# ─────────────────────────────────────────
print("\n STEP 2: TRAIN/TEST SPLIT\n")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# YOUR CODE HERE: Print train and test sizes

# ─────────────────────────────────────────
# STEP 3: Train Three Models
# ─────────────────────────────────────────
print("\n STEP 3: TRAINING THREE MODELS\n")

# Model A: KNN (k=5)
model_a = KNeighborsClassifier(n_neighbors=5)
# YOUR CODE HERE: Fit model_a on training data

# Model B: Decision Tree (depth=3)
model_b = DecisionTreeClassifier(max_depth=3, random_state=42)
# YOUR CODE HERE: Fit model_b on training data

# Model C: YOUR CHOICE — replace with your chosen model
# Option 1: KNN with different k, e.g., KNeighborsClassifier(n_neighbors=7)
# Option 2: DecisionTreeClassifier(max_depth=5, random_state=42)
# Option 3: from sklearn.naive_bayes import GaussianNB; model_c = GaussianNB()
model_c = KNeighborsClassifier(n_neighbors=7)  # ← change this!
# YOUR CODE HERE: Fit model_c on training data

print("All 3 models trained!")

# ─────────────────────────────────────────
# STEP 4: Evaluate All Three
# ─────────────────────────────────────────
print("\n STEP 4: EVALUATION RESULTS\n")

models = [
    ("Model A — KNN (k=5)",          model_a),
    ("Model B — Decision Tree (d=3)", model_b),
    ("Model C — Your Choice",         model_c),
]

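# YOUR CODE HERE: Loop through models, predict, print accuracy
# and the classification report, then print a comparison table
# for name, model in models:
#     y_pred = model.predict(X_test)
#     acc = accuracy_score(y_test, y_pred)
#     print(f"{name}: Accuracy = {acc:.1%}")
#     print(classification_report(y_test, y_pred, target_names=iris.target_names))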

# ─────────────────────────────────────────
# STEP 5: Select the Best Model
# ─────────────────────────────────────────
print("\n STEP 5: BEST MODEL SELECTION\n")
# YOUR CODE HERE: Print which model you pick and WHY
# Consider accuracy, consistency across classes, and simplicity
# best_model = model_a  # ← replace with the model you chose; Step 6 uses best_model
# print("Best model: Model X because ...")

# ─────────────────────────────────────────
# STEP 6: Make 3 New Predictions
# ─────────────────────────────────────────
print("\n STEP 6: NEW PREDICTIONS\n")
new_flowers = np.array([
    [5.0, 3.4, 1.5, 0.2],
    [6.7, 3.0, 5.2, 2.3],
    [5.9, 3.0, 4.2, 1.5],
])
# YOUR CODE HERE: Use your best model to predict species for each flower
# for i, flower in enumerate(new_flowers):
#     pred = best_model.predict([flower])[0]
#     print(f"Flower {i+1} {flower} → {iris.target_names[pred]}")

print("\n Pipeline complete! Great work!")


Stuck on a step? Look back at the module that taught it: Step 1 → Module 2, Step 2 → Module 3, Steps 3–4 → Modules 4–5 & 7, Step 6 → Module 4. Read the scaffolding comments carefully — most lines just need to be uncommented and completed.

View Sample Solution (try it yourself first!)

import micropip
await micropip.install(['scikit-learn'])
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score
import numpy as np

print("=" * 55)
print("   MACHINE LEARNING CAPSTONE PROJECT")
print("   Iris Species Classification Pipeline")
print("=" * 55)

# STEP 1: Load & Explore
print("\n STEP 1: LOAD & EXPLORE\n")
iris = load_iris()
X, y = iris.data, iris.target
print(f"Shape: {X.shape[0]} samples × {X.shape[1]} features")
print(f"Classes: {list(iris.target_names)}")
for species, count in zip(iris.target_names, np.bincount(y)):
    print(f"  {species}: {count} samples")
print("\nFeature ranges:")
for i, name in enumerate(iris.feature_names):
    print(f"  {name}: [{X[:, i].min():.1f}, {X[:, i].max():.1f}]")
print(f"\nMissing values: {np.isnan(X).sum()}")

# STEP 2: Split
print("\n STEP 2: TRAIN/TEST SPLIT\n")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Training: {len(X_train)} samples")
print(f"Test: {len(X_test)} samples")

# STEP 3: Train Three Models
print("\n STEP 3: TRAINING THREE MODELS\n")
model_a = KNeighborsClassifier(n_neighbors=5)
model_a.fit(X_train, y_train)
model_b = DecisionTreeClassifier(max_depth=3, random_state=42)
model_b.fit(X_train, y_train)
model_c = KNeighborsClassifier(n_neighbors=7)
model_c.fit(X_train, y_train)
print("All 3 models trained!")

# STEP 4: Evaluate
print("\n STEP 4: EVALUATION RESULTS\n")
models = [
    ("Model A - KNN (k=5)", model_a),
    ("Model B - Decision Tree (d=3)", model_b),
    ("Model C - KNN (k=7)", model_c),
]
best_acc = 0
best_model = None
best_name = ""
reports = []  # per-class reports, printed after the comparison table
print(f"{'Model':<35} {'Accuracy':>10}")
print("-" * 47)
for name, model in models:
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"{name:<35} {acc:>10.1%}")
    reports.append((name, classification_report(y_test, y_pred, target_names=iris.target_names)))
    if acc > best_acc:  # strict ">" keeps the first model when accuracies tie
        best_acc = acc
        best_model = model
        best_name = name
for name, report in reports:
    print(f"\n{name}\n{report}")

# STEP 5: Best Model
print("\n STEP 5: BEST MODEL SELECTION\n")
print(f"Selected: {best_name} ({best_acc:.1%} accuracy)")
print("Reasoning: highest accuracy on the test set; on a tie, the first model listed is kept.")
print("If several models tie, also weigh interpretability: a shallow Decision Tree is easier to explain.")

# STEP 6: New Predictions
print("\n STEP 6: NEW PREDICTIONS\n")
new_flowers = np.array([
    [5.0, 3.4, 1.5, 0.2],
    [6.7, 3.0, 5.2, 2.3],
    [5.9, 3.0, 4.2, 1.5],
])
for i, flower in enumerate(new_flowers):
    pred = best_model.predict([flower])[0]
    print(f"Flower {i+1} {flower} → {iris.target_names[pred]}")

print("\n Pipeline complete!")

Key Takeaways from the Full Course

ML = learned patterns, not explicit rules. Show the model examples; it finds the patterns.
Always explore your data before modeling. Shape, class balance, missing values, ranges.
Preprocess before training: split into train/test, then fill missing values, encode categories, and scale features using statistics learned from the training set only.
The sklearn workflow is the same for every classifier in this course: import → create → fit → predict → evaluate. Swapping algorithms usually means changing only the line that creates the model.
Don’t trust accuracy alone. Use precision, recall, and F1 — especially when classes are imbalanced.
Overfitting is real. Watch for a large gap between train and test accuracy, and control it by limiting model complexity (a smaller max_depth for trees, a larger k for KNN).
ML is a cycle, not a finish line. Data, model, evaluate, improve, repeat.
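
The overfitting takeaway can be made concrete with a small experiment. The sketch below uses a simplified hand-written KNN on made-up one-feature data (both the data and the `knn_predict` and `accuracy` helpers are assumptions, not course code). With k=1 the model memorizes the training set, including one noisy point, so the gap between train and test accuracy is the thing to look at.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k):
    """Majority vote among the k nearest training rows (Euclidean distance)."""
    nearest = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train_rows, train_labels, rows, labels, k):
    """Fraction of `rows` whose KNN prediction matches `labels`."""
    hits = sum(
        knn_predict(train_rows, train_labels, row, k) == label
        for row, label in zip(rows, labels)
    )
    return hits / len(rows)

# Made-up data: class 0 sits low, class 1 sits high,
# with one noisy class-1 point (2.0) inside the class-0 region.
train_rows = [[1.0], [1.5], [2.0], [2.5], [3.0], [6.0], [6.5], [7.0]]
train_labels = [0, 0, 1, 0, 0, 1, 1, 1]
test_rows = [[1.9], [2.2], [6.2], [6.8]]
test_labels = [0, 0, 1, 1]

for k in (1, 3):
    train_acc = accuracy(train_rows, train_labels, train_rows, train_labels, k)
    test_acc = accuracy(train_rows, train_labels, test_rows, test_labels, k)
    print(f"k={k}: train={train_acc:.2f} test={test_acc:.2f} gap={train_acc - test_acc:.2f}")
```

On this toy data, k=1 scores perfectly on the training rows but misclassifies the test points near the noisy sample, while k=3 gives up a little training accuracy and generalizes better.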

What’s Next?

Python Practice Labs

10 hands-on Python labs with live code. Reinforce the Python fundamentals that power every ML project you build.

Data Analyst Course

Apply ML and data science skills in a career-track context. SQL, Python, Tableau, Power BI, and real business analysis.

Introduction to Statistics

The math behind ML. Distributions, hypothesis testing, correlation, and regression — understand why the algorithms work.

Machine Learning Basics — Complete!

You’ve finished all 8 modules. You understand what ML is, how it works, and you’ve trained real models in your browser. That’s something most people never do.

Share your achievement, explore the next courses above, or revisit any module to deepen your understanding.
