Decision Trees
Yes/no questions, branching logic, and the danger of memorizing your training data
📌 Before You Start
- Modules 1–4 completed
- You understand train/test accuracy as separate concepts
Estimated time: ~55 minutes
What you’ll learn: How decision trees split data, what Gini impurity means, how tree depth controls complexity, and how to recognize overfitting.
💡 The Big Idea
A decision tree asks a series of yes/no questions about your features, splitting the data at each step, until it arrives at a prediction.
It’s like the game 20 Questions, but the computer chooses which questions are most useful. “Is petal length < 2.5 cm?” If yes → almost certainly setosa. If no → ask another question.
Decision trees are hugely popular because they’re interpretable — you can actually read the rules the model learned. They also require no feature scaling (unlike KNN).
But they have one big weakness: a deep tree will memorize the training data instead of learning generalizable patterns. That’s overfitting — and controlling tree depth is how we fight it.
🧠 How It Works
How the Tree Chooses Its Questions
At each node, the algorithm tries every possible split on every feature and picks the one that creates the purest child groups. “Purity” means: one class dominates the group.
The most common measure of impurity is Gini impurity. A node containing only one class has Gini = 0 (perfectly pure). A two-class node split 50/50 has Gini = 0.5 (maximally impure for two classes). The tree always picks the split that reduces Gini the most.
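The Gini formula is 1 minus the sum of squared class proportions. Here is a minimal sketch of that computation (this is the same formula scikit-learn uses, though not its internal implementation):

```python
# Gini impurity of a group of labels: 1 - sum(p_k^2) over each class k.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["setosa"] * 10))                      # pure node -> 0.0
print(gini(["setosa"] * 5 + ["versicolor"] * 5))  # 50/50 two-class mix -> 0.5
```

A split is scored by the weighted average Gini of the child groups it creates; the tree greedily picks the split with the lowest score.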
An Example Split
Just two questions correctly classify most of the 150 flowers. That's the power of decision trees.
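You can see those two questions directly by fitting a depth-2 tree and printing its rules (a sketch assuming scikit-learn and its bundled iris dataset):

```python
# Fit a depth-2 tree on iris and print the learned yes/no questions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

print(export_text(tree, feature_names=iris.feature_names))
print("Training accuracy:", tree.score(iris.data, iris.target))  # roughly 0.96
```

The first question peels off all the setosa flowers; the second separates most versicolor from virginica.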
Depth and Overfitting
⚠️ Overfitting: When training accuracy = 100% but test accuracy is much lower, your model has memorized the training data rather than learning generalizable patterns. A depth-unlimited tree will almost always overfit.
| Max Depth | Train Accuracy | Test Accuracy | Verdict |
|---|---|---|---|
| 1 | ~67% | ~67% | Underfit (too simple) |
| 3 | ~98% | ~97% | Good balance |
| None | 100% | ~93% | Overfit (memorized) |
▶️ See It In Code
Training decision trees with different depths and reading the actual decision rules.
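A sketch of that comparison, assuming scikit-learn and a 70/30 train/test split (exact accuracies vary slightly with the split):

```python
# Compare train vs test accuracy across tree depths, then read the depth-3 rules.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

for depth in [1, 3, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")

# Read the actual rules the depth-3 tree learned
tree3 = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print(export_text(tree3, feature_names=load_iris().feature_names))
```

Watch the last row: the unlimited-depth tree hits 100% on training data, but its test accuracy drops below the depth-3 tree's.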
👋 Your Turn
Run the code below. When depth is set to None (unlimited), the training accuracy reaches 100%. Your task: explain in the comment why 100% training accuracy is not a good thing, and find the depth that gives the smallest gap between train and test accuracy.
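A starter scaffold for this exercise (a sketch assuming scikit-learn's iris dataset; your written explanation goes in the TODO comment):

```python
# Find the depth with the smallest train/test accuracy gap.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

best_depth, best_gap = None, float("inf")
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    gap = abs(tree.score(X_train, y_train) - tree.score(X_test, y_test))
    # TODO: explain why 100% training accuracy is NOT a good thing
    if gap < best_gap:
        best_depth, best_gap = depth, gap

print("Depth with smallest train/test gap:", best_depth)
```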
💡 Hint: loop with `for depth in range(1, 11):` and track which depth minimizes `abs(train_acc - test_acc)`.
☕ Brain Break — 2 Minutes
Imagine two students studying for an exam:
- Student A memorized every practice problem word-for-word. Gets 100% on practice tests.
- Student B learned the underlying concepts. Gets 92% on practice tests.
The real exam has new problems. Who does better?
Student A is our overfit model. Student B is our well-generalized model.
Overfitting in ML is exactly this: a model that’s so tuned to the training examples that it fails on anything new. Controlling tree depth is one way to force the model to learn real patterns, not just memorize.
✅ Key Takeaways
- Decision trees split data using yes/no questions on features, choosing splits that maximize purity (minimize Gini impurity).
- Overfitting = high train accuracy, much lower test accuracy. The model memorized training data.
- Control overfitting by limiting max_depth. A shallow tree generalizes better even if it doesn’t perfectly fit training data.
- Decision trees are interpretable — you can print and read the exact rules using `export_text`.
- The `feature_importances_` attribute tells you which features the tree relied on most — very useful for understanding your data.
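As a quick illustration of that last point, here is a sketch (assuming scikit-learn) of reading `feature_importances_` from a fitted tree:

```python
# Which features did the tree actually use? Importances sum to 1.0.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(iris.data, iris.target)

for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

On iris, you should see the petal measurements dominate — the tree barely needs the sepal features at all.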
🎉 Module 5 Complete!
You now understand classification trees and overfitting. Next, we switch from predicting categories to predicting numbers with linear regression.