Decision Trees
Yes/no questions, branching logic, and the danger of memorizing your training data
📌 Before You Start
- Modules 1–4 completed
- You understand train/test accuracy as separate concepts
Estimated time: ~55 minutes
What you’ll learn: How decision trees split data, what Gini impurity means, how tree depth controls complexity, and how to recognize overfitting.
💡 The Big Idea
A decision tree asks a series of yes/no questions about your features, splitting the data at each step, until it arrives at a prediction.
It’s like the game 20 Questions, but the computer chooses which questions are most useful. “Is petal length < 2.5 cm?” If yes → almost certainly setosa. If no → ask another question.
Decision trees are hugely popular because they’re interpretable — you can actually read the rules the model learned. They also require no feature scaling (unlike KNN).
But they have one big weakness: a deep tree will memorize the training data instead of learning generalizable patterns. That’s overfitting — and controlling tree depth is how we fight it.
🧠 How It Works
How the Tree Chooses Its Questions
At each node, the algorithm tries every possible split on every feature and picks the one that creates the purest child groups. “Purity” means: one class dominates the group.
The most common measure of impurity is Gini impurity. A node containing only one class has Gini = 0 (perfectly pure). A two-class node split 50/50 has Gini = 0.5 (maximally impure for two classes). The tree always picks the split that reduces Gini the most.
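The Gini formula is 1 minus the sum of squared class proportions. Here is a minimal sketch of that computation (this is the same formula scikit-learn uses, though not its internal implementation):

```python
# Gini impurity of a group of labels: 1 - sum(p_k^2) over each class k.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["setosa"] * 10))                      # pure node -> 0.0
print(gini(["setosa"] * 5 + ["versicolor"] * 5))  # 50/50 two-class mix -> 0.5
```

A split is scored by the weighted average Gini of the child groups it creates; the tree greedily picks the split with the lowest score.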
An Example Split
Just two questions correctly classify most of the 150 flowers. That's the power of decision trees.
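You can see those two questions directly by fitting a depth-2 tree and printing its rules (a sketch assuming scikit-learn and its bundled iris dataset):

```python
# Fit a depth-2 tree on iris and print the learned yes/no questions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

print(export_text(tree, feature_names=iris.feature_names))
print("Training accuracy:", tree.score(iris.data, iris.target))  # roughly 0.96
```

The first question peels off all the setosa flowers; the second separates most versicolor from virginica.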
Depth and Overfitting
⚠️ Overfitting: When training accuracy = 100% but test accuracy is much lower, your model has memorized the training data rather than learning generalizable patterns. A depth-unlimited tree will almost always overfit.
| Max Depth | Train Accuracy | Test Accuracy | Verdict |
|---|---|---|---|
| 1 | ~67% | ~67% | Underfit (too simple) |
| 3 | ~98% | ~97% | Good balance |
| None | 100% | ~93% | Overfit (memorized) |
▶️ See It In Code
Training decision trees with different depths and reading the actual decision rules.
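A sketch of that comparison, assuming scikit-learn and a 70/30 train/test split (exact accuracies vary slightly with the split):

```python
# Compare train vs test accuracy across tree depths, then read the depth-3 rules.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

for depth in [1, 3, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")

# Read the actual rules the depth-3 tree learned
tree3 = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print(export_text(tree3, feature_names=load_iris().feature_names))
```

Watch the last row: the unlimited-depth tree hits 100% on training data, but its test accuracy drops below the depth-3 tree's.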
👋 Your Turn
Run the code below. When depth is set to None (unlimited), the training accuracy reaches 100%. Your task: explain in the comment why 100% training accuracy is not a good thing, and find the depth that gives the smallest gap between train and test accuracy.
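A starter scaffold for this exercise (a sketch assuming scikit-learn's iris dataset; your written explanation goes in the TODO comment):

```python
# Find the depth with the smallest train/test accuracy gap.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

best_depth, best_gap = None, float("inf")
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    gap = abs(tree.score(X_train, y_train) - tree.score(X_test, y_test))
    # TODO: explain why 100% training accuracy is NOT a good thing
    if gap < best_gap:
        best_depth, best_gap = depth, gap

print("Depth with smallest train/test gap:", best_depth)
```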
💡 Hint: loop with `for depth in range(1, 11):` and track which depth minimizes `abs(train_acc - test_acc)`.
☕ Brain Break — 2 Minutes
Imagine two students studying for an exam:
- Student A memorized every practice problem word-for-word. Gets 100% on practice tests.
- Student B learned the underlying concepts. Gets 92% on practice tests.
The real exam has new problems. Who does better?
Student A is our overfit model. Student B is our well-generalized model.
Overfitting in ML is exactly this: a model that’s so tuned to the training examples that it fails on anything new. Controlling tree depth is one way to force the model to learn real patterns, not just memorize.
✅ Key Takeaways
- Decision trees split data using yes/no questions on features, choosing splits that maximize purity (minimize Gini impurity).
- Overfitting = high train accuracy, much lower test accuracy. The model memorized training data.
- Control overfitting by limiting max_depth. A shallow tree generalizes better even if it doesn’t perfectly fit training data.
- Decision trees are interpretable — you can print and read the exact rules using `export_text`.
- The `feature_importances_` attribute tells you which features the tree relied on most — very useful for understanding your data.
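As a quick illustration of that last point, here is a sketch (assuming scikit-learn) of reading `feature_importances_` from a fitted tree:

```python
# Which features did the tree actually use? Importances sum to 1.0.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(iris.data, iris.target)

for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

On iris, you should see the petal measurements dominate — the tree barely needs the sepal features at all.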
🎉 Module 5 Complete!
You now understand classification trees and overfitting. Next, we switch from predicting categories to predicting numbers with linear regression.