Evaluating Models
Why “99% accurate” can be completely useless — and what to measure instead
📌 Before You Start
- Modules 1–6 completed
- Understanding of binary classification (two classes: positive / negative)
Estimated time: ~55 minutes
What you’ll learn: The accuracy trap, the confusion matrix, precision, recall, F1-score, and which metric to prioritize in different real-world scenarios.
💡 The Big Idea
Imagine a dataset with 1,000 emails: 990 normal, 10 spam. A model that always predicts “not spam” achieves 99% accuracy — yet it’s completely useless because it never catches spam.
This is the accuracy trap. When classes are imbalanced, accuracy is a misleading metric.
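A quick sketch makes the trap concrete. Assuming the 990/10 split above, a "classifier" that always predicts "not spam" scores 99% accuracy while catching zero spam:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Imbalanced labels: 990 normal (0), 10 spam (1)
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that always predicts "not spam"
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.99 — looks impressive
print(recall_score(y_true, y_pred))    # 0.0  — catches no spam at all
```

Recall (defined below) exposes the failure that accuracy hides.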
Better evaluation requires understanding four types of outcomes: True Positives, True Negatives, False Positives, and False Negatives — the confusion matrix. From these, we derive precision, recall, and F1-score, each capturing a different aspect of model quality.
The right metric depends on your problem. In cancer detection, missing a real cancer (false negative) is catastrophic. In spam filtering, marking a real email as spam (false positive) is merely annoying. The stakes determine the metric.
🧠 How It Works
The Confusion Matrix
For a binary classifier (positive = has cancer, negative = no cancer):
| | Predicted Positive | Predicted Negative |
|---|---|---|
| **Actually Positive** | True Positive (TP) — correctly flagged cancer | False Negative (FN) — missed real cancer ⚠️ |
| **Actually Negative** | False Positive (FP) — wrongly flagged healthy | True Negative (TN) — correctly cleared healthy |
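In code, scikit-learn's `confusion_matrix` returns these four counts — note its convention: rows are actual classes, columns are predicted, so the layout is `[[TN, FP], [FN, TP]]`. A minimal sketch with made-up toy labels:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = has cancer, 0 = healthy (hypothetical example data)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# Rows = actual, columns = predicted:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")  # TP=2, FN=1, FP=1, TN=4
```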
The Metrics
Accuracy
= (TP + TN) / Total
Fraction of all predictions that were correct. Misleading when classes are imbalanced.
Precision
= TP / (TP + FP)
Of everything the model flagged as positive, how many were actually positive? High precision = few false alarms.
Recall (Sensitivity)
= TP / (TP + FN)
Of all actual positives, how many did we catch? High recall = few missed cases. Critical for medical diagnosis.
F1-Score
= 2 × (P × R) / (P + R)
Harmonic mean of precision and recall. Best single number when you need to balance both. Range: 0 to 1.
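The four formulas above are simple enough to compute by hand. A sketch using hypothetical counts from a cancer classifier (85 TP, 890 TN, 10 FP, 15 FN):

```python
# Hypothetical counts for illustration
tp, tn, fp, fn = 85, 890, 10, 15

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)            # of flagged positives, how many were real?
recall = tp / (tp + fn)               # of real positives, how many were caught?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Accuracy:  {accuracy:.3f}")   # 0.975
print(f"Precision: {precision:.3f}")  # 0.895
print(f"Recall:    {recall:.3f}")     # 0.850
print(f"F1-score:  {f1:.3f}")         # 0.872
```

Notice that accuracy (0.975) is the highest of the four — with 900 actual negatives in play, it is inflated by the easy majority class, while recall reveals that 15% of real cancers were missed.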
When to Prioritize Which Metric
| Scenario | Worst Error | Prioritize |
|---|---|---|
| Cancer detection | Missing cancer (FN) | Recall |
| Spam filter | Blocking real email (FP) | Precision |
| Fraud detection | Depends on cost of each | F1-Score |
| Balanced dataset | Equal concern | Accuracy (ok here) |
▶️ See It In Code
Full evaluation pipeline on the breast cancer dataset: confusion matrix and classification report.
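One way such a pipeline might look — a sketch using logistic regression as the classifier (any of the module's earlier models would slot in the same way); the train/test split parameters are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Built-in breast cancer dataset: 0 = malignant, 1 = benign
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Raw counts first, then per-class precision/recall/F1
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,
                            target_names=["malignant", "benign"]))
```

`classification_report` prints precision, recall, and F1 for each class side by side, which makes the tradeoffs discussed above directly visible.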
👋 Your Turn
Look at the output from the code above. In cancer detection, which is worse: a false positive or a false negative? Add code that calculates recall manually from the confusion matrix counts and prints a recommendation about which metric matters most here.
☕ Brain Break — 2 Minutes
The justice system faces the same tradeoff:
- False Positive = convicting an innocent person
- False Negative = acquitting a guilty person
Western legal systems say “innocent until proven guilty” — they are tuned for high precision. It’s considered worse to convict an innocent person than to let a guilty person go free.
Different societies and different problems make different choices. There’s no universally “correct” tradeoff — it always depends on the real-world consequences of each type of error. This is why understanding your domain matters as much as understanding your algorithm.
✅ Key Takeaways
- Accuracy is misleading when classes are imbalanced — always check class distribution first.
- The confusion matrix shows TP, TN, FP, FN — the raw counts behind every classification metric.
- Precision = “when you say positive, how often are you right?” High precision = fewer false alarms.
- Recall = “of all actual positives, how many did you catch?” High recall = fewer missed cases. Use for medical/safety problems.
- F1-score balances precision and recall. Use when you can’t clearly prioritize one over the other.
🎉 Module 7 Complete!
You now have a full toolkit: data prep, two classifiers, regression, and proper evaluation. Time to put it all together in the capstone!