Evaluating Models
Why “99% accurate” can be completely useless — and what to measure instead
📌 Before You Start
- Modules 1–6 completed
- Understanding of binary classification (two classes: positive / negative)
Estimated time: ~55 minutes
What you’ll learn: The accuracy trap, the confusion matrix, precision, recall, F1-score, and which metric to prioritize in different real-world scenarios.
💡 The Big Idea
Imagine a dataset with 1,000 emails: 990 normal, 10 spam. A model that always predicts “not spam” achieves 99% accuracy — yet it’s completely useless because it never catches spam.
This is the accuracy trap. When classes are imbalanced, accuracy is a misleading metric.
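A quick sketch makes the trap concrete. Assuming the 990/10 split above, a "classifier" that always predicts "not spam" scores 99% accuracy while catching zero spam:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Imbalanced labels: 990 normal (0), 10 spam (1)
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that always predicts "not spam"
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.99 — looks impressive
print(recall_score(y_true, y_pred))    # 0.0  — catches no spam at all
```

Recall (defined below) exposes the failure that accuracy hides.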
Better evaluation requires understanding four types of outcomes: True Positives, True Negatives, False Positives, and False Negatives — the confusion matrix. From these, we derive precision, recall, and F1-score, each capturing a different aspect of model quality.
The right metric depends on your problem. In cancer detection, missing a real cancer (false negative) is catastrophic. In spam filtering, marking a real email as spam (false positive) is merely annoying. The stakes determine the metric.
🧠 How It Works
The Confusion Matrix
For a binary classifier (positive = has cancer, negative = no cancer):
| | Predicted Positive | Predicted Negative |
|---|---|---|
| **Actually Positive** | True Positive (TP) — correctly flagged cancer | False Negative (FN) — missed real cancer ⚠️ |
| **Actually Negative** | False Positive (FP) — wrongly flagged healthy | True Negative (TN) — correctly cleared healthy |
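In code, scikit-learn's `confusion_matrix` returns these four counts — note its convention: rows are actual classes, columns are predicted, so the layout is `[[TN, FP], [FN, TP]]`. A minimal sketch with made-up toy labels:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = has cancer, 0 = healthy (hypothetical example data)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# Rows = actual, columns = predicted:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")  # TP=2, FN=1, FP=1, TN=4
```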
The Metrics
Accuracy
= (TP + TN) / Total
Fraction of all predictions that were correct. Misleading when classes are imbalanced.
Precision
= TP / (TP + FP)
Of everything the model flagged as positive, how many were actually positive? High precision = few false alarms.
Recall (Sensitivity)
= TP / (TP + FN)
Of all actual positives, how many did we catch? High recall = few missed cases. Critical for medical diagnosis.
F1-Score
= 2 × (P × R) / (P + R)
Harmonic mean of precision and recall. Best single number when you need to balance both. Range: 0 to 1.
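The four formulas above are simple enough to compute by hand. A sketch using hypothetical counts from a cancer classifier (85 TP, 890 TN, 10 FP, 15 FN):

```python
# Hypothetical counts for illustration
tp, tn, fp, fn = 85, 890, 10, 15

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)            # of flagged positives, how many were real?
recall = tp / (tp + fn)               # of real positives, how many were caught?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Accuracy:  {accuracy:.3f}")   # 0.975
print(f"Precision: {precision:.3f}")  # 0.895
print(f"Recall:    {recall:.3f}")     # 0.850
print(f"F1-score:  {f1:.3f}")         # 0.872
```

Notice that accuracy (0.975) is the highest of the four — with 900 actual negatives in play, it is inflated by the easy majority class, while recall reveals that 15% of real cancers were missed.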
When to Prioritize Which Metric
| Scenario | Worst Error | Prioritize |
|---|---|---|
| Cancer detection | Missing cancer (FN) | Recall |
| Spam filter | Blocking real email (FP) | Precision |
| Fraud detection | Depends on cost of each | F1-Score |
| Balanced dataset | Equal concern | Accuracy (ok here) |
▶️ See It In Code
Full evaluation pipeline on the breast cancer dataset: confusion matrix and classification report.
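One way such a pipeline might look — a sketch using logistic regression as the classifier (any of the module's earlier models would slot in the same way); the train/test split parameters are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Built-in breast cancer dataset: 0 = malignant, 1 = benign
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Raw counts first, then per-class precision/recall/F1
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,
                            target_names=["malignant", "benign"]))
```

`classification_report` prints precision, recall, and F1 for each class side by side, which makes the tradeoffs discussed above directly visible.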
👋 Your Turn
Look at the output from the code above. In cancer detection, which is worse: a false positive or a false negative? Add code that calculates recall manually from the confusion matrix counts and prints a recommendation about which metric matters most here.
☕ Brain Break — 2 Minutes
The justice system faces the same tradeoff:
- False Positive = convicting an innocent person
- False Negative = acquitting a guilty person
Western legal systems say “innocent until proven guilty” — they are tuned for high precision. It’s considered worse to convict an innocent person than to let a guilty person go free.
Different societies and different problems make different choices. There’s no universally “correct” tradeoff — it always depends on the real-world consequences of each type of error. This is why understanding your domain matters as much as understanding your algorithm.
✅ Key Takeaways
- Accuracy is misleading when classes are imbalanced — always check class distribution first.
- The confusion matrix shows TP, TN, FP, FN — the raw counts behind every classification metric.
- Precision = “when you say positive, how often are you right?” High precision = fewer false alarms.
- Recall = “of all actual positives, how many did you catch?” High recall = fewer missed cases. Use for medical/safety problems.
- F1-score balances precision and recall. Use when you can’t clearly prioritize one over the other.
🎉 Module 7 Complete!
You now have a full toolkit: data prep, two classifiers, regression, and proper evaluation. Time to put it all together in the capstone!