Learn Without Walls
Module 7 of 8 — Machine Learning Basics

Evaluating Models

Why “99% accurate” can be completely useless — and what to measure instead


📌 Before You Start

Estimated time: ~55 minutes

What you’ll learn: The accuracy trap, the confusion matrix, precision, recall, F1-score, and which metric to prioritize in different real-world scenarios.

💡 The Big Idea

Imagine a dataset with 1,000 emails: 990 normal, 10 spam. A model that always predicts “not spam” achieves 99% accuracy — yet it’s completely useless because it never catches spam.

This is the accuracy trap. When classes are imbalanced, accuracy is a misleading metric.

Better evaluation requires understanding four types of outcomes: True Positives, True Negatives, False Positives, and False Negatives — the confusion matrix. From these, we derive precision, recall, and F1-score, each capturing a different aspect of model quality.

The right metric depends on your problem. In cancer detection, missing a real cancer (false negative) is catastrophic. In spam filtering, marking a real email as spam (false positive) is merely annoying. The stakes determine the metric.
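The spam example above can be checked in a few lines. Here is a minimal sketch using scikit-learn's metric functions on the exact counts from the text (990 normal, 10 spam):

```python
from sklearn.metrics import accuracy_score, recall_score

# The counts from the example above: 990 normal emails (0), 10 spam (1)
y_true = [0] * 990 + [1] * 10

# A "model" that always predicts "not spam"
y_pred = [0] * 1000

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2%}")  # 99.00%
print(f"Recall:   {recall_score(y_true, y_pred):.2%}")    # 0.00% — never catches spam
```

99% accuracy, 0% recall: exactly the trap described above.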

🧠 How It Works

The Confusion Matrix

For a binary classifier (positive = has cancer, negative = no cancer):

                     Predicted POSITIVE            Predicted NEGATIVE
Actual POSITIVE      TP — True Positive            FN — False Negative
                     (correctly flagged cancer)    (missed real cancer ⚠️)
Actual NEGATIVE      FP — False Positive           TN — True Negative
                     (wrongly flagged healthy)     (correctly cleared healthy)
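The four cells are easy to count by hand on a small example. This sketch uses made-up labels (1 = positive) and passes labels=[1, 0] to scikit-learn's confusion_matrix so the rows and columns come out positive-first, matching the table above:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# labels=[1, 0] orders rows and columns positive-first
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn = cm[0]
fp, tn = cm[1]
print(f"TP={tp}  FN={fn}  FP={fp}  TN={tn}")  # TP=3  FN=1  FP=1  TN=3
```

Counting by hand confirms it: of the four actual positives, three were caught (TP) and one missed (FN); of the four actual negatives, one was wrongly flagged (FP) and three correctly cleared (TN).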

The Metrics

Accuracy

= (TP + TN) / Total

Fraction of all predictions that were correct. Misleading when classes are imbalanced.

Precision

= TP / (TP + FP)

Of everything the model flagged as positive, how many were actually positive? High precision = few false alarms.

Recall (Sensitivity)

= TP / (TP + FN)

Of all actual positives, how many did we catch? High recall = few missed cases. Critical for medical diagnosis.

F1-Score

= 2 × (P × R) / (P + R)

Harmonic mean of precision and recall. Best single number when you need to balance both. Range: 0 to 1.
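The four formulas above can be verified by hand. This sketch uses made-up labels (the counts TP=3, FN=2, FP=1, TN=4 are illustrative, not from the cancer dataset) and checks the manual arithmetic against scikit-learn:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up labels for illustration: TP=3, FN=2, FP=1, TN=4
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
TP, FN, FP, TN = 3, 2, 1, 4

accuracy  = (TP + TN) / (TP + TN + FP + FN)                # 7/10 = 0.70
precision = TP / (TP + FP)                                 # 3/4  = 0.75
recall    = TP / (TP + FN)                                 # 3/5  = 0.60
f1        = 2 * precision * recall / (precision + recall)  # ≈ 0.67

# The hand-computed values should match scikit-learn's
assert abs(accuracy  - accuracy_score(y_true, y_pred))  < 1e-12
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall    - recall_score(y_true, y_pred))    < 1e-12
assert abs(f1        - f1_score(y_true, y_pred))        < 1e-9
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Notice that precision and recall tell different stories here: the model rarely raises false alarms (precision 0.75) but misses 40% of the real positives (recall 0.60), and F1 sits between them.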

When to Prioritize Which Metric

Scenario             Worst Error                  Prioritize
Cancer detection     Missing cancer (FN)          Recall
Spam filter          Blocking real email (FP)     Precision
Fraud detection      Depends on cost of each      F1-Score
Balanced dataset     Equal concern                Accuracy (ok here)

▶️ See It In Code

Full evaluation pipeline on the breast cancer dataset: confusion matrix and classification report.

import micropip
await micropip.install(['scikit-learn'])

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Binary classification: breast cancer detection
# y=0 malignant (cancer), y=1 benign (no cancer)
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Dataset: {len(X)} samples, {X.shape[1]} features")
print(f"Classes: {list(cancer.target_names)}")
print(f"Class counts: malignant={sum(y==0)}, benign={sum(y==1)}")

# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("\n=== Confusion Matrix ===")
print("                  Predicted Malignant  Predicted Benign")
print(f"Actual Malignant: {cm[0,0]:>18} {cm[0,1]:>16}")
print(f"Actual Benign:    {cm[1,0]:>18} {cm[1,1]:>16}")
print(f"\nTP (correctly flagged cancer):  {cm[0,0]}")
print(f"FN (missed real cancer):        {cm[0,1]} ← most dangerous!")
print(f"FP (wrongly flagged healthy):   {cm[1,0]}")
print(f"TN (correctly cleared healthy): {cm[1,1]}")

# Classification Report
print("\n=== Classification Report ===")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))

👋 Your Turn

Look at the output from the code above (or run the modified version below). In cancer detection, which is worse: a false positive or a false negative? Add code that calculates recall manually and prints a recommendation about which metric matters most here.

💡 Hint: A false negative in cancer detection means a patient with cancer is told they’re healthy. They won’t get treatment. A false positive means a healthy patient gets more tests — stressful, but survivable. Which error has more serious consequences?
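If you get stuck, here is one possible solution sketch. It rebuilds the same split and model as the main example so it stands on its own, then computes recall for the malignant class by hand:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Same data, split, and model as the main example
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
cm = confusion_matrix(y_test, knn.predict(X_test))

# Positive class = malignant (label 0), so the first row of cm
tp, fn = cm[0, 0], cm[0, 1]
recall = tp / (tp + fn)
print(f"Recall for malignant (computed manually): {recall:.3f}")
print("A false negative here is a missed cancer — prioritize recall.")
```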

☕ Brain Break — 2 Minutes

The justice system faces the same tradeoff:

Western legal systems say “innocent until proven guilty” — they are tuned for high precision. It’s considered worse to convict an innocent person than to let a guilty person go free.

Different societies and different problems make different choices. There’s no universally “correct” tradeoff — it always depends on the real-world consequences of each type of error. This is why understanding your domain matters as much as understanding your algorithm.

✅ Key Takeaways

- Accuracy alone is misleading on imbalanced data — a model that predicts nothing useful can still score 99%.
- The confusion matrix splits predictions into TP, TN, FP, and FN, the raw material for every other metric.
- Precision asks how many flagged positives were real; recall asks how many real positives were caught; F1 balances the two.
- Choose your metric from the real-world cost of each error: recall when misses are catastrophic, precision when false alarms are costly.

🎉 Module 7 Complete!

You now have a full toolkit: data prep, two classifiers, regression, and proper evaluation. Time to put it all together in the capstone!

Continue to Module 8: Capstone Project →
