Your First Classifier — K-Nearest Neighbors
Find the closest examples, take a vote, make a prediction
📚 This module uses scikit-learn. The first time you run a code block, it will install scikit-learn via micropip (~10–20 seconds). Be patient — subsequent runs are instant!
📌 Before You Start
- Modules 1–3 completed (especially understanding train/test split)
- Patience for the first scikit-learn load (~10–20 seconds total)
Estimated time: ~55 minutes
What you’ll learn: How K-Nearest Neighbors works, how to use scikit-learn’s API (fit → predict → score), and how the choice of k affects results.
💡 The Big Idea
KNN asks: “What do the k most similar examples in my training set look like?” Then it takes a majority vote among those neighbors.
No complicated math. No training phase. When you call .predict() on a new flower, KNN just finds the 3 (or 5, or k) closest flowers in the training set and says “it’s probably the same species as most of them.”
It’s like asking your k nearest neighbors what they think — and going with the majority.
Simple, intuitive, and surprisingly effective for many problems. And it’s a perfect starting point for understanding the scikit-learn API that all ML algorithms share.
🧠 How It Works
KNN Step by Step
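In plain terms: measure the distance from the new point to every training example, pick the k closest, and let them vote. A minimal from-scratch sketch of those steps (plain NumPy, with a hypothetical helper name `knn_predict` — sklearn's real implementation is more optimized):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest neighbors."""
    # Step 1: compute the Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: find the indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Step 3: take a majority vote among those neighbors' labels.
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Tiny 2-D example: two well-separated clusters of points.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # 0
print(knn_predict(X_train, y_train, np.array([5.1, 5.0]), k=3))  # 1
```

Notice there is no training step at all: "fitting" a KNN model is just storing the data.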
Why k Matters
- k = 1: Follows training data perfectly. Very sensitive to noise. Often overfit.
- k = 3 or 5: Good balance for many datasets. Filters noise while staying responsive.
- k = 50+: Smooths over everything. May miss local patterns (underfit).
There’s no universal “best k.” You find it by experimenting — exactly what your exercise will do.
The sklearn API (same for every algorithm!)
This 5-step pattern works for every sklearn classifier: decision trees, random forests, SVMs. Learn it once, use it everywhere.
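A sketch of that shared pattern in action — the same `fit` → `predict` → `score` calls applied to two different classifiers. The decision tree here is just a stand-in for any other sklearn estimator; only the constructor line changes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42)

# Same steps, different algorithm -- only the "create" step differs.
for model in (KNeighborsClassifier(n_neighbors=5),
              DecisionTreeClassifier(random_state=42)):
    model.fit(X_train, y_train)               # fit on training data
    predictions = model.predict(X_test)       # predict on new data
    accuracy = model.score(X_test, y_test)    # score on held-out data
    print(f"{type(model).__name__}: {accuracy:.2f}")
```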
▶️ See It In Code
A complete KNN pipeline on the Iris dataset. This will install scikit-learn — first run takes 10–20 seconds.
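A sketch of such a pipeline, assuming k = 3 and a 30% test split (the sample flower measurements are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Load the Iris dataset (150 flowers, 4 features, 3 species).
iris = load_iris()
X, y = iris.data, iris.target

# 2. Hold out a test set; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 3. Create and fit the classifier (fitting just stores the data).
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# 4. Predict a single new flower (sepal/petal measurements in cm).
new_flower = [[5.1, 3.5, 1.4, 0.2]]
print("Predicted species:", iris.target_names[knn.predict(new_flower)[0]])

# 5. Score on the held-out test set.
print(f"Test accuracy: {knn.score(X_test, y_test):.2f}")
```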
👋 Your Turn
Run the code below and find out which value of k gives the best accuracy on the Iris test set. Record the results and explain what you observe.
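One way to set up the experiment — a sketch that loops over candidate k values and prints the test accuracy for each:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Try a range of k values and record the test accuracy for each.
for k in [1, 3, 5, 7, 10, 15, 20]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(f"k={k:2d}  accuracy={knn.score(X_test, y_test):.3f}")
```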
Try k_values like [1, 3, 5, 7, 10, 15, 20]. Notice how accuracy changes. Which tends to be better — very small k or very large k?
☕ Brain Break — 2 Minutes
You’re new to a city and trying to decide where to eat. You ask your 3 nearest neighbors:
- Neighbor 1: “Go to the pizza place on 5th!”
- Neighbor 2: “Pizza on 5th is great!”
- Neighbor 3: “Try the sushi place instead.”
Majority vote: 2 pizza vs 1 sushi → you go for pizza. That’s exactly KNN.
Now imagine asking 100 neighbors. Some live far away and don’t even know the pizza place exists. Their votes might not be helpful. This is why k matters — too many neighbors can drown out the signal.
The right number of “neighbors” to consult is almost always somewhere in the middle.
✅ Key Takeaways
- KNN classifies by finding the k most similar training examples and taking a majority vote. No training phase — it just memorizes.
- The standard sklearn API pattern is: import → create → fit → predict → score. This works for every algorithm.
- k too small (k=1): overfits, sensitive to noise. k too large: underfits, ignores local patterns. The sweet spot is usually between 3 and 15.
- Always set a random_state in train_test_split to get reproducible results.
- KNN requires feature scaling (see Module 3) because it’s purely distance-based: features with larger numeric ranges will dominate the distance calculation and bias the results.
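A quick demonstration of that last point, using sklearn’s built-in wine dataset, whose features span very different numeric ranges (Iris features are all small centimeter values, so scaling matters less there). This is a sketch, not part of the module’s required code:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Wine features range from fractions of a unit up to the thousands,
# so unscaled distances are dominated by the largest-valued feature.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42)

unscaled = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scaled = make_pipeline(
    StandardScaler(),                      # scale features first...
    KNeighborsClassifier(n_neighbors=5),   # ...then measure distances
).fit(X_train, y_train)

print(f"Unscaled: {unscaled.score(X_test, y_test):.2f}")
print(f"Scaled:   {scaled.score(X_test, y_test):.2f}")
```

Wrapping the scaler and the classifier in a `Pipeline` also guarantees the scaler is fit only on training data, avoiding leakage from the test set.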
🎉 Module 4 Complete!
You’ve trained your first real ML model! Next, we’ll explore a completely different approach — one that makes decisions by asking yes/no questions.