Linear Regression
Predicting numbers instead of categories — “how much?” instead of “which one?”
📌 Before You Start
- Modules 1–5 completed
- Comfortable with the sklearn API from Module 4
- Remember y = mx + b from algebra? That’s the core formula here.
Estimated time: ~50 minutes
What you’ll learn: Regression vs classification, the line of best fit, Mean Squared Error (MSE), and R² score.
💡 The Big Idea
Everything you’ve done so far has predicted a category (setosa? spam? promoted?). Regression predicts a number: “What will the house sell for?” “How many points will this student score?”
Linear regression finds the line of best fit through your data points — the line that minimizes the total squared vertical distance (error) between itself and the data points. Once you have that line, any input value maps directly to a predicted output value.
The formula is the same one from algebra class: y = mx + b. The model learns the slope (m) and intercept (b) from your training data.
🧠 How It Works
The Formula
In a studying example, the line is: predicted score = m × hours studied + b, where m = slope (how much the score increases per hour studied) and b = intercept (the predicted score at 0 hours).
Linear regression finds the values of m and b that minimize the sum of squared errors across all training examples. sklearn does all of this for you in a single .fit() call.
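A minimal sketch of that single `.fit()` call, using a small hypothetical hours-studied → exam-score dataset (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours studied -> exam score
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])  # sklearn expects a 2D feature array
y = np.array([52, 58, 61, 67, 72, 78, 83, 89])

model = LinearRegression()
model.fit(X, y)  # learns m and b by minimizing the sum of squared errors

m = model.coef_[0]    # slope: extra points per hour studied
b = model.intercept_  # predicted score at 0 hours
print(f"score ≈ {m:.2f} * hours + {b:.2f}")
```

After fitting, the learned slope lives in `model.coef_` and the intercept in `model.intercept_` — reading them back is how you recover the y = mx + b formula the model found.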
How to Measure Success in Regression
You can’t use accuracy for regression (there’s no right/wrong, just “how close?”). Two key metrics:
MSE — Mean Squared Error
Average of (actual − predicted)² across all samples. Lower is better. Units are squared (e.g., points²), which can be hard to interpret.
RMSE = √MSE brings it back to original units (e.g., points).
R² Score (R-squared)
How much of the variance in y does your model explain? It typically falls between 0 and 1 — and it can go negative when a model performs worse than simply predicting the mean.
R² = 1.0: perfect fit. R² = 0.8: explains 80% of variance. R² = 0.0: no better than predicting the mean.
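Both metrics are one function call each in sklearn. Here's a small sketch with made-up actual vs. predicted scores, just to show the mechanics:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual vs. predicted exam scores
y_true = np.array([52, 61, 72, 83])
y_pred = np.array([50, 63, 70, 86])

mse = mean_squared_error(y_true, y_pred)  # average of (actual - predicted)², in points²
rmse = np.sqrt(mse)                       # back in original units (points)
r2 = r2_score(y_true, y_pred)             # fraction of variance explained

print(f"MSE: {mse:.2f} points²  |  RMSE: {rmse:.2f} points  |  R²: {r2:.3f}")
```

Notice that RMSE is the one you can read directly: "the model is off by about this many points on average."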
Classification vs Regression: Quick Comparison
| Aspect | Classification | Regression |
|---|---|---|
| Output | A category (class) | A number (continuous) |
| Example question | Spam or not? | What’s the sale price? |
| Key metric | Accuracy, F1 | MSE, R² |
| sklearn import | KNeighborsClassifier | LinearRegression |
▶️ See It In Code
A complete linear regression pipeline: fit the line, interpret the formula, evaluate with MSE and R², make a prediction.
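A sketch of what that pipeline looks like end to end, assuming a synthetic hours-studied → exam-score dataset (generated here so the example is self-contained):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: score ≈ 5 * hours + 45, plus a little noise
rng = np.random.default_rng(42)
hours = rng.uniform(0, 10, size=60).reshape(-1, 1)
scores = 5 * hours.ravel() + 45 + rng.normal(0, 3, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    hours, scores, test_size=0.25, random_state=42
)

# 1. Fit the line
model = LinearRegression().fit(X_train, y_train)
print(f"Learned line: score ≈ {model.coef_[0]:.2f} * hours + {model.intercept_:.2f}")

# 2. Evaluate with MSE, RMSE, and R² on held-out data
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse:.2f}  |  RMSE: {np.sqrt(mse):.2f} points")
print(f"R²: {r2_score(y_test, y_pred):.3f}")

# 3. Make a prediction for a new student
print(f"Predicted score for 6.5 hours: {model.predict([[6.5]])[0]:.1f}")
```

Because the data was generated from a nearly linear relationship, the learned slope and intercept land close to the true values — with real data, expect more noise in both.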
👋 Your Turn
Predict the score for someone who studies 12 hours. Run the code, look at the result, and then answer: Does a score above 100 make sense? What does this tell you about using regression to predict values outside the training range?
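If you don't have the module's notebook at hand, here's a self-contained starter sketch (reusing the hypothetical hours/scores data from earlier) — run it and see what comes out before answering:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same hypothetical training data: hours studied (1–8) -> exam score
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([52, 58, 61, 67, 72, 78, 83, 89])

model = LinearRegression().fit(X, y)

# 12 hours is well outside the 1–8 hour training range — this is extrapolation
prediction = model.predict([[12]])[0]
print(f"Predicted score for 12 hours: {prediction:.1f}")
```

The model happily extends the line past the data it has seen; whether that extension means anything is exactly the question the exercise is asking.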
☕ Brain Break — 2 Minutes
Think about something in your life that follows a roughly linear pattern:
- The more hours you practice piano, the better you get
- The more miles you drive, the more gas you use
- The more sleep you get, the better you perform on tests
Now think: where does the linearity break down? You can’t sleep 20 hours and get infinite performance. You can’t study 30 hours straight without diminishing returns.
Real-world relationships are often only approximately linear within a certain range. This is the most important limitation of linear regression — and understanding it makes you a better data scientist.
✅ Key Takeaways
- Regression predicts a continuous number. Classification predicts a category. Choose based on your target variable.
- Linear regression finds the line y = mx + b that minimizes total squared prediction errors.
- MSE measures average squared error (lower is better). RMSE gives error in original units.
- R² tells you what fraction of variance the model explains. R² = 0.9 means 90% of the target variation is captured by your features.
- Be careful with extrapolation — predicting outside the training range is unreliable and may give physically impossible results.
🎉 Module 6 Complete!
You can now build both classifiers and regression models. In Module 7, we’ll go deep on evaluation — and learn why accuracy alone can be dangerously misleading.