Learn Without Walls
Module 6 of 8 — Machine Learning Basics

Linear Regression

Predicting numbers instead of categories — “how much?” instead of “which one?”

← Module 5: Decision Trees Module 6 of 8 Module 7: Evaluating Models →

📌 Before You Start

Estimated time: ~50 minutes

What you’ll learn: Regression vs classification, the line of best fit, Mean Squared Error (MSE), and R² score.

💡 The Big Idea

Everything you’ve done so far has predicted a category (setosa? spam? promoted?). Regression predicts a number: “What will the house sell for?” “How many points will this student score?”

Linear regression finds the line of best fit through your data points — the line that minimizes the sum of squared vertical distances (errors) between itself and the data points. Once you have that line, any input value maps directly to a predicted output value.

The formula is the same one from algebra class: y = mx + b. The model learns the slope (m) and intercept (b) from your training data.

🧠 How It Works

The Formula

score = m × hours + b

m = slope (how much score increases per hour studied)  |  b = intercept (predicted score at 0 hours)

Linear regression finds the values of m and b that minimize the sum of squared errors across all training examples. Sklearn does all of this for you in one .fit() call.
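You can see where m and b come from without sklearn. The sketch below (toy data, not the lesson's dataset) uses the closed-form least-squares solution — m = cov(x, y) / var(x), b = mean(y) − m · mean(x) — and checks that sklearn's .fit() lands on the same line:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy dataset: hours studied vs exam score
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([55.0, 60.0, 66.0, 69.0, 75.0])

# Closed-form least squares: m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
m = np.cov(hours, scores, bias=True)[0, 1] / np.var(hours)
b = scores.mean() - m * hours.mean()

# sklearn finds the same line in one .fit() call
model = LinearRegression().fit(hours.reshape(-1, 1), scores)

print(f"manual:  m = {m:.3f}, b = {b:.3f}")
print(f"sklearn: m = {model.coef_[0]:.3f}, b = {model.intercept_:.3f}")
```

For this toy data both give m = 4.9 and b = 50.3 — the formula and the library agree because ordinary least squares has a unique best-fit line.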

How to Measure Success in Regression

You can’t use accuracy for regression (there’s no right/wrong, just “how close?”). Two key metrics:

MSE — Mean Squared Error

Average of (actual − predicted)² across all samples. Lower is better. Units are squared (e.g., points²), which can be hard to interpret.

RMSE = √MSE brings it back to original units (e.g., points).
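The MSE/RMSE arithmetic is small enough to do by hand. A quick sketch with three made-up actual/predicted pairs, verified against sklearn's mean_squared_error:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted exam scores
actual = np.array([70.0, 80.0, 90.0])
predicted = np.array([68.0, 83.0, 89.0])

# MSE: mean of squared errors → (2² + 3² + 1²) / 3 = 14 / 3
mse = np.mean((actual - predicted) ** 2)
rmse = np.sqrt(mse)  # back in original units (score points)

print(f"MSE:  {mse:.3f}")   # 4.667 (points²)
print(f"RMSE: {rmse:.3f}")  # 2.160 (points)

# sklearn computes the same value
assert np.isclose(mse, mean_squared_error(actual, predicted))
```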

R² Score (R-squared)

How much of the variance in y does your model explain? Scale: typically 0 to 1 — and it can go negative when a model fits worse than simply predicting the mean.

R² = 1.0: perfect fit. R² = 0.8: explains 80% of variance. R² = 0.0: no better than predicting the mean.
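R² is just one subtraction away from two sums of squares: R² = 1 − (residual sum of squares / total sum of squares). A sketch with made-up numbers, checked against sklearn's r2_score, including the "predict the mean → R² = 0" edge case:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual values and model predictions
y = np.array([52.0, 60.0, 68.0, 76.0, 84.0])
pred = np.array([54.0, 59.0, 69.0, 75.0, 83.0])

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - pred) ** 2)          # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)      # total variation in y
r2 = 1 - ss_res / ss_tot

print(f"manual R²:  {r2:.4f}")
print(f"sklearn R²: {r2_score(y, pred):.4f}")

# Predicting the mean for every sample gives R² = 0 by definition
assert np.isclose(r2_score(y, np.full_like(y, y.mean())), 0.0)
```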

Classification vs Regression: Quick Comparison

Aspect           | Classification       | Regression
Output           | A category (class)   | A number (continuous)
Example question | Spam or not?         | What's the sale price?
Key metric       | Accuracy, F1         | MSE, R²
sklearn import   | KNeighborsClassifier | LinearRegression
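To make the comparison concrete, here is a sketch (toy data, not from the lesson) that fits both model types on the same inputs — notice that only the type of the target changes:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [8], [9], [10]])  # hours studied

# Classification target: pass (1) / fail (0) — a category
passed = np.array([0, 0, 0, 1, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, passed)
print("classifier predicts:", clf.predict([[2.5]]))  # → [0] (fail)

# Regression target: exam score — a continuous number
scores = np.array([52, 58, 64, 87, 90, 95])
reg = LinearRegression().fit(X, scores)
print("regressor predicts:", reg.predict([[2.5]]))  # a number, not a class
```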

▶️ See It In Code

A complete linear regression pipeline: fit the line, interpret the formula, evaluate with MSE and R², make a prediction.

import micropip
await micropip.install(['scikit-learn'])

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Study hours → exam score dataset
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
scores = np.array([52, 58, 64, 70, 73, 78, 82, 87, 90, 95])

# Fit the model
model = LinearRegression()
model.fit(hours, scores)

# Interpret the formula
print("=== The Learned Formula ===")
print(f"  Slope (m):     {model.coef_[0]:.2f} points per hour")
print(f"  Intercept (b): {model.intercept_:.2f}")
print(f"  Formula: score = {model.coef_[0]:.2f} × hours + {model.intercept_:.2f}")

# Evaluate
predictions = model.predict(hours)
mse = mean_squared_error(scores, predictions)
rmse = mse ** 0.5
r2 = r2_score(scores, predictions)
print("\n=== Model Performance ===")
print(f"  MSE:  {mse:.2f} (average squared error in points²)")
print(f"  RMSE: {rmse:.2f} points (average error in original units)")
print(f"  R²:   {r2:.4f} → model explains {r2:.1%} of score variance")

# Show actual vs predicted for every sample
print("\n=== Actual vs Predicted ===")
print(f"{'Hours':>6} {'Actual':>8} {'Predicted':>10} {'Error':>8}")
print("-" * 40)
for h, actual, pred in zip(hours.flatten(), scores, predictions):
    error = actual - pred
    print(f"{h:>6} {actual:>8} {pred:>10.1f} {error:>+8.1f}")

# Make a new prediction
new_hours = np.array([[7.5]])
print(f"\n7.5 hours of study → predicted score: {model.predict(new_hours)[0]:.1f}")

👋 Your Turn

Predict the score for someone who studies 12 hours. Run the code, look at the result, and then answer: Does a score above 100 make sense? What does this tell you about using regression to predict values outside the training range?

💡 Hint: Linear regression always continues the line forever — it has no concept of a score cap at 100. Making predictions far outside your training range is called extrapolation, and it’s usually unreliable. The model only “knows” what it was shown.
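You can check the hint directly. This sketch refits the lesson's dataset and extrapolates to 12 hours — well beyond the 1–10 hour training range:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same study-hours dataset as the code cell above
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
scores = np.array([52, 58, 64, 70, 73, 78, 82, 87, 90, 95])

model = LinearRegression().fit(hours, scores)

# Extrapolating beyond the training range: the line just keeps going
pred_12 = model.predict(np.array([[12]]))[0]
print(f"12 hours → predicted score: {pred_12:.1f}")

# The model happily predicts above 100 — it has no concept of a score cap
assert pred_12 > 100
```

The prediction comes out just above 100, which is impossible for a real exam — a concrete demonstration of why extrapolation is unreliable.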

☕ Brain Break — 2 Minutes

Think about something in your life that follows a roughly linear pattern: hours slept → next-day energy, weeks of practice → skill level, pages read per day → books finished per year.

Now think: where does the linearity break down? You can’t sleep 20 hours and get infinite performance. You can’t study 30 hours straight without diminishing returns.

Real-world relationships are often only approximately linear within a certain range. This is the most important limitation of linear regression — and understanding it makes you a better data scientist.

✅ Key Takeaways

Regression predicts a number ("how much?"); classification predicts a category ("which one?").

Linear regression learns the slope (m) and intercept (b) of y = mx + b by minimizing the sum of squared errors.

Evaluate with MSE (lower is better), RMSE (same units as y), and R² (fraction of variance explained).

Beware of extrapolation: the line continues forever, even where the real relationship breaks down.

🎉 Module 6 Complete!

You can now build both classifiers and regression models. In Module 7, we’ll go deep on evaluation — and learn why accuracy alone can be dangerously misleading.

Continue to Module 7: Evaluating Models →
