Linear Regression
Predicting numbers instead of categories — “how much?” instead of “which one?”
📌 Before You Start
- Modules 1–5 completed
- Comfortable with the sklearn API from Module 4
- Remember y = mx + b from algebra? That’s the core formula here.
Estimated time: ~50 minutes
What you’ll learn: Regression vs classification, the line of best fit, Mean Squared Error (MSE), and R² score.
💡 The Big Idea
Everything you’ve done so far has predicted a category (setosa? spam? promoted?). Regression predicts a number: “What will the house sell for?” “How many points will this student score?”
Linear regression finds the line of best fit through your data points — the line that minimizes the total squared vertical distance (error) between itself and the data points. Once you have that line, any input value maps directly to a predicted output value.
The formula is the same one from algebra class: y = mx + b. The model learns the slope (m) and intercept (b) from your training data.
🧠 How It Works
The Formula
In a studying example, the line is: predicted score = m × hours studied + b, where m = slope (how much the score increases per hour studied) and b = intercept (the predicted score at 0 hours).
Linear regression finds the values of m and b that minimize the sum of squared errors across all training examples. sklearn does all of this for you in a single .fit() call.
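A minimal sketch of that single `.fit()` call, using a small hypothetical hours-studied → exam-score dataset (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours studied -> exam score
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])  # sklearn expects a 2D feature array
y = np.array([52, 58, 61, 67, 72, 78, 83, 89])

model = LinearRegression()
model.fit(X, y)  # learns m and b by minimizing the sum of squared errors

m = model.coef_[0]    # slope: extra points per hour studied
b = model.intercept_  # predicted score at 0 hours
print(f"score ≈ {m:.2f} * hours + {b:.2f}")
```

After fitting, the learned slope lives in `model.coef_` and the intercept in `model.intercept_` — reading them back is how you recover the y = mx + b formula the model found.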
How to Measure Success in Regression
You can’t use accuracy for regression (there’s no right/wrong, just “how close?”). Two key metrics:
MSE — Mean Squared Error
Average of (actual − predicted)² across all samples. Lower is better. Units are squared (e.g., points²), which can be hard to interpret.
RMSE = √MSE brings it back to original units (e.g., points).
R² Score (R-squared)
How much of the variance in y does your model explain? It typically falls between 0 and 1 — and it can go negative when a model performs worse than simply predicting the mean.
R² = 1.0: perfect fit. R² = 0.8: explains 80% of variance. R² = 0.0: no better than predicting the mean.
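Both metrics are one function call each in sklearn. Here's a small sketch with made-up actual vs. predicted scores, just to show the mechanics:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual vs. predicted exam scores
y_true = np.array([52, 61, 72, 83])
y_pred = np.array([50, 63, 70, 86])

mse = mean_squared_error(y_true, y_pred)  # average of (actual - predicted)², in points²
rmse = np.sqrt(mse)                       # back in original units (points)
r2 = r2_score(y_true, y_pred)             # fraction of variance explained

print(f"MSE: {mse:.2f} points²  |  RMSE: {rmse:.2f} points  |  R²: {r2:.3f}")
```

Notice that RMSE is the one you can read directly: "the model is off by about this many points on average."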
Classification vs Regression: Quick Comparison
| Aspect | Classification | Regression |
|---|---|---|
| Output | A category (class) | A number (continuous) |
| Example question | Spam or not? | What’s the sale price? |
| Key metric | Accuracy, F1 | MSE, R² |
| sklearn import | KNeighborsClassifier | LinearRegression |
▶️ See It In Code
A complete linear regression pipeline: fit the line, interpret the formula, evaluate with MSE and R², make a prediction.
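A sketch of what that pipeline looks like end to end, assuming a synthetic hours-studied → exam-score dataset (generated here so the example is self-contained):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: score ≈ 5 * hours + 45, plus a little noise
rng = np.random.default_rng(42)
hours = rng.uniform(0, 10, size=60).reshape(-1, 1)
scores = 5 * hours.ravel() + 45 + rng.normal(0, 3, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    hours, scores, test_size=0.25, random_state=42
)

# 1. Fit the line
model = LinearRegression().fit(X_train, y_train)
print(f"Learned line: score ≈ {model.coef_[0]:.2f} * hours + {model.intercept_:.2f}")

# 2. Evaluate with MSE, RMSE, and R² on held-out data
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse:.2f}  |  RMSE: {np.sqrt(mse):.2f} points")
print(f"R²: {r2_score(y_test, y_pred):.3f}")

# 3. Make a prediction for a new student
print(f"Predicted score for 6.5 hours: {model.predict([[6.5]])[0]:.1f}")
```

Because the data was generated from a nearly linear relationship, the learned slope and intercept land close to the true values — with real data, expect more noise in both.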
👋 Your Turn
Predict the score for someone who studies 12 hours. Run the code, look at the result, and then answer: Does a score above 100 make sense? What does this tell you about using regression to predict values outside the training range?
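If you don't have the module's notebook at hand, here's a self-contained starter sketch (reusing the hypothetical hours/scores data from earlier) — run it and see what comes out before answering:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same hypothetical training data: hours studied (1–8) -> exam score
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([52, 58, 61, 67, 72, 78, 83, 89])

model = LinearRegression().fit(X, y)

# 12 hours is well outside the 1–8 hour training range — this is extrapolation
prediction = model.predict([[12]])[0]
print(f"Predicted score for 12 hours: {prediction:.1f}")
```

The model happily extends the line past the data it has seen; whether that extension means anything is exactly the question the exercise is asking.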
☕ Brain Break — 2 Minutes
Think about something in your life that follows a roughly linear pattern:
- The more hours you practice piano, the better you get
- The more miles you drive, the more gas you use
- The more sleep you get, the better you perform on tests
Now think: where does the linearity break down? You can’t sleep 20 hours and get infinite performance. You can’t study 30 hours straight without diminishing returns.
Real-world relationships are often only approximately linear within a certain range. This is the most important limitation of linear regression — and understanding it makes you a better data scientist.
✅ Key Takeaways
- Regression predicts a continuous number. Classification predicts a category. Choose based on your target variable.
- Linear regression finds the line y = mx + b that minimizes total squared prediction errors.
- MSE measures average squared error (lower is better). RMSE gives error in original units.
- R² tells you what fraction of variance the model explains. R² = 0.9 means 90% of the target variation is captured by your features.
- Be careful with extrapolation — predicting outside the training range is unreliable and may give physically impossible results.
🎉 Module 6 Complete!
You can now build both classifiers and regression models. In Module 7, we’ll go deep on evaluation — and learn why accuracy alone can be dangerously misleading.