Lesson 4: Least Squares Problems
Estimated time: 45-55 minutes
Learning Objectives
- Understand why overdetermined systems (more equations than unknowns) typically have no exact solution
- Derive and solve the normal equations A^T A x-hat = A^T b
- Apply least squares to find the best-fit line through data points
- Connect least squares to projection: x-hat minimizes ||b - Ax||
The Problem: Inconsistent Systems
When a system Ax = b has more equations than unknowns (overdetermined), there is usually no exact solution. Instead, we find the vector x-hat that makes Ax-hat as close to b as possible.
Least Squares Solution: x-hat minimizes ||b - Ax|| over all x. Equivalently, Ax-hat = proj_{col(A)}(b) -- the projection of b onto the column space of A.
The Normal Equations
The residual b - Ax-hat must be orthogonal to the column space of A. This means A^T(b - Ax-hat) = 0, which gives:
Normal Equations: A^T A x-hat = A^T b.
If A has linearly independent columns, then A^T A is invertible and x-hat = (A^T A)^{-1} A^T b.
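This recipe is easy to check numerically. Below is a minimal NumPy sketch on a small made-up 3x2 system (the matrix and vector here are illustrative, not from the examples below); it solves the normal equations directly and confirms the answer against NumPy's built-in least squares routine:

```python
import numpy as np

# A small made-up overdetermined system: 3 equations, 2 unknowns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([0.0, 1.0, 1.0])

# Normal equations: (A^T A) x_hat = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# NumPy's dedicated least squares solver gives the same answer.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
```

Note that `np.linalg.lstsq` is preferred in practice: it avoids forming A^T A, which can be poorly conditioned.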
Worked Example 1
Solve the least squares problem for Ax = b where A = [1 1; 1 2; 1 3], b = (1, 1, 3).
A^T A = [1 1 1; 1 2 3][1 1; 1 2; 1 3] = [3 6; 6 14].
A^T b = [1 1 1; 1 2 3](1, 1, 3)^T = [5; 12].
Solve [3 6; 6 14][x1; x2] = [5; 12]. Row 2 - 2*Row 1: 2x2 = 2, so x2 = 1. Substituting into row 1: 3x1 + 6(1) = 5, so x1 = -1/3.
Least squares solution: x-hat = (-1/3, 1).
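The hand computation above can be verified numerically; here is a short NumPy check of Example 1:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 1.0, 3.0])

AtA = A.T @ A                    # [[3, 6], [6, 14]]
Atb = A.T @ b                    # [5, 12]
x_hat = np.linalg.solve(AtA, Atb)
print(x_hat)                     # approximately [-0.3333, 1.0]
```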
Best-Fit Line
Given data points (t1, y1), ..., (tn, yn), the best-fit line y = c0 + c1*t minimizes the sum of squared residuals.
Setup: Let A = [1 t1; 1 t2; ...; 1 tn] and b = (y1, ..., yn). Solve A^T A x-hat = A^T b for x-hat = (c0, c1).
Worked Example 2: Best-Fit Line
Data: (1, 1), (2, 1), (3, 3). Find the best-fit line y = c0 + c1*t.
A = [1 1; 1 2; 1 3], b = (1, 1, 3). This is exactly Example 1!
x-hat = (-1/3, 1). Best-fit line: y = -1/3 + t.
Check predictions: t=1: y=2/3, t=2: y=5/3, t=3: y=8/3. Residuals: 1/3, -2/3, 1/3.
Sum of squared residuals: 1/9 + 4/9 + 1/9 = 6/9 = 2/3. No other line gives a smaller total.
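The best-fit-line setup translates directly to code. A sketch with NumPy, using the data from Example 2 (`np.column_stack` builds the design matrix: a column of ones for the intercept next to the t-values):

```python
import numpy as np

t = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 1.0, 3.0])

# Design matrix: column of ones (intercept) next to the t-values.
A = np.column_stack([np.ones_like(t), t])
(c0, c1), *_ = np.linalg.lstsq(A, y, rcond=None)

residuals = y - (c0 + c1 * t)
ssr = np.sum(residuals**2)       # 2/3, matching the hand computation
```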
Geometric Interpretation
The least squares solution projects b onto col(A). The residual e = b - Ax-hat is perpendicular to col(A).
Worked Example 3
From Example 1: Ax-hat = [1 1; 1 2; 1 3](-1/3, 1)^T = (2/3, 5/3, 8/3).
Residual: e = b - Ax-hat = (1/3, -2/3, 1/3).
Check e ⊥ col(A): A^T * e = [1 1 1; 1 2 3](1/3, -2/3, 1/3)^T = [0; 0]. Confirmed!
The least squares error: ||e|| = sqrt(1/9 + 4/9 + 1/9) = sqrt(6/9) = sqrt(6)/3.
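The orthogonality check and the error computation in Example 3 can be reproduced numerically (a NumPy sketch, reusing the data from Example 1):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 1.0, 3.0])
x_hat = np.array([-1/3, 1.0])

e = b - A @ x_hat                # residual (1/3, -2/3, 1/3)
print(A.T @ e)                   # [0, 0] up to rounding: e is perpendicular to col(A)
print(np.linalg.norm(e))         # sqrt(6)/3, approximately 0.8165
```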
Fitting Other Models
Least squares is not limited to lines. Any model that is linear in the parameters can be fit this way.
Worked Example 4: Best-Fit Parabola
Data: (0, 1), (1, 0), (2, 1), (3, 4). Fit y = c0 + c1*t + c2*t^2.
A = [1 0 0; 1 1 1; 1 2 4; 1 3 9], b = (1, 0, 1, 4).
A^T A = [4 6 14; 6 14 36; 14 36 98]. A^T b = [6; 14; 40].
Solving gives c0 = 1, c1 = -2, c2 = 1. In this case the fit is exact: all four data points lie on the parabola, so the residual is zero.
Best-fit parabola: y = 1 - 2t + t^2.
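A quick numerical check of the parabola fit (a NumPy sketch; `np.vander` with `increasing=True` builds the 1, t, t^2 columns of the design matrix):

```python
import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 0.0, 1.0, 4.0])

# Design matrix with columns 1, t, t^2.
A = np.vander(t, 3, increasing=True)
c, *_ = np.linalg.lstsq(A, y, rcond=None)
print(c)   # [1, -2, 1]: this parabola passes through all four points
```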
Connection to Statistics
Linear Regression: The least squares method is the mathematical foundation of linear regression in statistics. The normal equations A^T A x-hat = A^T b give the ordinary least squares (OLS) estimator.
In statistics notation: X^T X beta-hat = X^T y, where X is the design matrix and beta-hat contains the estimated coefficients.
Check Your Understanding
1. Write the normal equations for the least squares problem Ax = b.
2. Data points: (0, 2), (1, 1), (2, 4). Set up the design matrix A and vector b for the best-fit line y = c0 + c1*t.
3. Why must the residual b - Ax-hat be orthogonal to col(A)?
Key Takeaways
- Normal equations: A^T A x-hat = A^T b gives the least squares solution
- x-hat minimizes ||b - Ax|| (the sum of squared residuals)
- Best-fit line: A = [1 t1; ...; 1 tn], solve for (intercept, slope)
- Geometric view: Ax-hat = proj_{col(A)}(b), and b - Ax-hat is in col(A)-perp
- Can fit any model linear in parameters (lines, polynomials, etc.)
- This is the mathematical foundation of linear regression