Learn Without Walls

Lesson 3: PCA and Dimensionality Reduction

Estimated time: 45-55 minutes

Learning Objectives

- Center a data matrix and compute its sample covariance matrix.
- Find principal components via eigendecomposition of the covariance matrix.
- Compute and interpret the proportion of variance explained by each component.
- Project data onto the first k principal components.
- Relate PCA to the singular value decomposition (SVD).

The Problem: High-Dimensional Data

Real-world data often has many variables (features). With 100 features, you have data in R^100 -- impossible to visualize. PCA finds a small number of new axes that capture most of the variation in the data.

Principal Component Analysis (PCA): An orthogonal transformation that converts possibly correlated variables into linearly uncorrelated variables called principal components, ordered so that the first few retain most of the variation.

Step 1: Center the Data

Subtract the mean of each variable so the data is centered at the origin. If X is an n x p data matrix (n observations, p variables), compute the mean of each column and subtract it.

Worked Example 1: Centering

Data points in R^2: (1, 2), (3, 4), (5, 6). Means: x-bar = 3, y-bar = 4.

Centered data: (-2, -2), (0, 0), (2, 2).
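The centering step can be sketched in a few lines of NumPy, using the three points from the worked example:

```python
import numpy as np

# The three points from Worked Example 1, as rows of an n x p matrix.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Subtract each column's mean so the data is centered at the origin.
X_c = X - X.mean(axis=0)
print(X_c)
# Rows are (-2, -2), (0, 0), (2, 2), matching the worked example.
```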

Step 2: The Covariance Matrix

Sample Covariance Matrix: S = (1/(n-1)) X_c^T X_c, where X_c is the centered data matrix. S is a symmetric p x p matrix.

Diagonal entries of S = variances of each variable. Off-diagonal = covariances between pairs.

Worked Example 2

Centered data matrix X_c = [-2 -2; 0 0; 2 2].

Each entry of X_c^T X_c is a sum of products of centered coordinates: the diagonal entries are (-2)^2 + 0^2 + 2^2 = 8 each, and the off-diagonal entries are (-2)(-2) + (0)(0) + (2)(2) = 8. So X_c^T X_c = [8 8; 8 8].

S = (1/2)[8 8; 8 8] = [4 4; 4 4]. Both variables have variance 4 and covariance 4 (perfectly correlated).
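The same computation in NumPy, checked against the built-in covariance routine (`np.cov` centers internally, so it agrees with the manual formula):

```python
import numpy as np

X_c = np.array([[-2.0, -2.0],
                [0.0, 0.0],
                [2.0, 2.0]])
n = X_c.shape[0]

# S = (1/(n-1)) X_c^T X_c, the sample covariance matrix.
S = X_c.T @ X_c / (n - 1)
print(S)  # [[4. 4.], [4. 4.]]

# rowvar=False tells np.cov that variables are columns, not rows.
S2 = np.cov(X_c, rowvar=False)
```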

Step 3: Eigendecomposition

The principal components are the eigenvectors of S, and the eigenvalues tell you how much variance each component explains.

Worked Example 3: Finding Principal Components

S = [4 4; 4 4]. Characteristic polynomial: det(S - lambda*I) = (4 - lambda)^2 - 16 = lambda^2 - 8*lambda = lambda(lambda - 8) = 0, so lambda_1 = 8, lambda_2 = 0.

lambda_1 = 8: eigenvector (1/sqrt(2))(1, 1). This is PC1 -- the direction of maximum variance.

lambda_2 = 0: eigenvector (1/sqrt(2))(1, -1). This is PC2 -- zero variance in this direction (the data lies exactly on a line).

Variance explained by PC1: 8/(8+0) = 100%. We can reduce to 1 dimension with no information loss!
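This eigendecomposition can be verified numerically. `np.linalg.eigh` is the right routine for a symmetric matrix like S; it returns eigenvalues in ascending order with orthonormal eigenvectors (each determined only up to sign):

```python
import numpy as np

S = np.array([[4.0, 4.0],
              [4.0, 4.0]])

# eigh: eigendecomposition for symmetric (Hermitian) matrices.
eigvals, eigvecs = np.linalg.eigh(S)
print(eigvals)        # [0. 8.] -- ascending order

# Eigenvector for the largest eigenvalue = PC1.
pc1 = eigvecs[:, -1]  # proportional to (1, 1)/sqrt(2), up to sign
```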

Variance Explained

Proportion of Variance: The proportion of total variance explained by the k-th principal component is:

lambda_k / (lambda_1 + lambda_2 + ... + lambda_p)

The cumulative proportion of the first k components measures how much information is retained when projecting to k dimensions.

Worked Example 4: How Many Components?

A dataset with 4 variables has covariance eigenvalues: 10, 5, 2, 0.5. Total = 17.5.

PC1: 10/17.5 = 57.1%. PC2: 5/17.5 = 28.6%. Cumulative: 85.7%.

PC3: 2/17.5 = 11.4%. Cumulative: 97.1%. PC4: 2.9%.

Two components capture 85.7% of variance; three capture 97.1%. A common threshold is 90-95%, suggesting 3 components suffice here.
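The arithmetic from Worked Example 4 is easy to automate. A minimal sketch that computes the proportions, the cumulative sums, and the smallest k reaching a 95% threshold:

```python
import numpy as np

eigvals = np.array([10.0, 5.0, 2.0, 0.5])

prop = eigvals / eigvals.sum()   # proportion explained by each PC
cum = np.cumsum(prop)            # cumulative proportion
print(np.round(prop, 3))         # [0.571 0.286 0.114 0.029]
print(np.round(cum, 3))          # [0.571 0.857 0.971 1.   ]

# Smallest k whose cumulative proportion reaches 95%.
k = int(np.searchsorted(cum, 0.95) + 1)
print(k)                         # 3
```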

Step 4: Project the Data

To reduce from p to k dimensions, project each data point onto the first k principal components.

Projection: If W = [v1 | v2 | ... | vk] contains the first k eigenvectors as columns, the reduced data is Y = X_c * W. Each row of Y gives the k-dimensional representation of that data point.

Worked Example 5

From Example 3, project onto PC1 = (1/sqrt(2))(1, 1).

Point (-2, -2): score = (-2, -2) · (1/sqrt(2), 1/sqrt(2)) = -4/sqrt(2) = -2*sqrt(2).

Point (0, 0): score = 0. Point (2, 2): score = 2*sqrt(2).

1D representation: {-2*sqrt(2), 0, 2*sqrt(2)}. All the variation is captured on this one axis.
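The projection Y = X_c * W is a single matrix multiply. Here W has one column (k = 1), the PC1 direction from Worked Example 3:

```python
import numpy as np

X_c = np.array([[-2.0, -2.0],
                [0.0, 0.0],
                [2.0, 2.0]])

# W holds the first k eigenvectors as columns; here k = 1 and
# PC1 = (1, 1)/sqrt(2).
W = np.array([[1.0], [1.0]]) / np.sqrt(2)

Y = X_c @ W       # n x k matrix of scores along PC1
print(Y.ravel())  # approximately [-2.828, 0., 2.828] = (-2*sqrt(2), 0, 2*sqrt(2))
```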

Connection to SVD

PCA via SVD: If X_c = U Sigma V^T, then the principal components are the columns of V, and the singular values relate to eigenvalues of the covariance matrix by lambda_i = sigma_i^2 / (n-1).

In practice, SVD is the preferred numerical method for computing PCA.
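A quick check of the SVD route on the running example: the top row of V^T recovers PC1 (up to sign), and sigma_i^2 / (n-1) recovers the covariance eigenvalues 8 and 0.

```python
import numpy as np

X_c = np.array([[-2.0, -2.0],
                [0.0, 0.0],
                [2.0, 2.0]])
n = X_c.shape[0]

U, sigma, Vt = np.linalg.svd(X_c, full_matrices=False)

# Principal components are the columns of V, i.e. the rows of Vt.
print(Vt[0])                 # PC1, proportional to (1, 1)/sqrt(2) up to sign

# Covariance eigenvalues from the singular values.
eigvals = sigma**2 / (n - 1)
print(eigvals)               # [8. 0.]
```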

Check Your Understanding

1. Why do we center the data before PCA?

Answer: Centering ensures we are analyzing variance around the mean, not distance from the origin. Without centering, the first PC would point toward the mean, not the direction of maximum variation.

2. Eigenvalues of a covariance matrix are 6, 3, 1. What percent of variance does PC1 explain?

Answer: 6/(6+3+1) = 6/10 = 60%.

3. Why must the covariance matrix be symmetric and positive semi-definite?

Answer: Symmetric because cov(X,Y) = cov(Y,X). Positive semi-definite because variance is nonnegative: for any vector w, w^T S w = var(w^T X) ≥ 0. This guarantees all eigenvalues are nonnegative.

Key Takeaways

- PCA rotates centered data onto orthogonal axes (principal components) ordered by variance.
- The components are the eigenvectors of the sample covariance matrix S = (1/(n-1)) X_c^T X_c; the eigenvalues are the variances along them.
- The ratio lambda_k / (lambda_1 + ... + lambda_p) measures the variance each component explains; keep enough components to reach a chosen cumulative threshold (often 90-95%).
- Projecting with Y = X_c * W reduces the data to k dimensions; in practice, PCA is computed via the SVD of X_c.

Next Lesson

Computer Graphics Transformations.
