Learn Without Walls

Module 07: Portfolio Theory is Just Optimization

Mean-variance optimization through the lens of constrained quadratic programming

Part 2 of 5 · Module 07 of 22

1. The Markowitz Framework: Mean-Variance Optimization

Modern Portfolio Theory (MPT), introduced by Harry Markowitz in 1952, formulates portfolio construction as a constrained quadratic optimization problem. The investor wants to choose portfolio weights to maximize expected return for a given level of risk, or equivalently, minimize risk for a given level of return.

Stats Bridge

If you've ever solved a penalized regression (Ridge, LASSO), you've solved a constrained optimization problem. Markowitz optimization is structurally analogous to Ridge regression: minimize a quadratic objective (w'Σw, playing the role of β'β) subject to a linear constraint on the solution (μ'w = target return). The entire machinery of Lagrange multipliers, KKT conditions, and quadratic programming applies directly.

1.1 The Setup

Consider N assets with:

- an expected return vector μ (N × 1),
- a covariance matrix Σ (N × N, symmetric and positive definite),
- a portfolio weight vector w (N × 1), where wi is the fraction of wealth invested in asset i.

The portfolio's expected return and variance are:

E[Rp] = w'μ        Var(Rp) = w'Σw
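Both quantities are one line of NumPy each. A toy three-asset example (all numbers hypothetical, purely to fix ideas):

```python
import numpy as np

# Hypothetical 3-asset universe: expected returns, covariance, weights
mu = np.array([0.08, 0.05, 0.03])
Sigma = np.array([[0.040, 0.006, 0.002],
                  [0.006, 0.010, 0.001],
                  [0.002, 0.001, 0.005]])
w = np.array([0.5, 0.3, 0.2])    # weights must sum to 1

port_ret = w @ mu                # E[Rp] = w'mu
port_var = w @ Sigma @ w         # Var(Rp) = w'Sigma w
port_vol = np.sqrt(port_var)     # standard deviation
```

Note that the portfolio variance mixes all pairwise covariances, which is where diversification (Section 5) comes from.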

1.2 The Optimization Problem

The standard formulation minimizes portfolio variance subject to constraints:

minw ½ w'Σw
subject to:   w'μ = μtarget,   w'1 = 1

The first constraint sets the target return; the second ensures the weights sum to 1 (fully invested). Additional constraints might include a long-only restriction (wi ≥ 0) or per-asset position limits (wi ≤ wmax).

Finance Term

Long-only constraint: wi ≥ 0 means you can only buy stocks, not sell short. Short selling (wi < 0) means borrowing a stock, selling it, and hoping to buy it back cheaper later. Many institutional mandates prohibit short selling.

2. The Efficient Frontier: Pareto Optimality in (σ, μ) Space

As we sweep the target return μtarget from its minimum to its maximum, each solution traces out a curve in (σ, μ) space. This curve is the efficient frontier — the set of portfolios that offer the maximum expected return for each level of risk.

Stats Bridge

The efficient frontier is precisely the Pareto frontier of a multi-objective optimization problem (minimize risk, maximize return). In statistics, you encounter Pareto frontiers in bias-variance tradeoffs: the set of estimators that achieve the minimum variance for each level of bias. The efficient frontier is the bias-variance tradeoff of investing.

2.1 The Analytical Solution (No Inequality Constraints)

Without inequality constraints (allowing short selling), the problem is a quadratic program with only equality constraints, and the solution has a closed form via Lagrange multipliers:

L(w, λ1, λ2) = ½ w'Σw − λ1(w'μ − μtarget) − λ2(w'1 − 1)

Taking the gradient and setting it to zero:

∇w L = Σw − λ1μ − λ21 = 0
⇒ w* = Σ−1(λ1μ + λ21)

Substituting back into the constraints gives a 2×2 system in (λ1, λ2) that can be solved in closed form. The resulting efficient frontier is a hyperbola in (σ, μ) space.

2.2 Key Constants

Define the following scalars (which appear throughout portfolio theory):

Constant   Formula    Interpretation
A          1'Σ−1μ     How much return the “inverse-variance weighted” portfolio earns
B          μ'Σ−1μ     Quadratic form of returns in the precision matrix
C          1'Σ−11     Total “precision” of the asset universe
D          BC − A²    Discriminant (must be > 0)

The minimum variance for target return μtarget is:

σ²min = (C ⋅ μtarget² − 2A ⋅ μtarget + B) / D
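The constants and the closed form are easy to verify numerically. The sketch below (toy μ and Σ, not from the text) solves the 2×2 system in (λ1, λ2) and checks that the hyperbola formula agrees with direct evaluation of w*'Σw*:

```python
import numpy as np

# Toy inputs (hypothetical, for illustration only)
mu = np.array([0.10, 0.06, 0.04])
Sigma = np.array([[0.09, 0.02, 0.01],
                  [0.02, 0.04, 0.01],
                  [0.01, 0.01, 0.02]])
ones = np.ones(3)
inv = np.linalg.inv(Sigma)

A = ones @ inv @ mu      # 1' Sigma^-1 mu
B = mu @ inv @ mu        # mu' Sigma^-1 mu
C = ones @ inv @ ones    # 1' Sigma^-1 1
D = B * C - A ** 2       # discriminant

mu_t = 0.07              # target return

# The two constraints give  [B A; A C] [lam1, lam2]' = [mu_t, 1]'
lam1, lam2 = np.linalg.solve(np.array([[B, A], [A, C]]),
                             np.array([mu_t, 1.0]))
w_star = inv @ (lam1 * mu + lam2 * ones)

var_closed = (C * mu_t**2 - 2 * A * mu_t + B) / D   # hyperbola formula
var_direct = w_star @ Sigma @ w_star                 # direct evaluation
```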

3. The Covariance Matrix Estimation Problem

Here is where the statistician's alarm bells should start ringing. The optimization requires two inputs: the expected return vector μ and the covariance matrix Σ. Both must be estimated from data, and the estimation is hard.

3.1 The Dimensionality Problem

For N assets, the covariance matrix has N(N+1)/2 unique parameters (it's symmetric). The sample covariance matrix is:

Σ̂ = (1/(T−1)) ∑t=1…T (rt − r̄)(rt − r̄)'

N (assets)   Parameters to estimate   T needed (10× oversampling)
10           55                       550
50           1,275                    12,750
100          5,050                    50,500
500          125,250                  1,252,500
3,000        4,501,500                45,015,000
Common Pitfall

With 500 stocks and 5 years of daily data (~1,250 observations), you're estimating 125,250 parameters from 1,250 observations. The sample covariance matrix is singular when T < N, and severely ill-conditioned when T is only slightly larger than N. Plugging this directly into Markowitz optimization produces garbage.

Stats Bridge

This is exactly the p >> n problem from high-dimensional statistics. The sample covariance matrix is the MLE, but it's a terrible estimator in high dimensions. The eigenvalues of Σ̂ are systematically dispersed relative to the true eigenvalues (the Marchenko-Pastur law describes this). This is why shrinkage, factor models, and regularization are essential in portfolio optimization — for the same reason they're essential in high-dimensional regression.
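A few lines of simulation make the T < N degeneracy concrete (synthetic returns, dimensions chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 50, 30                            # more assets than observations
R = 0.01 * rng.standard_normal((T, N))   # synthetic daily returns

S = np.cov(R, rowvar=False)              # N x N sample covariance
rank = np.linalg.matrix_rank(S)          # at most T - 1 (demeaning costs one)
smallest_eig = np.linalg.eigvalsh(S)[0]  # numerically zero: S is singular
```

With rank T − 1 < N, the matrix has zero eigenvalues, Σ̂⁻¹ does not exist, and any optimizer that needs it will fail or return nonsense.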

3.2 Shrinkage Estimators: The Ledoit-Wolf Approach

Ledoit and Wolf (2004) proposed shrinking the sample covariance matrix toward a structured target:

Σ̂shrink = δ ⋅ F + (1 − δ) ⋅ Σ̂

where F is a structured target (e.g., the identity matrix scaled by the average variance, or a single-factor model covariance) and δ ∈ [0, 1] is the optimal shrinkage intensity, determined by minimizing the expected Frobenius loss:

δ* = argminδ E[ ||δF + (1−δ)Σ̂ − Σ||F2 ]
Python

import numpy as np
from sklearn.covariance import LedoitWolf
import yfinance as yf

# Download data for a basket of stocks
tickers = ["AAPL", "MSFT", "GOOGL", "AMZN", "META",
           "JPM", "BAC", "WFC", "GS", "MS",
           "JNJ", "PFE", "UNH", "MRK", "ABBV",
           "XOM", "CVX", "COP", "SLB", "EOG"]

data = yf.download(tickers, start="2020-01-01", end="2024-01-01")["Adj Close"]
returns = np.log(data / data.shift(1)).dropna()

# Sample covariance matrix
sample_cov = returns.cov().values

# Ledoit-Wolf shrinkage
lw = LedoitWolf().fit(returns.values)
shrunk_cov = lw.covariance_
shrinkage_intensity = lw.shrinkage_

print(f"Number of assets: {len(tickers)}")
print(f"Number of observations: {len(returns)}")
print(f"Parameters in covariance: {len(tickers)*(len(tickers)+1)//2}")
print(f"Ledoit-Wolf shrinkage intensity: {shrinkage_intensity:.4f}")

# Compare condition numbers
cond_sample = np.linalg.cond(sample_cov)
cond_shrunk = np.linalg.cond(shrunk_cov)
print(f"\nCondition number (sample):  {cond_sample:.1f}")
print(f"Condition number (shrunk):  {cond_shrunk:.1f}")
print(f"Reduction factor:           {cond_sample/cond_shrunk:.1f}x")
Key Insight

The shrinkage intensity δ is the analogue of the regularization parameter λ in Ridge regression. High δ means the data is unreliable and we should lean on the prior structure; low δ means the data is informative. The Ledoit-Wolf formula for δ* is an analytical cross-validation formula — no grid search needed.
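If δ really behaves like a regularization parameter, it should shrink toward zero as the sample grows. A quick simulation checks this, using a synthetic equicorrelated covariance (ρ, σ, and the sample sizes here are all assumed for illustration):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(42)
N = 20
rho, sigma = 0.3, 0.02   # hypothetical common correlation and daily vol
true_cov = sigma**2 * ((1 - rho) * np.eye(N) + rho * np.ones((N, N)))
L = np.linalg.cholesky(true_cov)

# Fit Ledoit-Wolf at increasing sample sizes and record the intensity
deltas = {}
for T in (50, 500, 5000):
    X = rng.standard_normal((T, N)) @ L.T   # T draws with covariance true_cov
    deltas[T] = LedoitWolf().fit(X).shrinkage_
```

With little data the estimator leans hard on the structured target; with abundant data it trusts the sample covariance — the same qualitative behavior as λ chosen by cross-validation in Ridge.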

4. The Global Minimum Variance Portfolio

The simplest Markowitz portfolio ignores expected returns entirely and just minimizes variance:

minw w'Σw     subject to: w'1 = 1

The analytical solution is:

wGMV = Σ−11 / (1'Σ−11)
Stats Bridge

The GMV portfolio weights are proportional to the row sums of the precision matrix (inverse covariance). Assets with low correlation to the rest of the universe receive higher weights. This is analogous to the optimal weighting in meta-analysis, where studies are weighted by the inverse of their variance — more precise estimates get more weight.

4.1 Why the GMV Portfolio Is Appealing

Expected returns are notoriously difficult to estimate. A classic result (Merton, 1980) shows that with T observations, the standard error of the estimated mean return is σ/√T, while the standard error of estimated volatility is approximately σ/√(2T). Matching the mean's standard error to the volatility's therefore takes twice as much data — and because the true mean is an order of magnitude smaller than σ, the mean's relative estimation error is far worse.

Parameter       Standard Error   With T=250 (1 year), σ=20%
Mean return μ   σ / √T           1.27% (vs ~8% true mean)
Volatility σ    σ / √(2T)        0.89% (vs 20% true vol)

The mean is estimated with a standard error that is a significant fraction of the signal, while volatility is estimated much more precisely. The GMV portfolio sidesteps the hardest estimation problem entirely.
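A Monte Carlo check of Merton's result — simulate many independent one-year samples (normal returns, σ = 20%, T = 250; the simulation sizes are arbitrary) and compare the empirical standard errors to the formulas:

```python
import numpy as np

rng = np.random.default_rng(7)
T, sigma, n_sims = 250, 0.20, 20000

# n_sims independent samples of T returns each
X = rng.normal(0.0, sigma, size=(n_sims, T))
means = X.mean(axis=1)
vols = X.std(axis=1, ddof=1)

se_mean_mc = means.std()                 # empirical SE of the sample mean
se_vol_mc = vols.std()                   # empirical SE of the sample vol
se_mean_theory = sigma / np.sqrt(T)      # sigma / sqrt(T)
se_vol_theory = sigma / np.sqrt(2 * T)   # sigma / sqrt(2T)
```

The volatility SE comes out √2 smaller per observation, exactly as the table states.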

Python

def global_minimum_variance(cov_matrix):
    """Compute the Global Minimum Variance portfolio weights."""
    n = cov_matrix.shape[0]
    ones = np.ones(n)
    inv_cov = np.linalg.inv(cov_matrix)
    w = inv_cov @ ones / (ones @ inv_cov @ ones)
    return w

# Using the shrunk covariance matrix
w_gmv = global_minimum_variance(shrunk_cov)

print("Global Minimum Variance Portfolio Weights:")
for ticker, weight in zip(tickers, w_gmv):
    print(f"  {ticker:5s}: {weight:>7.2%}")

# Portfolio statistics
gmv_return = (returns.mean().values @ w_gmv) * 252
gmv_vol = np.sqrt(w_gmv @ shrunk_cov @ w_gmv) * np.sqrt(252)
print(f"\nAnnualized Return:     {gmv_return:.4f} ({gmv_return*100:.2f}%)")
print(f"Annualized Volatility: {gmv_vol:.4f} ({gmv_vol*100:.2f}%)")

5. Diversification: Variance Reduction Through Averaging

The fundamental insight of portfolio theory is that combining assets reduces risk. The portfolio variance formula reveals exactly how:

Var(Rp) = ∑i wi² σi² + ∑i ∑j≠i wi wj σi σj ρij

5.1 The Equal-Weight Case

For N assets with equal weights wi = 1/N, equal variances σ2, and equal pairwise correlations ρ:

Var(Rp) = (1/N) ⋅ σ² + (1 − 1/N) ⋅ ρ ⋅ σ²

As N → ∞:

Var(Rp) → ρ ⋅ σ²
Stats Bridge

This is identical to averaging correlated estimators. If you average N estimators that each have variance σ2 and pairwise correlation ρ, the variance of the average converges to ρσ2, not zero. The irreducible ρσ2 is systematic risk — it cannot be diversified away. The (1/N)σ2 that vanishes is idiosyncratic risk. In statistics terms: systematic risk is bias, idiosyncratic risk is variance.
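The limit can be checked directly against an equicorrelated covariance matrix (σ and ρ below are assumed values, not estimates from the text's data):

```python
import numpy as np

sigma, rho = 0.25, 0.30   # assumed common vol and pairwise correlation

def ew_variance(N):
    """Equal-weight portfolio variance under an equicorrelated covariance."""
    Sigma = sigma**2 * ((1 - rho) * np.eye(N) + rho * np.ones((N, N)))
    w = np.ones(N) / N
    return w @ Sigma @ w

def closed_form(N):
    # (1/N) sigma^2  +  (1 - 1/N) rho sigma^2
    return (1 / N) * sigma**2 + (1 - 1 / N) * rho * sigma**2

systematic_floor = rho * sigma**2   # the N -> infinity limit
```

The idiosyncratic term decays like 1/N; the ρσ² floor never moves, no matter how many assets you add.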

Python

import matplotlib.pyplot as plt
import pandas as pd  # needed below for the results DataFrame

def diversification_curve(returns_df, n_portfolios=1000):
    """Show how portfolio volatility decreases with number of assets."""
    tickers = returns_df.columns.tolist()
    n_assets = len(tickers)
    results = []

    for n in range(1, n_assets + 1):
        vols = []
        for _ in range(n_portfolios if n < n_assets else 1):
            # Random subset of n assets, equal weight
            subset = np.random.choice(tickers, n, replace=False)
            w = np.ones(n) / n
            sub_returns = returns_df[subset].values
            port_vol = np.sqrt(
                w @ np.cov(sub_returns, rowvar=False) @ w
            ) * np.sqrt(252)
            vols.append(port_vol)
        results.append({
            'n': n,
            'mean_vol': np.mean(vols),
            'min_vol': np.min(vols),
            'max_vol': np.max(vols)
        })

    df = pd.DataFrame(results)

    plt.figure(figsize=(10, 6))
    plt.fill_between(df['n'], df['min_vol']*100, df['max_vol']*100,
                     alpha=0.2, color='#1a365d')
    plt.plot(df['n'], df['mean_vol']*100, 'o-', color='#1a365d',
             linewidth=2, markersize=4)
    plt.xlabel("Number of Assets in Portfolio")
    plt.ylabel("Annualized Volatility (%)")
    plt.title("Diversification: Volatility vs Number of Assets")
    plt.grid(True, alpha=0.3)

    # Add theoretical limit
    avg_corr = returns_df.corr().values[np.triu_indices(n_assets, k=1)].mean()
    avg_vol = returns_df.std().mean() * np.sqrt(252)
    systematic_vol = np.sqrt(avg_corr) * avg_vol * 100
    plt.axhline(y=systematic_vol, color='#e53e3e', linestyle='--',
                label=f'Systematic risk floor ({systematic_vol:.1f}%)')
    plt.legend()

    plt.tight_layout()
    plt.savefig("diversification_curve.png", dpi=150, bbox_inches='tight')
    plt.show()

diversification_curve(returns)

6. Building the Efficient Frontier in Python

Now let's build the efficient frontier numerically using scipy.optimize. This handles the general case with inequality constraints (long-only, position limits).

Python

import numpy as np
import pandas as pd
from scipy.optimize import minimize
import matplotlib.pyplot as plt

# ── Setup ──────────────────────────────────────────────────
mu = returns.mean().values * 252          # annualized expected returns
Sigma = shrunk_cov * 252                  # annualized covariance
n_assets = len(tickers)

# ── Portfolio optimization functions ───────────────────────
def portfolio_vol(w, Sigma):
    """Portfolio standard deviation."""
    return np.sqrt(w @ Sigma @ w)

def portfolio_ret(w, mu):
    """Portfolio expected return."""
    return w @ mu

def neg_sharpe(w, mu, Sigma, rf=0.02):
    """Negative Sharpe ratio (for minimization)."""
    ret = w @ mu
    vol = np.sqrt(w @ Sigma @ w)
    return -(ret - rf) / vol

# ── Constraints and bounds ─────────────────────────────────
constraints = [
    {'type': 'eq', 'fun': lambda w: np.sum(w) - 1}  # fully invested
]
bounds = [(0, 0.25)] * n_assets  # long-only, max 25% per asset

# ── Compute efficient frontier ─────────────────────────────
# Step 1: Find the feasible return range
w0 = np.ones(n_assets) / n_assets  # starting point

# Minimum variance portfolio
res_min = minimize(
    portfolio_vol, w0, args=(Sigma,),
    method='SLSQP', bounds=bounds, constraints=constraints
)
w_min_var = res_min.x
min_ret = portfolio_ret(w_min_var, mu)
min_vol = portfolio_vol(w_min_var, Sigma)

# Maximum return portfolio
res_max = minimize(
    lambda w: -portfolio_ret(w, mu),
    w0, method='SLSQP', bounds=bounds, constraints=constraints
)
max_ret = portfolio_ret(res_max.x, mu)

# Step 2: Sweep target returns across the feasible range
target_returns = np.linspace(min_ret, max_ret * 0.99, 50)
frontier_vols = []
frontier_weights = []

for target in target_returns:
    cons = [
        {'type': 'eq', 'fun': lambda w: np.sum(w) - 1},
        {'type': 'eq', 'fun': lambda w, t=target: w @ mu - t}
    ]
    res = minimize(
        portfolio_vol, w0, args=(Sigma,),
        method='SLSQP', bounds=bounds, constraints=cons
    )
    frontier_vols.append(portfolio_vol(res.x, Sigma))
    frontier_weights.append(res.x)

frontier_vols = np.array(frontier_vols)

# Step 3: Find the maximum Sharpe ratio portfolio
res_sharpe = minimize(
    neg_sharpe, w0, args=(mu, Sigma, 0.02),
    method='SLSQP', bounds=bounds, constraints=constraints
)
w_sharpe = res_sharpe.x
sharpe_ret = portfolio_ret(w_sharpe, mu)
sharpe_vol = portfolio_vol(w_sharpe, Sigma)

# ── Plot ───────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(10, 7))

# Individual assets
for i, ticker in enumerate(tickers):
    vol_i = np.sqrt(Sigma[i, i])
    ax.scatter(vol_i*100, mu[i]*100, s=40, zorder=5, color='gray')
    ax.annotate(ticker, (vol_i*100, mu[i]*100),
                fontsize=7, ha='left', va='bottom')

# Efficient frontier
ax.plot(frontier_vols*100, target_returns*100,
        'b-', linewidth=2.5, label='Efficient Frontier')

# Special portfolios
ax.scatter(min_vol*100, min_ret*100, s=150, marker='*',
           color='green', zorder=10, label='Min Variance')
ax.scatter(sharpe_vol*100, sharpe_ret*100, s=150, marker='*',
           color='red', zorder=10, label='Max Sharpe')

# Capital Market Line
rf = 2.0
cml_x = np.linspace(0, max(frontier_vols)*100*1.1, 100)
sharpe_ratio = (sharpe_ret*100 - rf) / (sharpe_vol*100)
cml_y = rf + sharpe_ratio * cml_x
ax.plot(cml_x, cml_y, 'r--', alpha=0.5, label='Capital Market Line')

ax.set_xlabel("Annualized Volatility (%)")
ax.set_ylabel("Annualized Expected Return (%)")
ax.set_title("Efficient Frontier with Long-Only Constraints")
ax.legend(loc='upper left')
ax.grid(True, alpha=0.3)
ax.set_xlim(left=0)

plt.tight_layout()
plt.savefig("efficient_frontier.png", dpi=150, bbox_inches='tight')
plt.show()

# Print max Sharpe portfolio weights
print("\nMaximum Sharpe Ratio Portfolio:")
print(f"  Expected Return: {sharpe_ret*100:.2f}%")
print(f"  Volatility:      {sharpe_vol*100:.2f}%")
print(f"  Sharpe Ratio:    {(sharpe_ret-0.02)/sharpe_vol:.4f}")
print(f"\n  Weights:")
for ticker, w in zip(tickers, w_sharpe):
    if w > 0.001:
        print(f"    {ticker:5s}: {w:>7.2%}")
Key Insight

The efficient frontier with long-only constraints is always inside the unconstrained frontier (higher risk for the same return). Constraints shrink the feasible set. However, in practice, constrained portfolios often perform better out-of-sample because the constraints act as implicit regularization, preventing the optimizer from exploiting estimation errors.

7. Why 1/N Often Beats “Optimal” Portfolios

A landmark paper by DeMiguel, Garlappi, and Uppal (2009) showed that the naive 1/N (equal-weight) portfolio outperforms most optimized portfolios out-of-sample across a wide range of datasets. This was a shock to the quantitative finance community.

7.1 The Estimation Error Explanation

The Markowitz optimizer is an error maximizer. It overweights assets whose returns are overestimated and whose risks are underestimated, and underweights the opposite. The optimization amplifies estimation errors rather than signal.

Stats Bridge

This is a textbook example of the bias-variance tradeoff. The Markowitz solution is the “unbiased” optimizer (it targets the true optimum), but it has enormous variance. The 1/N portfolio is biased (it ignores all information about μ and Σ), but it has zero estimation variance. When estimation error is large relative to signal, the biased-but-stable estimator wins. This is exactly why Ridge regression beats OLS when p is large relative to n.
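The error-maximizer effect is easy to reproduce in a sandbox. Below, a hypothetical one-factor covariance (highly correlated assets, tiny idiosyncratic variance, so Σ is ill-conditioned) and a 1% bump to a single expected return — none of these numbers come from the text — show how violently the unconstrained mean-variance weights w ∝ Σ⁻¹μ react to a small input change:

```python
import numpy as np

N = 10
# Hypothetical one-factor covariance: near-collinear assets
beta = np.linspace(0.9, 1.1, N)
Sigma = 0.04 * np.outer(beta, beta) + 1e-4 * np.eye(N)

mu = np.full(N, 0.08)
mu_noisy = mu.copy()
mu_noisy[0] += 0.01            # 1% estimation error on one asset's mean

def mv_weights(mu_vec):
    """Unconstrained mean-variance weights (up to risk aversion)."""
    return np.linalg.solve(Sigma, mu_vec)

w0, w1 = mv_weights(mu), mv_weights(mu_noisy)
rel_weight_change = np.linalg.norm(w1 - w0) / np.linalg.norm(w0)
rel_input_change = np.linalg.norm(mu_noisy - mu) / np.linalg.norm(mu)
```

A ~4% change in the inputs moves the weights by far more than 4%: the near-singular directions of Σ amplify whatever noise lands on them.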

7.2 How Many Observations Do You Need?

DeMiguel et al. derived a formula for the estimation window T* needed for Markowitz to beat 1/N. For N = 50 assets:

T* ≈ (N² ⋅ μmax²) / (σmin² ⋅ SR²)

Under reasonable calibrations, T* can be 3,000+ months (250+ years!) for 50 assets. We simply do not have enough data for Markowitz optimization to dominate the naive benchmark.

Python

# Backtest: Markowitz vs Equal-Weight vs Min Variance
# Using rolling 252-day window, rebalance monthly

def rolling_backtest(returns_df, window=252, rebalance_freq=21):
    """Compare portfolio strategies with rolling estimation."""
    n = len(returns_df)
    tickers = returns_df.columns
    n_assets = len(tickers)

    strategies = {
        'Equal Weight': [],
        'Min Variance (sample)': [],
        'Min Variance (shrunk)': [],
    }
    dates = []

    for end in range(window, n, rebalance_freq):
        train = returns_df.iloc[end-window:end].values
        test = returns_df.iloc[end:min(end+rebalance_freq, n)].values

        if len(test) == 0:
            continue

        dates.extend(returns_df.index[end:min(end+rebalance_freq, n)])

        # Equal weight
        w_ew = np.ones(n_assets) / n_assets

        # Min variance (sample covariance)
        cov_sample = np.cov(train, rowvar=False)
        try:
            w_mv_sample = global_minimum_variance(cov_sample)
            w_mv_sample = np.clip(w_mv_sample, 0, None)
            w_mv_sample /= w_mv_sample.sum()
        except np.linalg.LinAlgError:
            w_mv_sample = w_ew

        # Min variance (Ledoit-Wolf shrinkage)
        lw = LedoitWolf().fit(train)
        w_mv_shrunk = global_minimum_variance(lw.covariance_)
        w_mv_shrunk = np.clip(w_mv_shrunk, 0, None)
        w_mv_shrunk /= w_mv_shrunk.sum()

        # Compute daily returns for the test period
        for day in test:
            strategies['Equal Weight'].append(day @ w_ew)
            strategies['Min Variance (sample)'].append(day @ w_mv_sample)
            strategies['Min Variance (shrunk)'].append(day @ w_mv_shrunk)

    # Build cumulative returns
    results = pd.DataFrame(strategies, index=dates[:len(strategies['Equal Weight'])])
    cumulative = (1 + results).cumprod()

    return results, cumulative

results, cumulative = rolling_backtest(returns)

# Plot cumulative returns
plt.figure(figsize=(10, 6))
for col in cumulative.columns:
    plt.plot(cumulative[col], label=col, linewidth=1.5)
plt.title("Cumulative Returns: Equal Weight vs Min Variance")
plt.ylabel("Growth of $1")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("portfolio_backtest.png", dpi=150, bbox_inches='tight')
plt.show()

# Summary statistics
print("\nAnnualized Performance Summary:")
for col in results.columns:
    r = results[col]
    ann_ret = r.mean() * 252
    ann_vol = r.std() * np.sqrt(252)
    sharpe = ann_ret / ann_vol
    cum_ret = cumulative[col].iloc[-1] - 1
    print(f"  {col:30s}: ret={ann_ret*100:>6.2f}%  "
          f"vol={ann_vol*100:>6.2f}%  sharpe={sharpe:>5.2f}  "
          f"cumul={cum_ret*100:>6.1f}%")
Common Pitfall

Backtests of Markowitz optimization using the full sample covariance matrix will always look impressive because the optimizer has access to future information. Always use a rolling window or expanding window to simulate realistic out-of-sample conditions. The gap between in-sample and out-of-sample performance is the cost of estimation error.

8. Practical Considerations and Extensions

8.1 Transaction Costs and Turnover

Markowitz optimization can produce wildly different weights from one period to the next, generating high turnover and transaction costs. In practice, portfolios are rebalanced with a turnover penalty:

minw w'Σw + λ ⋅ ||w − wprev||1

The L1 penalty on the weight change is analogous to LASSO — it encourages sparse updates, keeping most weights at their previous values.
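A sketch of the penalized rebalance with scipy (hypothetical diagonal covariance and penalty levels; note that SLSQP handles the non-smooth L1 kink only approximately — a proper convex solver such as cvxpy would be the production choice):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical diagonal covariance and a previous (equal-weight) portfolio
Sigma = np.diag([0.09, 0.05, 0.03, 0.02, 0.01])
n = Sigma.shape[0]
w_prev = np.ones(n) / n

def objective(w, lam):
    """Variance plus an L1 penalty on turnover."""
    return w @ Sigma @ w + lam * np.abs(w - w_prev).sum()

cons = [{'type': 'eq', 'fun': lambda w: w.sum() - 1}]
bounds = [(0, 1)] * n

def rebalance(lam):
    res = minimize(objective, w_prev, args=(lam,), method='SLSQP',
                   bounds=bounds, constraints=cons)
    return res.x

w_free = rebalance(0.0)     # no penalty: full min-variance rebalance
w_sticky = rebalance(1.0)   # heavy penalty: barely trades at all

turnover_free = np.abs(w_free - w_prev).sum()
turnover_sticky = np.abs(w_sticky - w_prev).sum()
```

As λ grows, the solution stays pinned at w_prev until the risk-reduction gradient exceeds the per-unit trading cost — the same soft-thresholding behavior LASSO exhibits on coefficients.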

8.2 Black-Litterman Model

The Black-Litterman (1992) model takes a Bayesian approach to the expected return estimation problem. Instead of estimating μ from historical data, it starts with a prior (the market equilibrium returns implied by the CAPM) and updates it with the investor's subjective views:

μBL = [(τΣ)−1 + P'Ω−1P]−1 [(τΣ)−1Π + P'Ω−1Q]
Stats Bridge

This is a Bayesian posterior mean — literally the formula for the posterior mean in a conjugate normal-normal model. The prior is Π (equilibrium returns) with precision (τΣ)−1, and the likelihood is the investor's views Q with precision Ω−1. The Black-Litterman model is Bayesian shrinkage applied to expected returns.
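The posterior-mean formula is a few lines of linear algebra. A toy three-asset example (every number below is hypothetical) with a single relative view, cross-checked against the equivalent "prior + gain × surprise" form:

```python
import numpy as np

# Toy 3-asset Black-Litterman update
Sigma = np.array([[0.040, 0.010, 0.005],
                  [0.010, 0.030, 0.008],
                  [0.005, 0.008, 0.020]])
Pi = np.array([0.05, 0.04, 0.03])    # equilibrium (prior) returns
tau = 0.05

# One relative view: asset 1 outperforms asset 2 by 2%
P = np.array([[1.0, -1.0, 0.0]])
Q = np.array([0.02])
Omega = np.array([[0.0004]])         # uncertainty of the view

tauSig_inv = np.linalg.inv(tau * Sigma)
Omega_inv = np.linalg.inv(Omega)

# Posterior mean: exactly the formula in the text
mu_bl = np.linalg.solve(tauSig_inv + P.T @ Omega_inv @ P,
                        tauSig_inv @ Pi + P.T @ Omega_inv @ Q)

# Equivalent "prior + Kalman-gain * surprise" form (Woodbury identity)
gain = (tau * Sigma) @ P.T @ np.linalg.inv(P @ (tau * Sigma) @ P.T + Omega)
mu_alt = Pi + gain @ (Q - P @ Pi)

spread_prior = (P @ Pi).item()       # the prior implies a 1% spread
spread_post = (P @ mu_bl).item()     # posterior pulled toward the 2% view
```

The posterior view value lands strictly between the prior spread and the stated view Q, with the split determined by the relative precisions — the defining signature of Bayesian shrinkage.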

8.3 Risk Parity

Risk parity allocates weights so that each asset contributes equally to total portfolio risk. The risk contribution of asset i is:

RCi = wi ⋅ (Σw)i / √(w'Σw)

Risk parity sets RCi = RCj for all i, j.
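There is no closed form in general, but the RCi = RCj condition can be found numerically by penalizing the dispersion of risk contributions (the covariance below is hypothetical: asset vols of roughly 30%, 20%, and 10% with mild correlation):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical covariance for three assets
Sigma = np.array([[0.090, 0.012, 0.006],
                  [0.012, 0.040, 0.010],
                  [0.006, 0.010, 0.010]])
n = Sigma.shape[0]

def risk_contributions(w):
    """RC_i = w_i (Sigma w)_i / sqrt(w' Sigma w); they sum to portfolio vol."""
    return w * (Sigma @ w) / np.sqrt(w @ Sigma @ w)

def rp_objective(w):
    # Squared dispersion of risk contributions: zero at exact risk parity
    rc = risk_contributions(w)
    return ((rc - rc.mean()) ** 2).sum()

cons = [{'type': 'eq', 'fun': lambda w: w.sum() - 1}]
bounds = [(1e-6, 1)] * n
res = minimize(rp_objective, np.ones(n) / n, method='SLSQP',
               bounds=bounds, constraints=cons, options={'ftol': 1e-14})
w_rp = res.x
rc = risk_contributions(w_rp)
```

As expected, the low-volatility asset receives the largest weight: risk parity equalizes risk budgets, not capital.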

9. Chapter Summary

Portfolio theory is optimization, and optimization is only as good as its inputs:

- Mean-variance optimization is a constrained quadratic program; the efficient frontier is its Pareto frontier.
- The inputs μ and Σ must be estimated, and estimation error — not the optimization itself — is the binding constraint.
- Shrinkage (Ledoit-Wolf), factor structure, and portfolio constraints all act as regularization.
- Simple, heavily regularized portfolios (GMV, 1/N, risk parity) often beat the "optimal" plug-in solution out-of-sample.

Key Insight

The deep lesson of portfolio theory for statisticians: the optimal solution to a problem with estimated parameters is not the same as the solution you get by plugging estimates into the formula for the optimal solution. Estimation uncertainty changes the optimal strategy. The naive plug-in approach is known as certainty equivalence, and its failure recurs throughout quantitative finance and statistical decision theory.