Module 07: Portfolio Theory is Just Optimization
Mean-variance optimization through the lens of constrained quadratic programming
1. The Markowitz Framework: Mean-Variance Optimization
Modern Portfolio Theory (MPT), introduced by Harry Markowitz in 1952, formulates portfolio construction as a constrained quadratic optimization problem. The investor wants to choose portfolio weights to maximize expected return for a given level of risk, or equivalently, minimize risk for a given level of return.
If you've ever solved a penalized regression (Ridge, LASSO), you've solved a constrained optimization problem. Markowitz optimization is structurally identical to Ridge regression: minimize a quadratic objective (w'Σw, analogous to β'β) subject to a linear constraint (μ'w = target return, analogous to the data-fidelity term). The entire machinery of Lagrange multipliers, KKT conditions, and quadratic programming applies directly.
1.1 The Setup
Consider N assets with:
- μ ∈ RN: the vector of expected returns
- Σ ∈ RN×N: the covariance matrix of returns (symmetric, positive semi-definite)
- w ∈ RN: the vector of portfolio weights (what we're solving for)
The portfolio's expected return and variance are:

E[Rp] = w'μ    and    σp² = Var(Rp) = w'Σw
1.2 The Optimization Problem
The standard formulation minimizes portfolio variance subject to constraints:

minimize over w:  ½ w'Σw

subject to: w'μ = μtarget, w'1 = 1
The first constraint sets the target return; the second ensures weights sum to 1 (fully invested). Additional constraints might include:
- No short-selling: wi ≥ 0 for all i
- Position limits: wi ≤ 0.10 (no more than 10% in any stock)
- Sector constraints: sum of weights in tech stocks ≤ 30%
Long-only constraint: wi ≥ 0 means you can only buy stocks, not sell short. Short selling (wi < 0) means borrowing a stock, selling it, and hoping to buy it back cheaper later. Many institutional mandates prohibit short selling.
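These constraints translate directly into a quadratic program. A minimal sketch with scipy.optimize (the covariance matrix and the "tech" grouping below are illustrative toy numbers, not estimates):

```python
import numpy as np
from scipy.optimize import minimize

# Toy 4-asset universe; the first two assets play the role of "tech"
Sigma = np.array([[0.09, 0.02, 0.01, 0.00],
                  [0.02, 0.16, 0.01, 0.01],
                  [0.01, 0.01, 0.04, 0.01],
                  [0.00, 0.01, 0.01, 0.06]])
tech = np.array([1.0, 1.0, 0.0, 0.0])   # sector indicator vector

cons = [
    {'type': 'eq',   'fun': lambda w: w.sum() - 1},      # fully invested
    {'type': 'ineq', 'fun': lambda w: 0.30 - tech @ w},  # tech sleeve <= 30%
]
bounds = [(0.0, 0.50)] * 4                               # long-only, <= 50% each

w0 = np.array([0.10, 0.10, 0.40, 0.40])                  # feasible start
res = minimize(lambda w: w @ Sigma @ w, w0,
               method='SLSQP', bounds=bounds, constraints=cons)
print(res.x, tech @ res.x)
```

Equality constraints use `'eq'`, inequalities use `'ineq'` (feasible when the function is ≥ 0), and per-asset bounds handle the long-only and position-limit constraints.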
2. The Efficient Frontier: Pareto Optimality in (σ, μ) Space
As we sweep the target return μtarget from its minimum to its maximum, each solution traces out a curve in (σ, μ) space. This curve is the efficient frontier — the set of portfolios that offer the maximum expected return for each level of risk.
The efficient frontier is precisely the Pareto frontier of a multi-objective optimization problem (minimize risk, maximize return). In statistics, you encounter Pareto frontiers in bias-variance tradeoffs: the set of estimators that achieve the minimum variance for each level of bias. The efficient frontier is the bias-variance tradeoff of investing.
2.1 The Analytical Solution (No Inequality Constraints)
Without inequality constraints (allowing short selling), the problem is a quadratic program with only equality constraints, and the solution has a closed form via Lagrange multipliers:

L(w, λ1, λ2) = ½ w'Σw − λ1(w'μ − μtarget) − λ2(w'1 − 1)

Taking the gradient with respect to w and setting it to zero:

Σw − λ1μ − λ21 = 0  ⇒  w* = Σ−1(λ1μ + λ21)
Substituting back into the constraints gives a 2×2 system in (λ1, λ2) that can be solved in closed form. The resulting efficient frontier is a hyperbola in (σ, μ) space.
2.2 Key Constants
Define the following scalars (which appear throughout portfolio theory):
| Constant | Formula | Interpretation |
|---|---|---|
| A | 1'Σ−1μ | How much return the “inverse-variance weighted” portfolio earns |
| B | μ'Σ−1μ | Quadratic form of returns in the precision matrix |
| C | 1'Σ−11 | Total “precision” of the asset universe |
| D | BC − A2 | Discriminant (must be > 0) |
The minimum variance for target return μtarget is:

σ²(μtarget) = (C·μtarget² − 2A·μtarget + B) / D
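A minimal numerical check of these constants and the closed-form solution, using an illustrative 3-asset covariance matrix (the numbers are made up for the demo, not estimated):

```python
import numpy as np

# Illustrative 3-asset inputs
mu = np.array([0.08, 0.10, 0.12])
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
inv = np.linalg.inv(Sigma)
ones = np.ones(3)

A = ones @ inv @ mu
B = mu @ inv @ mu
C = ones @ inv @ ones
D = B * C - A**2
assert D > 0                       # required for the frontier to exist

target = 0.10
var_analytic = (C * target**2 - 2 * A * target + B) / D

# The two constraints give a 2x2 linear system in (lam1, lam2)
lam1, lam2 = np.linalg.solve(np.array([[B, A], [A, C]]),
                             np.array([target, 1.0]))
w_star = inv @ (lam1 * mu + lam2 * ones)
print(w_star, var_analytic)
```

The weights recovered from the 2×2 system hit both constraints exactly, and their variance matches the hyperbola formula term for term.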
3. The Covariance Matrix Estimation Problem
Here is where the statistician's alarm bells should start ringing. The optimization requires two inputs: the expected return vector μ and the covariance matrix Σ. Both must be estimated from data, and the estimation is hard.
3.1 The Dimensionality Problem
For N assets, the covariance matrix has N(N+1)/2 unique parameters (it's symmetric). The sample covariance matrix is:

Σ̂ = (1/(T−1)) Σt=1..T (rt − r̄)(rt − r̄)'
| N (assets) | Parameters to estimate | T needed (10× oversampling) |
|---|---|---|
| 10 | 55 | 550 |
| 50 | 1,275 | 12,750 |
| 100 | 5,050 | 50,500 |
| 500 | 125,250 | 1,252,500 |
| 3,000 | 4,501,500 | 45,015,000 |
With 500 stocks and 5 years of daily data (~1,250 observations), you're estimating 125,250 parameters from 1,250 observations. The sample covariance matrix is singular when T < N, and severely ill-conditioned when T is only slightly larger than N. Plugging this directly into Markowitz optimization produces garbage.
This is exactly the p >> n problem from high-dimensional statistics. The sample covariance matrix is the MLE, but it's a terrible estimator in high dimensions. The eigenvalues of Σ̂ are systematically dispersed relative to the true eigenvalues (the Marchenko-Pastur law describes this dispersion). This is why shrinkage, factor models, and regularization are essential in portfolio optimization — for the same reason they're essential in high-dimensional regression.
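A quick simulation makes the eigenvalue dispersion concrete. Here the true covariance is the identity (all eigenvalues exactly 1), yet with T barely above N the sample eigenvalues scatter across almost the full Marchenko-Pastur support:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 120                    # almost as many assets as observations
X = rng.standard_normal((T, N))    # true covariance is the identity
S = np.cov(X, rowvar=False)        # sample covariance

eig = np.linalg.eigvalsh(S)
print(f"sample eigenvalues: min={eig.min():.3f}, max={eig.max():.3f}")

# Marchenko-Pastur support is roughly [(1-sqrt(N/T))^2, (1+sqrt(N/T))^2]
q = N / T
print(f"MP bounds: [{(1 - np.sqrt(q))**2:.3f}, {(1 + np.sqrt(q))**2:.3f}]")
```

The smallest sample eigenvalues are close to zero even though the true ones are all 1, which is exactly what makes the inverse Σ̂⁻¹ (and hence the Markowitz weights) explode.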
3.2 Shrinkage Estimators: The Ledoit-Wolf Approach
Ledoit and Wolf (2004) proposed shrinking the sample covariance matrix toward a structured target:

Σ̂shrunk = δF + (1 − δ)S

where S is the sample covariance matrix, F is a structured target (e.g., the identity matrix scaled by the average variance, or a single-factor model covariance), and δ ∈ [0, 1] is the optimal shrinkage intensity, determined by minimizing the expected Frobenius loss:

δ* = argmin over δ of  E‖δF + (1 − δ)S − Σ‖F²
```python
import numpy as np
from sklearn.covariance import LedoitWolf
import yfinance as yf

# Download data for a basket of stocks
tickers = ["AAPL", "MSFT", "GOOGL", "AMZN", "META",
           "JPM", "BAC", "WFC", "GS", "MS",
           "JNJ", "PFE", "UNH", "MRK", "ABBV",
           "XOM", "CVX", "COP", "SLB", "EOG"]
# auto_adjust=False keeps the "Adj Close" column in recent yfinance versions
data = yf.download(tickers, start="2020-01-01", end="2024-01-01",
                   auto_adjust=False)["Adj Close"]
returns = np.log(data / data.shift(1)).dropna()

# Sample covariance matrix
sample_cov = returns.cov().values

# Ledoit-Wolf shrinkage
lw = LedoitWolf().fit(returns.values)
shrunk_cov = lw.covariance_
shrinkage_intensity = lw.shrinkage_

print(f"Number of assets: {len(tickers)}")
print(f"Number of observations: {len(returns)}")
print(f"Parameters in covariance: {len(tickers)*(len(tickers)+1)//2}")
print(f"Ledoit-Wolf shrinkage intensity: {shrinkage_intensity:.4f}")

# Compare condition numbers
cond_sample = np.linalg.cond(sample_cov)
cond_shrunk = np.linalg.cond(shrunk_cov)
print(f"\nCondition number (sample): {cond_sample:.1f}")
print(f"Condition number (shrunk): {cond_shrunk:.1f}")
print(f"Reduction factor: {cond_sample/cond_shrunk:.1f}x")
```
The shrinkage intensity δ is the analogue of the regularization parameter λ in Ridge regression. High δ means the data is unreliable and we should lean on the prior structure; low δ means the data is informative. The Ledoit-Wolf formula for δ* is an analytical cross-validation formula — no grid search needed.
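The shrinkage step itself is one line of linear algebra. A self-contained sketch with synthetic data and a fixed δ (where Ledoit-Wolf would estimate δ from the data) shows the conditioning improvement:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 50, 100
# Synthetic returns through a random mixing matrix -> ill-conditioned sample cov
X = rng.standard_normal((T, N)) @ rng.standard_normal((N, N)) * 0.01
S = np.cov(X, rowvar=False)

delta = 0.3                          # fixed here for illustration
F = np.eye(N) * np.trace(S) / N      # target: identity scaled by average variance
S_shrunk = delta * F + (1 - delta) * S   # convex combination

print(np.linalg.cond(S), np.linalg.cond(S_shrunk))
```

Because the target adds a strictly positive floor to every eigenvalue, the shrunk matrix is always better conditioned than the sample matrix.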
4. The Global Minimum Variance Portfolio
The simplest Markowitz portfolio ignores expected returns entirely and just minimizes variance:

minimize over w:  w'Σw   subject to  w'1 = 1

The analytical solution is:

wGMV = Σ−11 / (1'Σ−11)
The GMV portfolio weights are proportional to the row sums of the precision matrix (inverse covariance). Assets with low correlation to the rest of the universe receive higher weight. This is analogous to the optimal weighting in meta-analysis, where studies are weighted by the inverse of their variance — more precise estimates get more weight.
4.1 Why the GMV Portfolio Is Appealing
Expected returns are notoriously difficult to estimate. A classic result (Merton, 1980) shows that with T observations, the standard error of the estimated mean return is σ/√T, while the standard error of estimated volatility is σ/√(2T). To estimate the mean as precisely as the volatility, you need roughly twice as much data; worse, the precision of the mean estimate depends only on the total time span, not on the sampling frequency.
| Parameter | Standard Error | With T=250 (1 year), σ=20% |
|---|---|---|
| Mean return μ | σ / √T | 1.27% (vs ~8% true mean) |
| Volatility σ | σ / √(2T) | 0.89% (vs 20% true vol) |
The mean is estimated with a standard error that is a significant fraction of the signal, while volatility is estimated much more precisely. The GMV portfolio sidesteps the hardest estimation problem entirely.
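A quick Monte Carlo check of these standard errors (simulated i.i.d. normal daily returns; the seed and simulation count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
T = 250
sigma = 0.20 / np.sqrt(252)   # daily vol corresponding to 20% annual
n_sims = 20000

r = rng.normal(0.0, sigma, size=(n_sims, T))
se_mean = r.mean(axis=1).std()           # spread of the estimated mean
se_vol = r.std(axis=1, ddof=1).std()     # spread of the estimated volatility

print(se_mean / (sigma / np.sqrt(T)))      # ratio to sigma/sqrt(T), near 1
print(se_vol / (sigma / np.sqrt(2 * T)))   # ratio to sigma/sqrt(2T), near 1
```

Both empirical standard errors land within about a percent of the Merton formulas.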
```python
def global_minimum_variance(cov_matrix):
    """Compute the Global Minimum Variance portfolio weights."""
    n = cov_matrix.shape[0]
    ones = np.ones(n)
    inv_cov = np.linalg.inv(cov_matrix)
    # w_GMV = inv(Sigma) @ 1 / (1' @ inv(Sigma) @ 1)
    w = inv_cov @ ones / (ones @ inv_cov @ ones)
    return w

# Using the shrunk covariance matrix
w_gmv = global_minimum_variance(shrunk_cov)

print("Global Minimum Variance Portfolio Weights:")
for ticker, weight in zip(tickers, w_gmv):
    print(f"  {ticker:5s}: {weight:>7.2%}")

# Portfolio statistics (annualized from daily returns)
gmv_return = (returns.mean().values @ w_gmv) * 252
gmv_vol = np.sqrt(w_gmv @ shrunk_cov @ w_gmv) * np.sqrt(252)
print(f"\nAnnualized Return: {gmv_return:.4f} ({gmv_return*100:.2f}%)")
print(f"Annualized Volatility: {gmv_vol:.4f} ({gmv_vol*100:.2f}%)")
```
5. Diversification: Variance Reduction Through Averaging
The fundamental insight of portfolio theory is that combining assets reduces risk. The portfolio variance formula reveals exactly how:

σp² = w'Σw = Σi wi²σi² + Σi≠j wiwj ρij σiσj
5.1 The Equal-Weight Case
For N assets with equal weights wi = 1/N, equal variances σ², and equal pairwise correlations ρ:

σp² = (1/N)σ² + (1 − 1/N)ρσ²

As N → ∞:

σp² → ρσ²
This is identical to averaging correlated estimators. If you average N estimators that each have variance σ2 and pairwise correlation ρ, the variance of the average converges to ρσ2, not zero. The irreducible ρσ2 is systematic risk — it cannot be diversified away. The (1/N)σ2 that vanishes is idiosyncratic risk. In statistics terms: systematic risk is bias, idiosyncratic risk is variance.
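The closed form σ²/N + (1 − 1/N)ρσ² is easy to verify directly against the matrix formula on an equicorrelation covariance (illustrative σ and ρ):

```python
import numpy as np

sigma, rho = 0.20, 0.30
for N in [2, 10, 100, 1000]:
    # Equicorrelation covariance: sigma^2 on the diagonal, rho*sigma^2 off it
    Sigma = sigma**2 * (rho * np.ones((N, N)) + (1 - rho) * np.eye(N))
    w = np.ones(N) / N
    port_var = w @ Sigma @ w
    formula = sigma**2 / N + (1 - 1 / N) * rho * sigma**2
    print(N, round(port_var, 6), round(formula, 6))
# Both columns converge to rho * sigma**2 = 0.012 as N grows
```

By N = 1000 the portfolio variance is already within a few basis points of the irreducible floor ρσ².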
```python
import pandas as pd
import matplotlib.pyplot as plt

def diversification_curve(returns_df, n_portfolios=1000):
    """Show how portfolio volatility decreases with number of assets."""
    tickers = returns_df.columns.tolist()
    n_assets = len(tickers)
    results = []
    for n in range(1, n_assets + 1):
        vols = []
        for _ in range(n_portfolios if n < n_assets else 1):
            # Random subset of n assets, equal weight
            subset = np.random.choice(tickers, n, replace=False)
            w = np.ones(n) / n
            sub_returns = returns_df[subset].values
            # atleast_2d: np.cov squeezes to a 0-d array when n == 1
            cov = np.atleast_2d(np.cov(sub_returns, rowvar=False))
            port_vol = np.sqrt(w @ cov @ w) * np.sqrt(252)
            vols.append(port_vol)
        results.append({
            'n': n,
            'mean_vol': np.mean(vols),
            'min_vol': np.min(vols),
            'max_vol': np.max(vols)
        })
    df = pd.DataFrame(results)

    plt.figure(figsize=(10, 6))
    plt.fill_between(df['n'], df['min_vol']*100, df['max_vol']*100,
                     alpha=0.2, color='#1a365d')
    plt.plot(df['n'], df['mean_vol']*100, 'o-', color='#1a365d',
             linewidth=2, markersize=4)
    plt.xlabel("Number of Assets in Portfolio")
    plt.ylabel("Annualized Volatility (%)")
    plt.title("Diversification: Volatility vs Number of Assets")
    plt.grid(True, alpha=0.3)

    # Add theoretical limit: systematic vol = sqrt(avg corr) * avg vol
    avg_corr = returns_df.corr().values[np.triu_indices(n_assets, k=1)].mean()
    avg_vol = returns_df.std().mean() * np.sqrt(252)
    systematic_vol = np.sqrt(avg_corr) * avg_vol * 100
    plt.axhline(y=systematic_vol, color='#e53e3e', linestyle='--',
                label=f'Systematic risk floor ({systematic_vol:.1f}%)')
    plt.legend()
    plt.tight_layout()
    plt.savefig("diversification_curve.png", dpi=150, bbox_inches='tight')
    plt.show()

diversification_curve(returns)
```
6. Building the Efficient Frontier in Python
Now let's build the efficient frontier numerically using scipy.optimize.
This handles the general case with inequality constraints (long-only, position limits).
```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize
import matplotlib.pyplot as plt

# ── Setup ──────────────────────────────────────────────────
mu = returns.mean().values * 252   # annualized expected returns
Sigma = shrunk_cov * 252           # annualized covariance
n_assets = len(tickers)

# ── Portfolio optimization functions ───────────────────────
def portfolio_vol(w, Sigma):
    """Portfolio standard deviation."""
    return np.sqrt(w @ Sigma @ w)

def portfolio_ret(w, mu):
    """Portfolio expected return."""
    return w @ mu

def neg_sharpe(w, mu, Sigma, rf=0.02):
    """Negative Sharpe ratio (for minimization)."""
    ret = w @ mu
    vol = np.sqrt(w @ Sigma @ w)
    return -(ret - rf) / vol

# ── Constraints and bounds ─────────────────────────────────
constraints = [
    {'type': 'eq', 'fun': lambda w: np.sum(w) - 1}  # fully invested
]
bounds = [(0, 0.25)] * n_assets  # long-only, max 25% per asset

# ── Compute efficient frontier ─────────────────────────────
# Step 1: Find the feasible return range
w0 = np.ones(n_assets) / n_assets  # starting point

# Minimum variance portfolio
res_min = minimize(
    portfolio_vol, w0, args=(Sigma,),
    method='SLSQP', bounds=bounds, constraints=constraints
)
w_min_var = res_min.x
min_ret = portfolio_ret(w_min_var, mu)
min_vol = portfolio_vol(w_min_var, Sigma)

# Maximum return portfolio
res_max = minimize(
    lambda w: -portfolio_ret(w, mu),
    w0, method='SLSQP', bounds=bounds, constraints=constraints
)
max_ret = portfolio_ret(res_max.x, mu)

# Step 2: Sweep target returns across the feasible range
target_returns = np.linspace(min_ret, max_ret * 0.99, 50)
frontier_vols = []
frontier_weights = []
for target in target_returns:
    cons = [
        {'type': 'eq', 'fun': lambda w: np.sum(w) - 1},
        {'type': 'eq', 'fun': lambda w, t=target: w @ mu - t}
    ]
    res = minimize(
        portfolio_vol, w0, args=(Sigma,),
        method='SLSQP', bounds=bounds, constraints=cons
    )
    frontier_vols.append(portfolio_vol(res.x, Sigma))
    frontier_weights.append(res.x)
frontier_vols = np.array(frontier_vols)

# Step 3: Find the maximum Sharpe ratio portfolio
res_sharpe = minimize(
    neg_sharpe, w0, args=(mu, Sigma, 0.02),
    method='SLSQP', bounds=bounds, constraints=constraints
)
w_sharpe = res_sharpe.x
sharpe_ret = portfolio_ret(w_sharpe, mu)
sharpe_vol = portfolio_vol(w_sharpe, Sigma)

# ── Plot ───────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(10, 7))

# Individual assets
for i, ticker in enumerate(tickers):
    vol_i = np.sqrt(Sigma[i, i])
    ax.scatter(vol_i*100, mu[i]*100, s=40, zorder=5, color='gray')
    ax.annotate(ticker, (vol_i*100, mu[i]*100),
                fontsize=7, ha='left', va='bottom')

# Efficient frontier
ax.plot(frontier_vols*100, target_returns*100,
        'b-', linewidth=2.5, label='Efficient Frontier')

# Special portfolios
ax.scatter(min_vol*100, min_ret*100, s=150, marker='*',
           color='green', zorder=10, label='Min Variance')
ax.scatter(sharpe_vol*100, sharpe_ret*100, s=150, marker='*',
           color='red', zorder=10, label='Max Sharpe')

# Capital Market Line (rf expressed in percent to match the plot axes)
rf = 2.0
cml_x = np.linspace(0, max(frontier_vols)*100*1.1, 100)
sharpe_ratio = (sharpe_ret*100 - rf) / (sharpe_vol*100)
cml_y = rf + sharpe_ratio * cml_x
ax.plot(cml_x, cml_y, 'r--', alpha=0.5, label='Capital Market Line')

ax.set_xlabel("Annualized Volatility (%)")
ax.set_ylabel("Annualized Expected Return (%)")
ax.set_title("Efficient Frontier with Long-Only Constraints")
ax.legend(loc='upper left')
ax.grid(True, alpha=0.3)
ax.set_xlim(left=0)
plt.tight_layout()
plt.savefig("efficient_frontier.png", dpi=150, bbox_inches='tight')
plt.show()

# Print max Sharpe portfolio weights
print("\nMaximum Sharpe Ratio Portfolio:")
print(f"  Expected Return: {sharpe_ret*100:.2f}%")
print(f"  Volatility: {sharpe_vol*100:.2f}%")
print(f"  Sharpe Ratio: {(sharpe_ret-0.02)/sharpe_vol:.4f}")
print(f"\n  Weights:")
for ticker, w in zip(tickers, w_sharpe):
    if w > 0.001:
        print(f"  {ticker:5s}: {w:>7.2%}")
```
The efficient frontier with long-only constraints is always inside the unconstrained frontier (higher risk for the same return). Constraints shrink the feasible set. However, in practice, constrained portfolios often perform better out-of-sample because the constraints act as implicit regularization, preventing the optimizer from exploiting estimation errors.
7. Why 1/N Often Beats “Optimal” Portfolios
A landmark paper by DeMiguel, Garlappi, and Uppal (2009) showed that the naive 1/N (equal-weight) portfolio outperforms most optimized portfolios out-of-sample across a wide range of datasets. This was a shock to the quantitative finance community.
7.1 The Estimation Error Explanation
The Markowitz optimizer is an error maximizer. It overweights assets whose returns are overestimated and whose risks are underestimated, and underweights the opposite. The optimization amplifies estimation errors rather than signal.
This is a textbook example of the bias-variance tradeoff. The Markowitz solution is the “unbiased” optimizer (it targets the true optimum), but it has enormous variance. The 1/N portfolio is biased (it ignores all information about μ and Σ), but it has zero estimation variance. When estimation error is large relative to signal, the biased-but-stable estimator wins. This is exactly why Ridge regression beats OLS when p is large relative to n.
7.2 How Many Observations Do You Need?
DeMiguel et al. derived the critical estimation window T* needed for mean-variance optimization to beat 1/N out-of-sample. Under reasonable calibrations, T* can be 3,000+ months (250+ years!) for N = 50 assets. We simply do not have enough data for Markowitz optimization to dominate the naive benchmark.
```python
# Backtest: Markowitz vs Equal-Weight vs Min Variance
# Rolling 252-day estimation window, rebalance monthly
def rolling_backtest(returns_df, window=252, rebalance_freq=21):
    """Compare portfolio strategies with rolling estimation."""
    n = len(returns_df)
    n_assets = len(returns_df.columns)
    strategies = {
        'Equal Weight': [],
        'Min Variance (sample)': [],
        'Min Variance (shrunk)': [],
    }
    dates = []
    for end in range(window, n, rebalance_freq):
        train = returns_df.iloc[end-window:end].values
        test = returns_df.iloc[end:min(end+rebalance_freq, n)].values
        if len(test) == 0:
            continue
        dates.extend(returns_df.index[end:min(end+rebalance_freq, n)])

        # Equal weight
        w_ew = np.ones(n_assets) / n_assets

        # Min variance (sample covariance)
        cov_sample = np.cov(train, rowvar=False)
        try:
            w_mv_sample = global_minimum_variance(cov_sample)
            w_mv_sample = np.clip(w_mv_sample, 0, None)
            w_mv_sample /= w_mv_sample.sum()
        except np.linalg.LinAlgError:
            w_mv_sample = w_ew

        # Min variance (Ledoit-Wolf shrinkage)
        lw = LedoitWolf().fit(train)
        w_mv_shrunk = global_minimum_variance(lw.covariance_)
        w_mv_shrunk = np.clip(w_mv_shrunk, 0, None)
        w_mv_shrunk /= w_mv_shrunk.sum()

        # Compute daily returns for the test period
        for day in test:
            strategies['Equal Weight'].append(day @ w_ew)
            strategies['Min Variance (sample)'].append(day @ w_mv_sample)
            strategies['Min Variance (shrunk)'].append(day @ w_mv_shrunk)

    # Build cumulative returns
    results = pd.DataFrame(strategies, index=dates[:len(strategies['Equal Weight'])])
    cumulative = (1 + results).cumprod()
    return results, cumulative

results, cumulative = rolling_backtest(returns)

# Plot cumulative returns
plt.figure(figsize=(10, 6))
for col in cumulative.columns:
    plt.plot(cumulative[col], label=col, linewidth=1.5)
plt.title("Cumulative Returns: Equal Weight vs Min Variance")
plt.ylabel("Growth of $1")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("portfolio_backtest.png", dpi=150, bbox_inches='tight')
plt.show()

# Summary statistics
print("\nAnnualized Performance Summary:")
for col in results.columns:
    r = results[col]
    ann_ret = r.mean() * 252
    ann_vol = r.std() * np.sqrt(252)
    sharpe = ann_ret / ann_vol
    cum_ret = cumulative[col].iloc[-1] - 1
    print(f"  {col:30s}: ret={ann_ret*100:>6.2f}% "
          f"vol={ann_vol*100:>6.2f}% sharpe={sharpe:>5.2f} "
          f"cumul={cum_ret*100:>6.1f}%")
```
Backtests of Markowitz optimization using the full sample covariance matrix will always look impressive because the optimizer has access to future information. Always use a rolling window or expanding window to simulate realistic out-of-sample conditions. The gap between in-sample and out-of-sample performance is the cost of estimation error.
8. Practical Considerations and Extensions
8.1 Transaction Costs and Turnover
Markowitz optimization can produce wildly different weights from one period to the next, generating high turnover and transaction costs. In practice, portfolios are rebalanced with a turnover penalty:

minimize over w:  w'Σw − λ·w'μ + γ‖w − wprev‖1
The L1 penalty on the weight change is analogous to LASSO — it encourages sparse updates, keeping most weights at their previous values.
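A sketch of a turnover-penalized rebalance with scipy (all inputs synthetic; the risk-aversion weight and γ are illustrative, and SLSQP tolerates the L1 kink in practice, though a smooth reformulation with auxiliary variables is cleaner):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n = 5
A = rng.standard_normal((n, n))
Sigma = A @ A.T / n + 0.1 * np.eye(n)    # synthetic positive-definite covariance
mu = rng.normal(0.08, 0.04, n)           # synthetic expected returns
w_prev = np.ones(n) / n                  # current holdings

def objective(w, gamma):
    # mean-variance tradeoff plus an L1 penalty on the trade size
    return w @ Sigma @ w - 0.5 * (w @ mu) + gamma * np.abs(w - w_prev).sum()

cons = [{'type': 'eq', 'fun': lambda w: w.sum() - 1}]
turnover = {}
for gamma in [0.0, 0.05]:
    res = minimize(objective, w_prev, args=(gamma,),
                   method='SLSQP', bounds=[(0, 1)] * n, constraints=cons)
    turnover[gamma] = np.abs(res.x - w_prev).sum()
    print(f"gamma={gamma}: turnover={turnover[gamma]:.3f}")
```

Raising γ pulls the solution back toward the previous weights, trading a little optimality for much lower transaction cost.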
8.2 Black-Litterman Model
The Black-Litterman (1992) model takes a Bayesian approach to the expected return estimation problem. Instead of estimating μ from historical data, it starts with a prior (the market equilibrium returns implied by the CAPM) and updates it with the investor's subjective views:

μBL = [(τΣ)−1 + P'Ω−1P]−1 [(τΣ)−1Π + P'Ω−1Q]

where Π is the vector of equilibrium returns, P maps assets to views, Q holds the view values, Ω is the view uncertainty, and τ scales the prior covariance.
This is a Bayesian posterior mean — literally the formula for the posterior mean in a conjugate normal-normal model. The prior is Π (equilibrium returns) with precision (τΣ)−1, and the likelihood is the investor's views Q with precision Ω−1. The Black-Litterman model is Bayesian shrinkage applied to expected returns.
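The posterior mean in code, on a toy 3-asset example (all numbers illustrative; P is the "pick" matrix encoding one relative view):

```python
import numpy as np

Pi = np.array([0.05, 0.07, 0.09])        # equilibrium (prior) returns
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
tau = 0.05

# One view: asset 3 will outperform asset 2 by 4%
P = np.array([[0.0, -1.0, 1.0]])
Q = np.array([0.04])
Omega = np.array([[0.01]])               # uncertainty attached to the view

prior_prec = np.linalg.inv(tau * Sigma)  # prior precision (tau*Sigma)^-1
view_prec = P.T @ np.linalg.inv(Omega) @ P
mu_bl = np.linalg.solve(prior_prec + view_prec,
                        prior_prec @ Pi + P.T @ np.linalg.inv(Omega) @ Q)
print(mu_bl)
```

The posterior spread between assets 3 and 2 lands strictly between the prior spread (2%) and the stated view (4%), with the exact position set by the relative precisions, exactly as in a conjugate normal-normal update.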
8.3 Risk Parity
Risk parity allocates weights so that each asset contributes equally to total portfolio risk. The risk contribution of asset i is:

RCi = wi · (Σw)i / √(w'Σw)

These contributions sum to the portfolio volatility: Σi RCi = σp.
Risk parity sets RCi = RCj for all i, j.
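A numerical risk-parity sketch: minimize the squared deviations of the risk contributions from their common mean (synthetic covariance with deliberately different vols; SLSQP as elsewhere in this module):

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic 4-asset covariance (numbers illustrative), equicorrelation 0.3
vols = np.array([0.10, 0.15, 0.20, 0.30])
corr = 0.3 * np.ones((4, 4)) + 0.7 * np.eye(4)
Sigma = np.outer(vols, vols) * corr

def risk_contributions(w):
    port_vol = np.sqrt(w @ Sigma @ w)
    return w * (Sigma @ w) / port_vol     # RC_i; these sum to port_vol

def rp_objective(w):
    rc = risk_contributions(w)
    return ((rc - rc.mean()) ** 2).sum()  # zero exactly at risk parity

res = minimize(rp_objective, np.ones(4) / 4, method='SLSQP',
               bounds=[(1e-6, 1.0)] * 4,
               constraints=[{'type': 'eq', 'fun': lambda w: w.sum() - 1}],
               options={'ftol': 1e-14, 'maxiter': 500})
w_rp = res.x
rc = risk_contributions(w_rp)
print(w_rp, rc)   # low-vol assets get the larger weights
```

Note the inversion relative to naive weighting: the 10%-vol asset ends up with the largest weight, because equalizing contributions means underweighting the risky assets.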
9. Chapter Summary
Portfolio theory is optimization, and optimization is only as good as its inputs:
- Markowitz mean-variance optimization is a constrained quadratic program — the same machinery as Ridge regression and penalized estimation.
- The efficient frontier is a Pareto frontier in risk-return space, analogous to the bias-variance tradeoff curve.
- Covariance estimation is the critical bottleneck. With N assets, you need to estimate N(N+1)/2 parameters, and the sample covariance is a poor estimator in high dimensions.
- Shrinkage estimators (Ledoit-Wolf) and structured models dramatically improve covariance estimation — the same regularization you use in high-dimensional statistics.
- The global minimum variance portfolio avoids the hardest estimation problem (expected returns) and often performs well.
- Equal-weight portfolios beat “optimal” ones when estimation error is large — a direct application of the bias-variance tradeoff.
- Diversification is variance reduction through averaging correlated estimators, with systematic risk as the irreducible floor.
The deep lesson of portfolio theory for statisticians: the optimal solution to a problem with estimated parameters is not the same as the solution you get by plugging estimates into the formula for the optimal solution. Estimation uncertainty changes the optimal strategy. This principle, the failure of naive "plug-in" (certainty-equivalent) decision rules, recurs throughout quantitative finance and statistical decision theory.