Module 18: Backtesting & Strategy Evaluation
Judging investment strategies with the rigor of statistical hypothesis testing
Introduction: Judging Strategies Like a Statistician
Backtesting is the process of evaluating a trading strategy on historical data. It is the financial equivalent of out-of-sample model validation — but with far more ways to go wrong. A beautifully backtested strategy that earned 40% per year with a Sharpe of 3.0 might be nothing more than an artifact of overfitting, data snooping, and biased simulation. This module equips you to evaluate strategies with the same rigor you would apply to a statistical study.
Backtesting is hypothesis testing applied to investment strategies. The null hypothesis is that the strategy has no alpha (excess risk-adjusted return). The test statistic is the Sharpe ratio. The critical question is always: is the observed performance statistically distinguishable from luck? Every concept in this module maps to a familiar statistical idea.
1. The Sharpe Ratio: Finance's Signal-to-Noise Ratio
1.1 Definition
SR = E[Rp − Rf] / σ(Rp − Rf) = μexcess / σexcess
where Rp is the portfolio return, Rf is the risk-free rate, μexcess is the mean excess return, and σexcess is the standard deviation of excess returns.
The Sharpe ratio is literally the signal-to-noise ratio (SNR) of the excess return process. It is also proportional to the t-statistic for testing whether the mean excess return is significantly different from zero: t = SR × √n, where n is the number of observations. A Sharpe ratio of 0.5 per year with 10 years of data gives t = 0.5 × √10 ≈ 1.58, which is not significant at the 5% level.
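The Sharpe-to-t-statistic mapping can be sketched in a few lines (a minimal illustration; the function name is ours):

```python
import math

def sharpe_t_stat(annual_sharpe: float, years: float) -> float:
    """t-statistic for H0: true Sharpe = 0, given an annualized Sharpe
    ratio and the number of years of data (t = SR x sqrt(n))."""
    return annual_sharpe * math.sqrt(years)

# 10 years at an annual Sharpe of 0.5: below the ~1.96 critical value
print(round(sharpe_t_stat(0.5, 10), 2))  # 1.58
```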
1.2 Interpreting the Sharpe Ratio
| Annualized Sharpe | Quality | Context |
|---|---|---|
| < 0.5 | Poor | Not distinguishable from random noise |
| 0.5 – 1.0 | Acceptable | Typical for long-only equity strategies |
| 1.0 – 2.0 | Good | Achievable by well-designed quantitative strategies |
| 2.0 – 3.0 | Excellent | Top-tier hedge funds; warrants scrutiny for overfitting |
| > 3.0 | Suspicious | Very likely backtest artifact; almost never achieved live |
1.3 Annualizing the Sharpe Ratio
SRannual = SRdaily × √252
This scaling assumes returns are independently and identically distributed (IID). The √252 factor comes from the fact that the variance of a sum of n IID variables scales as n, so its standard deviation scales as √n, while its mean scales as n. The ratio of mean to standard deviation therefore scales as √n.
The IID assumption behind √252 scaling is wrong for most strategies. If returns are positively autocorrelated (momentum strategies), the naive annualized Sharpe is overstated; if they are negatively autocorrelated (mean-reversion strategies), it is understated. The correction divides by a factor reflecting the autocorrelation structure: SRannual = SRdaily × √252 / √(1 + 2∑k=1…251 (1 − k/252)ρk), where ρk is the lag-k autocorrelation of daily returns.
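As a sketch of the correction (sample autocorrelations estimated up to a fixed lag; the function name and the AR(1) example are ours), compare naive and adjusted annualization on positively autocorrelated returns:

```python
import numpy as np

def annualize_sharpe(daily_returns, max_lag=20):
    """Naive sqrt(252) annualization vs the autocorrelation-adjusted
    version, which divides by sqrt(1 + 2 * sum((1 - k/252) * rho_k))."""
    r = np.asarray(daily_returns, dtype=float)
    daily_sr = r.mean() / r.std(ddof=1)
    naive = daily_sr * np.sqrt(252)

    # Sample autocorrelations rho_1 .. rho_max_lag
    centered = r - r.mean()
    denom = np.sum(centered**2)
    ks = np.arange(1, max_lag + 1)
    rhos = np.array([np.sum(centered[k:] * centered[:-k]) / denom for k in ks])

    correction = 1 + 2 * np.sum((1 - ks / 252) * rhos)
    adjusted = naive / np.sqrt(max(correction, 1e-12))
    return naive, adjusted

# AR(1) daily returns with positive autocorrelation (momentum-like)
rng = np.random.default_rng(0)
eps = rng.normal(0.0005, 0.01, 252 * 40)
r = np.empty_like(eps)
r[0] = eps[0]
for t in range(1, len(eps)):
    r[t] = 0.2 * r[t - 1] + eps[t]

naive, adjusted = annualize_sharpe(r)
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")  # adjusted < naive
```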
1.4 Statistical Significance of the Sharpe Ratio
Under the null hypothesis that the true Sharpe ratio is zero, the estimated Sharpe ratio follows approximately:
SE(SR̂) ≈ √((1 + SR²/2) / n)
Simplified: SE(SR̂) ≈ 1/√n (for small SR)
This means you need many years of data to statistically confirm even a genuine Sharpe ratio. With daily data:
| True Annual Sharpe | Daily Sharpe | Years for t=2 (p<0.05) |
|---|---|---|
| 0.5 | 0.031 | 16 years |
| 1.0 | 0.063 | 4 years |
| 2.0 | 0.126 | 1 year |
A strategy with an annualized Sharpe of 0.5 requires 16 years of daily data to be statistically significant. Most backtests span 5–10 years. This means that for many realistic Sharpe ratios, we literally do not have enough data to distinguish signal from noise. This sobering fact underlies the entire difficulty of strategy evaluation.
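Under the t = SRdaily × √n approximation, the required history collapses to (t_target / SRannual)² years, which reproduces the table above (a back-of-the-envelope sketch; the function name is ours):

```python
import math

def years_for_significance(annual_sharpe, t_target=2.0):
    """Years of daily data needed for t = SR_daily * sqrt(n_days) = t_target.
    Algebraically this simplifies to (t_target / annual_sharpe) ** 2."""
    daily_sr = annual_sharpe / math.sqrt(252)
    n_days = (t_target / daily_sr) ** 2
    return n_days / 252

for sr in (0.5, 1.0, 2.0):
    print(f"annual Sharpe {sr}: {years_for_significance(sr):.0f} years")
# annual Sharpe 0.5: 16 years
# annual Sharpe 1.0: 4 years
# annual Sharpe 2.0: 1 years
```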
2. The Sortino Ratio: Penalizing Only Downside
2.1 Motivation
The Sharpe ratio treats upside and downside volatility equally. But investors do not care about upside volatility — they only fear losses. The Sortino ratio replaces total standard deviation with downside deviation.
Sortino = μexcess / σdownside, where σdownside = √(E[min(Rp − Rf, 0)²])
The Sortino ratio is the SNR computed using only the lower partial moment of the distribution. If returns are symmetric (normal), Sortino ≈ Sharpe × √2. If returns are negatively skewed (common in finance), Sortino < Sharpe × √2, meaning the downside is worse than the symmetric case would suggest. A strategy that has Sharpe = 1.5 but Sortino = 1.0 is generating its returns with significant left-tail risk.
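A quick simulation (ours; the series are treated as excess returns) illustrates the √2 relationship and how negative skew pulls the Sortino/Sharpe ratio below it:

```python
import numpy as np

def sharpe_sortino_ratio(returns):
    """Sortino / Sharpe = total std / downside deviation
    (the common mean excess return cancels in the ratio)."""
    downside = np.sqrt(np.mean(np.minimum(returns, 0.0) ** 2))
    return returns.std() / downside

rng = np.random.default_rng(42)

# Symmetric (normal) returns: ratio close to sqrt(2) ≈ 1.414
sym = rng.normal(0.0001, 0.01, 1_000_000)

# Negatively skewed returns: occasional crashes fatten the left tail
skewed = sym + np.where(rng.random(sym.size) < 0.01, -0.05, 0.0)

print(f"symmetric: {sharpe_sortino_ratio(sym):.3f}")
print(f"skewed:    {sharpe_sortino_ratio(skewed):.3f}")  # lower ratio: worse left tail
```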
3. Maximum Drawdown: A Path-Dependent Risk Measure
3.1 Definition
Maximum drawdown (MDD) is the largest peak-to-trough decline in portfolio value over the entire backtest period.
Drawdown — The decline from a previous peak in portfolio value. It measures how much an investor would have lost if they bought at the peak and sold at the trough. Maximum drawdown is the worst such experience over the entire backtest.
3.2 Why Drawdown Matters More Than Volatility
Volatility is a statistical abstraction; drawdown is a lived experience. A strategy with 15% annualized volatility sounds moderate, but if it experiences a 45% drawdown, most investors will panic-sell at the bottom. Key relationships:
- For a driftless random walk with annual volatility σ, the expected maximum drawdown over T years is approximately σ × √(πT/2) ≈ 1.25 × σ × √T.
- Drawdowns are path-dependent: two return series with identical means and variances can have very different drawdown profiles.
- Drawdown duration (how long it takes to recover) matters as much as drawdown depth.
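The path-dependence is easy to demonstrate with a hypothetical sketch (function name ours) in which two series share the same return multiset but different orderings:

```python
import numpy as np

def drawdown_stats(returns):
    """Max drawdown depth and longest time under water (in periods)."""
    equity = np.cumprod(1.0 + np.asarray(returns))
    running_max = np.maximum.accumulate(equity)
    dd = equity / running_max - 1.0
    # Longest run of consecutive periods below the prior peak
    longest = current = 0
    for d in dd:
        current = current + 1 if d < 0 else 0
        longest = max(longest, current)
    return dd.min(), longest

interleaved = [0.10, -0.05, 0.10, -0.05]  # losses spread out
clustered = [0.10, 0.10, -0.05, -0.05]    # same returns, losses back-to-back

m, d = drawdown_stats(interleaved)
print(f"interleaved: mdd={m:.4f}, duration={d}")  # mdd=-0.0500, duration=1
m, d = drawdown_stats(clustered)
print(f"clustered:   mdd={m:.4f}, duration={d}")  # mdd=-0.0975, duration=2
```

Identical means and variances, but clustering the losses nearly doubles the drawdown.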
3.3 Calmar Ratio
Calmar = Annualized Return / |Maximum Drawdown|
The Calmar ratio rewards returns and penalizes drawdowns. A Calmar above 1.0 is considered good; above 2.0 is excellent. Like the Sharpe ratio, very high Calmar ratios in a backtest should trigger suspicion of overfitting.
Maximum drawdown is a path statistic — it depends on the entire trajectory, not just the marginal distribution. Formally, it is related to the running maximum of a stochastic process. For a Brownian motion with drift μ and volatility σ, the distribution of maximum drawdown can be computed analytically, providing a benchmark to assess whether an observed drawdown is unusually large or small.
4. Complete Performance Metrics Reference
| Metric | Formula | What It Measures | Statistical Analogue |
|---|---|---|---|
| Sharpe Ratio | μ / σ | Return per unit of total risk | Signal-to-noise ratio; t-stat / √n |
| Sortino Ratio | μ / σdown | Return per unit of downside risk | SNR using lower partial moment |
| Calmar Ratio | Ann. Return / MDD | Return per unit of worst drawdown | Mean / path maximum statistic |
| Information Ratio | α / σ(α) | Active return per tracking error | t-stat of regression intercept |
| Omega Ratio | ∫ gains / ∫ losses | Probability-weighted gain/loss | Ratio of upper to lower partial moments |
| Win Rate | P(R > 0) | Fraction of profitable days | Binomial parameter p |
| Profit Factor | ∑ wins / ∑ losses | Total profit relative to total loss | Ratio of positive to negative mean |
| Skewness | E[(R − μ)³] / σ³ | Asymmetry of returns | Third standardized moment |
| Kurtosis | E[(R − μ)⁴] / σ⁴ | Tail heaviness | Fourth standardized moment |
5. The Multiple Testing Problem in Strategy Search
5.1 The Setup
A quantitative researcher tests N candidate strategies and reports the best one. Even if all N strategies have zero true alpha, the best one will have a positive in-sample Sharpe simply due to chance. The expected maximum Sharpe ratio of N zero-alpha strategies is:
E[max SR] ≈ √(2 ln N) / √n
where n is the number of return observations. Testing 1,000 strategies on 10 years of daily data: E[max SR] ≈ √(2 × 6.9) / √2520 ≈ 0.074 daily ≈ 1.18 annualized. A purely random strategy search will routinely produce annualized Sharpe ratios above 1.0.
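A Monte Carlo check of the √(2 ln N)/√n approximation (our simulation; pure-noise returns, so any "alpha" is luck — the approximation is an upper envelope, so the simulated maximum typically lands a bit below it):

```python
import numpy as np

rng = np.random.default_rng(7)
n_strategies, n_obs = 1000, 2520  # 1,000 strategies, 10 years of daily data

# Zero-alpha strategies: i.i.d. noise with no true edge
returns = rng.normal(0.0, 0.01, size=(n_strategies, n_obs))
daily_sr = returns.mean(axis=1) / returns.std(axis=1, ddof=1)
best_annual = daily_sr.max() * np.sqrt(252)

theory = np.sqrt(2 * np.log(n_strategies) / n_obs) * np.sqrt(252)
print(f"best observed: {best_annual:.2f}, theory ≈ {theory:.2f}")
```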
This is why reported backtest Sharpe ratios must be deflated by the number of strategies tested. A Sharpe of 2.0 sounds impressive until you learn it was the best of 10,000 variants. Harvey, Liu, and Zhu (2016) proposed the “haircut Sharpe ratio” that adjusts for this multiple testing bias.
5.2 Haircut Sharpe Ratios
The idea is simple: given the number of strategies tested (N), adjust the observed Sharpe ratio downward to account for selection bias. Two approaches:
Bonferroni Correction
Control the family-wise error rate (FWER) by requiring each strategy's p-value to fall below α/N. With N = 1,000 strategies at α = 0.05, a strategy is significant only if p < 0.00005. Simple and safe, but very conservative.
Benjamini-Hochberg (FDR Control)
Rather than controlling the probability of any false discovery (FWER), control the expected proportion of false discoveries among those declared significant. This is less conservative and more appropriate when you expect some strategies genuinely have alpha.
```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests


def haircut_sharpe(observed_sharpe, n_strategies, n_obs, method="bonferroni"):
    """
    Adjust Sharpe ratio significance for multiple testing.

    Parameters
    ----------
    observed_sharpe : float or array
        Annualized Sharpe ratio(s).
    n_strategies : int
        Total number of strategies tested.
    n_obs : int
        Number of daily return observations.
    method : str
        'bonferroni' or 'fdr_bh'.

    Returns
    -------
    dict
        Adjusted significance assessment.
    """
    sharpes = np.atleast_1d(observed_sharpe)

    # Convert annualized Sharpe to a t-statistic
    t_stats = sharpes * np.sqrt(n_obs / 252)

    # Compute p-values (two-sided)
    p_values = 2 * (1 - stats.t.cdf(np.abs(t_stats), df=n_obs - 1))

    if len(p_values) == n_strategies:
        # Full set of p-values available: apply the correction directly
        reject, adjusted_p, _, _ = multipletests(p_values, method=method)
    else:
        # Only the selected strategy's p-value is known. Bonferroni scales
        # it by the number of implicit tests; for the single smallest of N
        # p-values the BH adjustment is bounded by the same quantity, so
        # this is conservative for both methods.
        adjusted_p = np.minimum(p_values * n_strategies, 1.0)
        reject = adjusted_p < 0.05

    return {
        "observed_sharpe": observed_sharpe,
        "t_stat": t_stats,
        "raw_p": p_values,
        "adjusted_p": adjusted_p,
        "significant": reject,
    }


# Example: Test the best of 500 strategies
n_strategies = 500
n_obs = 252 * 10  # 10 years of daily data

# Suppose the best strategy has Sharpe = 1.8
result_bonf = haircut_sharpe(1.8, n_strategies, n_obs, "bonferroni")
result_bh = haircut_sharpe(1.8, n_strategies, n_obs, "fdr_bh")

print("Observed Sharpe: 1.8")
print(f"Raw p-value:           {result_bonf['raw_p'][0]:.6f}")
print(f"Bonferroni adjusted p: {result_bonf['adjusted_p'][0]:.6f}")
print(f"  Significant? {result_bonf['significant'][0]}")
print(f"BH (FDR) adjusted p:   {result_bh['adjusted_p'][0]:.6f}")
print(f"  Significant? {result_bh['significant'][0]}")

# Minimum Sharpe needed for significance
for n_strat in [1, 10, 100, 500, 1000, 10000]:
    adjusted_alpha = 0.05 / n_strat  # Bonferroni: need p < 0.05 / N
    min_t = stats.t.ppf(1 - adjusted_alpha / 2, df=n_obs - 1)
    min_sharpe = min_t / np.sqrt(n_obs / 252)
    print(f"N={n_strat:>5d} strategies -> min Sharpe: {min_sharpe:.2f}")
```
6. Backtest Overfitting
6.1 In-Sample vs Out-of-Sample Degradation
The hallmark of overfitting is a large gap between in-sample (IS) and out-of-sample (OOS) performance. A strategy that earns Sharpe 2.5 in-sample but Sharpe 0.3 out-of-sample has learned the noise in the training period.
This is the exact same phenomenon as training error vs test error in machine learning. The gap between IS and OOS Sharpe ratios is analogous to the generalization gap. Models with more degrees of freedom (more parameters, more rules, more tunable thresholds) will have larger gaps. The principle of parsimony (Occam’s Razor) applies with full force.
6.2 The Probability of Backtest Overfitting (PBO)
Bailey et al. (2014) introduced a formal method to estimate the probability that a backtested strategy is overfit. The procedure:
- Partition the data into S blocks.
- Generate all combinatorially possible train/test splits using S/2 blocks for training and S/2 for testing.
- For each split, find the optimal strategy in-sample and measure its OOS performance.
- PBO = fraction of splits where the IS-optimal strategy has negative OOS Sharpe.
A PBO above 50% means the backtest is more likely overfit than not.
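The steps above can be sketched compactly (our simplified implementation; the published method ranks OOS performance across strategies, while this sketch uses the negative-OOS-Sharpe criterion stated above):

```python
import itertools
import numpy as np

def pbo(strategy_returns, n_blocks=8):
    """Probability of backtest overfitting via combinatorially symmetric
    cross-validation (simplified sketch of Bailey et al., 2014).
    strategy_returns: (n_strategies, n_obs) array of returns."""
    n_obs = strategy_returns.shape[1]
    blocks = np.array_split(np.arange(n_obs), n_blocks)

    def sharpe(idx):
        r = strategy_returns[:, idx]
        return r.mean(axis=1) / r.std(axis=1, ddof=1)

    splits = list(itertools.combinations(range(n_blocks), n_blocks // 2))
    overfit = 0
    for train_ids in splits:
        test_ids = [b for b in range(n_blocks) if b not in train_ids]
        train = np.concatenate([blocks[b] for b in train_ids])
        test = np.concatenate([blocks[b] for b in test_ids])
        best = np.argmax(sharpe(train))   # IS-optimal strategy
        if sharpe(test)[best] <= 0:       # negative OOS Sharpe -> overfit
            overfit += 1
    return overfit / len(splits)

# 50 zero-alpha strategies: the IS winner is pure luck, so PBO hovers near 0.5
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.01, size=(50, 2000))
print(f"PBO on pure noise: {pbo(noise):.2f}")
```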
7. Walk-Forward Analysis: The Gold Standard
7.1 Procedure
- Divide the historical period into sequential windows.
- For each window, use the first portion for training (parameter fitting) and the remainder for testing.
- Advance the window by one step and repeat.
- Concatenate all OOS test periods to form a continuous OOS equity curve.
- Compute performance metrics only on the OOS portion.
Walk-forward analysis is the gold standard because every data point in the evaluation is truly out-of-sample: the model was trained only on data that preceded it. The resulting performance metrics are a far more honest estimate of what you would have earned in live trading. If the walk-forward Sharpe is 0.4 and the full-sample backtest Sharpe is 1.8, the strategy is overfit and the 0.4 figure is the honest one.
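The procedure can be sketched for a hypothetical momentum family in which the lookback is refit in-sample each window (all names and parameter values here are illustrative assumptions):

```python
import numpy as np
import pandas as pd

def walk_forward_sharpe(returns, lookbacks=(20, 50, 100),
                        train_len=504, test_len=126):
    """Walk-forward sketch: each window picks the best lookback in-sample,
    trades it on the next unseen window, and all OOS returns are
    concatenated into one equity curve."""
    def strat(r, lb):
        signal = r.rolling(lb).mean().shift(1)  # shift avoids look-ahead
        return (signal > 0).astype(float) * r

    def sharpe(r):
        return r.mean() / (r.std() + 1e-12)

    oos_parts, start = [], 0
    while start + train_len + test_len <= len(returns):
        train = returns.iloc[start:start + train_len]
        best_lb = max(lookbacks, key=lambda lb: sharpe(strat(train, lb)))
        # Apply the chosen lookback; its signal may use trailing train data
        # but never future data
        window = returns.iloc[start:start + train_len + test_len]
        oos_parts.append(strat(window, best_lb).iloc[-test_len:])
        start += test_len

    oos = pd.concat(oos_parts)
    return oos.mean() / oos.std() * np.sqrt(252)

rng = np.random.default_rng(1)
noise = pd.Series(rng.normal(0.0003, 0.01, 2520))
print(f"Walk-forward OOS Sharpe on noise: {walk_forward_sharpe(noise):.2f}")
```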
8. Common Backtesting Pitfalls
8.1 Look-Ahead Bias
Using information that was not available at the time of the trading decision. Examples:
- Using adjusted closing prices that incorporate future stock splits or dividends.
- Filtering the stock universe using future information (e.g., excluding stocks that later go bankrupt).
- Using fundamental data before its actual release date (earnings are reported weeks after quarter-end).
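A toy demonstration (ours) of how a single missing shift manufactures alpha from pure noise:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
r = pd.Series(rng.normal(0.0, 0.01, 5000))  # pure noise: no real alpha exists

signal = (r > 0).astype(float)

# Look-ahead bug: trading today's return on a signal computed from it
buggy = signal * r

# Correct: the signal is only known after the close, so lag it one day
fixed = signal.shift(1) * r

ann = lambda x: x.mean() / x.std() * np.sqrt(252)
print(f"buggy Sharpe: {ann(buggy):.2f}")   # absurdly high
print(f"fixed Sharpe: {ann(fixed):.2f}")   # indistinguishable from zero
```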
8.2 Survivorship Bias
Survivorship Bias — The error of analyzing only assets that survived to the present, ignoring those that were delisted, went bankrupt, or were acquired. This inflates historical performance because the worst performers are excluded.
Example: If you backtest a strategy on the current S&P 500 constituents going back 20 years, you are only testing on companies that survived 20 years. Companies that went bankrupt (Lehman Brothers, Enron, WorldCom) are excluded, removing the worst outcomes and biasing returns upward.
Survivorship bias is a form of selection bias — the sample is not representative of the population. In survival analysis, this is analogous to left truncation: you only observe individuals who survived long enough to enter the study. The solution is the same in both fields: use a survivorship-bias-free database that includes delisted securities.
8.3 Transaction Cost Neglect
A strategy that trades frequently looks great without costs but may be deeply unprofitable after commissions, slippage, and market impact. Key cost components:
| Cost Component | Typical Magnitude | Affected By |
|---|---|---|
| Commissions | $0.001 – $0.005 per share | Broker, volume |
| Bid-ask spread | 0.01% – 0.50% per trade | Liquidity, market cap |
| Market impact | 0.05% – 1%+ per trade | Order size relative to volume |
| Slippage | 0.01% – 0.10% per trade | Execution speed, volatility |
8.4 Overfitting to a Specific Period
A strategy optimized on 2010–2020 may have learned patterns specific to the post-GFC low-rate, low-volatility, QE-driven bull market. It will likely fail in a different regime (rising rates, high inflation, bear market). This is the non-stationarity problem from Module 16 applied to strategy design.
The most insidious form of overfitting is implicit: the researcher adjusts the strategy based on visual inspection of the equity curve, adds filters to avoid specific drawdowns, or changes parameters until the backtest “looks right.” Each adjustment is an implicit test that inflates the total number of strategies tried, even if only one is formally reported. Always count your degrees of freedom honestly.
9. Python: Complete Momentum Strategy Backtest
```python
import numpy as np
import pandas as pd
import yfinance as yf
from scipy import stats

# ──────────────────────────────────────────────
# Step 1: Download data
# ──────────────────────────────────────────────
# auto_adjust=False keeps the "Adj Close" column in recent yfinance versions
spy = yf.download("SPY", start="2005-01-01", end="2023-12-31", auto_adjust=False)
prices = spy["Adj Close"].squeeze()  # ensure a Series, not a 1-column frame
returns = prices.pct_change().dropna()

# ──────────────────────────────────────────────
# Step 2: Simple momentum strategy
# Long SPY when 50-day return > 0, else flat (cash)
# ──────────────────────────────────────────────
lookback = 50
momentum_signal = returns.rolling(lookback).mean().shift(1)  # shifted to avoid look-ahead
position = (momentum_signal > 0).astype(float)

# Strategy returns (assume 5 bps transaction cost per trade)
trades = position.diff().abs()
tc_per_trade = 0.0005
strategy_returns = (position * returns - trades * tc_per_trade).dropna()

# Benchmark: buy and hold
benchmark_returns = returns.loc[strategy_returns.index]

# ──────────────────────────────────────────────
# Step 3: Compute all performance metrics
# ──────────────────────────────────────────────
def compute_metrics(returns, name="Strategy", rf_annual=0.02):
    """Compute comprehensive backtest metrics."""
    rf_daily = rf_annual / 252
    excess = returns - rf_daily
    n = len(returns)

    # Basic stats
    ann_return = returns.mean() * 252
    ann_vol = returns.std() * np.sqrt(252)

    # Sharpe ratio
    sharpe = excess.mean() / excess.std() * np.sqrt(252)

    # Sortino ratio (downside deviation from negative excess returns only)
    downside_vol = np.sqrt((excess.clip(upper=0) ** 2).mean()) * np.sqrt(252)
    sortino = excess.mean() * 252 / downside_vol if downside_vol > 0 else np.nan

    # Maximum drawdown
    cum_returns = (1 + returns).cumprod()
    running_max = cum_returns.cummax()
    drawdown = (cum_returns - running_max) / running_max
    max_dd = drawdown.min()

    # Calmar ratio
    calmar = ann_return / abs(max_dd) if max_dd != 0 else np.nan

    # Win rate
    win_rate = (returns > 0).mean()

    # Statistical significance
    t_stat = excess.mean() / (excess.std() / np.sqrt(n))
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 1))

    # Skewness and kurtosis
    skew = returns.skew()
    kurt = returns.kurtosis()

    # Maximum drawdown duration
    dd_duration = 0
    max_dd_duration = 0
    for d in drawdown:
        if d < 0:
            dd_duration += 1
            max_dd_duration = max(max_dd_duration, dd_duration)
        else:
            dd_duration = 0

    return pd.Series({
        "Ann. Return": f"{ann_return:.2%}",
        "Ann. Volatility": f"{ann_vol:.2%}",
        "Sharpe Ratio": f"{sharpe:.3f}",
        "Sortino Ratio": f"{sortino:.3f}",
        "Max Drawdown": f"{max_dd:.2%}",
        "Calmar Ratio": f"{calmar:.3f}",
        "Win Rate": f"{win_rate:.2%}",
        "Skewness": f"{skew:.3f}",
        "Kurtosis": f"{kurt:.3f}",
        "t-statistic": f"{t_stat:.3f}",
        "p-value": f"{p_value:.4f}",
        "Max DD Duration": f"{max_dd_duration} days",
    }, name=name)


# Compare strategy vs benchmark
metrics = pd.DataFrame({
    "Momentum": compute_metrics(strategy_returns, "Momentum"),
    "Buy & Hold": compute_metrics(benchmark_returns, "Buy & Hold"),
})
print(metrics)

# ──────────────────────────────────────────────
# Step 4: Apply multiple testing correction
# Suppose we tested 20 lookback periods (10, 20, ..., 200)
# ──────────────────────────────────────────────
from statsmodels.stats.multitest import multipletests

lookbacks = range(10, 210, 10)
sharpes = []
pvals = []
for lb in lookbacks:
    sig = returns.rolling(lb).mean().shift(1)
    pos = (sig > 0).astype(float)
    tr = pos.diff().abs()
    strat_ret = (pos * returns - tr * tc_per_trade).dropna()
    excess = strat_ret - 0.02 / 252
    sr = excess.mean() / excess.std() * np.sqrt(252)
    t = excess.mean() / (excess.std() / np.sqrt(len(excess)))
    p = 2 * (1 - stats.t.cdf(abs(t), df=len(excess) - 1))
    sharpes.append(sr)
    pvals.append(p)

# Apply BH correction across all 20 tested variants
reject_bh, adjusted_p, _, _ = multipletests(pvals, method="fdr_bh")

results = pd.DataFrame({
    "Lookback": list(lookbacks),
    "Sharpe": sharpes,
    "Raw p-value": pvals,
    "BH adjusted p": adjusted_p,
    "Significant (BH)": reject_bh,
})
print("\n=== Multiple Testing Correction ===")
print(results.to_string(index=False))
```
10. Chapter Summary
| Statistics Concept | Backtesting Application | Key Warning |
|---|---|---|
| Signal-to-noise ratio | Sharpe ratio | Requires years to confirm statistically |
| Lower partial moment | Sortino ratio | Differentiates upside from downside risk |
| Path statistic / running max | Maximum drawdown | Captures investor experience, not just distribution |
| Multiple hypothesis testing | Strategy selection correction | Testing 1000 strategies guarantees false discoveries |
| Training vs test error gap | IS vs OOS degradation | Large gap = overfitting |
| Selection bias | Survivorship bias | Use survivorship-free databases |
| Temporal cross-validation | Walk-forward analysis | The only honest evaluation method |
The single most important question to ask about any backtest: how many strategies were tested to find this one? The answer determines whether the reported Sharpe ratio is genuine alpha or statistical noise. A Sharpe of 1.5 from the first strategy you tested is profoundly different from a Sharpe of 1.5 that was the best of 10,000 candidates. The statistics are identical; only the interpretation changes.