Module 18: Backtesting & Strategy Evaluation
Judging investment strategies with the rigor of statistical hypothesis testing
Introduction: Judging Strategies Like a Statistician
Backtesting is the process of evaluating a trading strategy on historical data. It is the financial equivalent of out-of-sample model validation — but with far more ways to go wrong. A beautifully backtested strategy that earned 40% per year with a Sharpe of 3.0 might be nothing more than an artifact of overfitting, data snooping, and biased simulation. This module equips you to evaluate strategies with the same rigor you would apply to a statistical study.
Backtesting is hypothesis testing applied to investment strategies. The null hypothesis is that the strategy has no alpha (excess risk-adjusted return). The test statistic is the Sharpe ratio. The critical question is always: is the observed performance statistically distinguishable from luck? Every concept in this module maps to a familiar statistical idea.
1. The Sharpe Ratio: Finance's Signal-to-Noise Ratio
1.1 Definition
SR = E[Rp − Rf] / σ(Rp − Rf) = μexcess / σexcess
where Rp is the portfolio return, Rf is the risk-free rate, μexcess is the mean excess return, and σexcess is the standard deviation of excess returns.
The Sharpe ratio is literally the signal-to-noise ratio (SNR) of the excess return process. It is also proportional to the t-statistic for testing whether the mean excess return is significantly different from zero: t = SR × √n, where n is the number of observations. A Sharpe ratio of 0.5 per year with 10 years of data gives t = 0.5 × √10 ≈ 1.58, which is not significant at the 5% level.
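The Sharpe-to-t-statistic mapping can be sketched in a few lines (a minimal illustration; the function name is ours):

```python
import math

def sharpe_t_stat(annual_sharpe: float, years: float) -> float:
    """t-statistic for H0: true Sharpe = 0, given an annualized Sharpe
    ratio and the number of years of data (t = SR x sqrt(n))."""
    return annual_sharpe * math.sqrt(years)

# 10 years at an annual Sharpe of 0.5: below the ~1.96 critical value
print(round(sharpe_t_stat(0.5, 10), 2))  # 1.58
```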
1.2 Interpreting the Sharpe Ratio
| Annualized Sharpe | Quality | Context |
|---|---|---|
| < 0.5 | Poor | Not distinguishable from random noise |
| 0.5 – 1.0 | Acceptable | Typical for long-only equity strategies |
| 1.0 – 2.0 | Good | Achievable by well-designed quantitative strategies |
| 2.0 – 3.0 | Excellent | Top-tier hedge funds; warrants scrutiny for overfitting |
| > 3.0 | Suspicious | Very likely backtest artifact; almost never achieved live |
1.3 Annualizing the Sharpe Ratio
SRannual = SRdaily × √252
This scaling assumes returns are independently and identically distributed (IID). The √252 factor comes from the fact that the variance of a sum of n IID variables scales as n, so its standard deviation scales as √n, while its mean scales as n. The ratio of mean to standard deviation therefore scales as √n.
The IID assumption behind √252 scaling is wrong for most strategies. If returns are positively autocorrelated (momentum strategies), the naive annualized Sharpe is overstated; if they are negatively autocorrelated (mean-reversion strategies), it is understated. The correction divides by a factor reflecting the autocorrelation structure: SRannual = SRdaily × √252 / √(1 + 2∑k=1…251 (1 − k/252)ρk), where ρk is the lag-k autocorrelation of daily returns.
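As a sketch of the correction (sample autocorrelations estimated up to a fixed lag; the function name and the AR(1) example are ours), compare naive and adjusted annualization on positively autocorrelated returns:

```python
import numpy as np

def annualize_sharpe(daily_returns, max_lag=20):
    """Naive sqrt(252) annualization vs the autocorrelation-adjusted
    version, which divides by sqrt(1 + 2 * sum((1 - k/252) * rho_k))."""
    r = np.asarray(daily_returns, dtype=float)
    daily_sr = r.mean() / r.std(ddof=1)
    naive = daily_sr * np.sqrt(252)

    # Sample autocorrelations rho_1 .. rho_max_lag
    centered = r - r.mean()
    denom = np.sum(centered**2)
    ks = np.arange(1, max_lag + 1)
    rhos = np.array([np.sum(centered[k:] * centered[:-k]) / denom for k in ks])

    correction = 1 + 2 * np.sum((1 - ks / 252) * rhos)
    adjusted = naive / np.sqrt(max(correction, 1e-12))
    return naive, adjusted

# AR(1) daily returns with positive autocorrelation (momentum-like)
rng = np.random.default_rng(0)
eps = rng.normal(0.0005, 0.01, 252 * 40)
r = np.empty_like(eps)
r[0] = eps[0]
for t in range(1, len(eps)):
    r[t] = 0.2 * r[t - 1] + eps[t]

naive, adjusted = annualize_sharpe(r)
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")  # adjusted < naive
```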
1.4 Statistical Significance of the Sharpe Ratio
Under the null hypothesis that the true Sharpe ratio is zero, the estimated Sharpe ratio follows approximately:
SE(SR̂) ≈ √((1 + SR²/2) / n)
Simplified: SE(SR̂) ≈ 1/√n (for small SR)
This means you need many years of data to statistically confirm even a genuine Sharpe ratio. With daily data:
| True Annual Sharpe | Daily Sharpe | Years for t=2 (p<0.05) |
|---|---|---|
| 0.5 | 0.031 | 16 years |
| 1.0 | 0.063 | 4 years |
| 2.0 | 0.126 | 1 year |
A strategy with an annualized Sharpe of 0.5 requires 16 years of daily data to be statistically significant. Most backtests span 5–10 years. This means that for many realistic Sharpe ratios, we literally do not have enough data to distinguish signal from noise. This sobering fact underlies the entire difficulty of strategy evaluation.
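Under the t = SRdaily × √n approximation, the required history collapses to (t_target / SRannual)² years, which reproduces the table above (a back-of-the-envelope sketch; the function name is ours):

```python
import math

def years_for_significance(annual_sharpe, t_target=2.0):
    """Years of daily data needed for t = SR_daily * sqrt(n_days) = t_target.
    Algebraically this simplifies to (t_target / annual_sharpe) ** 2."""
    daily_sr = annual_sharpe / math.sqrt(252)
    n_days = (t_target / daily_sr) ** 2
    return n_days / 252

for sr in (0.5, 1.0, 2.0):
    print(f"annual Sharpe {sr}: {years_for_significance(sr):.0f} years")
# annual Sharpe 0.5: 16 years
# annual Sharpe 1.0: 4 years
# annual Sharpe 2.0: 1 years
```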
2. The Sortino Ratio: Penalizing Only Downside
2.1 Motivation
The Sharpe ratio treats upside and downside volatility equally. But investors do not care about upside volatility — they only fear losses. The Sortino ratio replaces total standard deviation with downside deviation.
Sortino = μexcess / σdownside, where σdownside = √(E[min(Rp − Rf, 0)²])
The Sortino ratio is the SNR computed using only the lower partial moment of the distribution. If returns are symmetric (normal), Sortino ≈ Sharpe × √2. If returns are negatively skewed (common in finance), Sortino < Sharpe × √2, meaning the downside is worse than the symmetric case would suggest. A strategy that has Sharpe = 1.5 but Sortino = 1.0 is generating its returns with significant left-tail risk.
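A quick simulation (ours; the series are treated as excess returns) illustrates the √2 relationship and how negative skew pulls the Sortino/Sharpe ratio below it:

```python
import numpy as np

def sharpe_sortino_ratio(returns):
    """Sortino / Sharpe = total std / downside deviation
    (the common mean excess return cancels in the ratio)."""
    downside = np.sqrt(np.mean(np.minimum(returns, 0.0) ** 2))
    return returns.std() / downside

rng = np.random.default_rng(42)

# Symmetric (normal) returns: ratio close to sqrt(2) ≈ 1.414
sym = rng.normal(0.0001, 0.01, 1_000_000)

# Negatively skewed returns: occasional crashes fatten the left tail
skewed = sym + np.where(rng.random(sym.size) < 0.01, -0.05, 0.0)

print(f"symmetric: {sharpe_sortino_ratio(sym):.3f}")
print(f"skewed:    {sharpe_sortino_ratio(skewed):.3f}")  # lower ratio: worse left tail
```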
3. Maximum Drawdown: A Path-Dependent Risk Measure
3.1 Definition
Maximum drawdown (MDD) is the largest peak-to-trough decline in portfolio value over the entire backtest period.
Drawdown — The decline from a previous peak in portfolio value. It measures how much an investor would have lost if they bought at the peak and sold at the trough. Maximum drawdown is the worst such experience over the entire backtest.
3.2 Why Drawdown Matters More Than Volatility
Volatility is a statistical abstraction; drawdown is a lived experience. A strategy with 15% annualized volatility sounds moderate, but if it experiences a 45% drawdown, most investors will panic-sell at the bottom. Key relationships:
- For a driftless random walk with annual volatility σ, the expected maximum drawdown over T years is approximately σ × √(πT/2) ≈ 1.25 × σ × √T.
- Drawdowns are path-dependent: two return series with identical means and variances can have very different drawdown profiles.
- Drawdown duration (how long it takes to recover) matters as much as drawdown depth.
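The path-dependence is easy to demonstrate with a hypothetical sketch (function name ours) in which two series share the same return multiset but different orderings:

```python
import numpy as np

def drawdown_stats(returns):
    """Max drawdown depth and longest time under water (in periods)."""
    equity = np.cumprod(1.0 + np.asarray(returns))
    running_max = np.maximum.accumulate(equity)
    dd = equity / running_max - 1.0
    # Longest run of consecutive periods below the prior peak
    longest = current = 0
    for d in dd:
        current = current + 1 if d < 0 else 0
        longest = max(longest, current)
    return dd.min(), longest

interleaved = [0.10, -0.05, 0.10, -0.05]  # losses spread out
clustered = [0.10, 0.10, -0.05, -0.05]    # same returns, losses back-to-back

m, d = drawdown_stats(interleaved)
print(f"interleaved: mdd={m:.4f}, duration={d}")  # mdd=-0.0500, duration=1
m, d = drawdown_stats(clustered)
print(f"clustered:   mdd={m:.4f}, duration={d}")  # mdd=-0.0975, duration=2
```

Identical means and variances, but clustering the losses nearly doubles the drawdown.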
3.3 Calmar Ratio
Calmar = Annualized Return / |Maximum Drawdown|
The Calmar ratio rewards returns and penalizes drawdowns. A Calmar above 1.0 is considered good; above 2.0 is excellent. Like the Sharpe ratio, very high Calmar ratios in a backtest should trigger suspicion of overfitting.
Maximum drawdown is a path statistic — it depends on the entire trajectory, not just the marginal distribution. Formally, it is related to the running maximum of a stochastic process. For a Brownian motion with drift μ and volatility σ, the distribution of maximum drawdown can be computed analytically, providing a benchmark to assess whether an observed drawdown is unusually large or small.
4. Complete Performance Metrics Reference
| Metric | Formula | What It Measures | Statistical Analogue |
|---|---|---|---|
| Sharpe Ratio | μ / σ | Return per unit of total risk | Signal-to-noise ratio; t-stat / √n |
| Sortino Ratio | μ / σdown | Return per unit of downside risk | SNR using lower partial moment |
| Calmar Ratio | Ann. Return / MDD | Return per unit of worst drawdown | Mean / path maximum statistic |
| Information Ratio | α / σ(α) | Active return per tracking error | t-stat of regression intercept |
| Omega Ratio | ∫ gains / ∫ losses | Probability-weighted gain/loss | Ratio of upper to lower partial moments |
| Win Rate | P(R > 0) | Fraction of profitable days | Binomial parameter p |
| Profit Factor | ∑ wins / ∑ losses | Total profit relative to total loss | Ratio of positive to negative mean |
| Skewness | E[(R − μ)³] / σ³ | Asymmetry of returns | Third standardized moment |
| Kurtosis | E[(R − μ)⁴] / σ⁴ | Tail heaviness | Fourth standardized moment |
5. The Multiple Testing Problem in Strategy Search
5.1 The Setup
A quantitative researcher tests N candidate strategies and reports the best one. Even if all N strategies have zero true alpha, the best one will have a positive in-sample Sharpe simply due to chance. The expected maximum Sharpe ratio of N zero-alpha strategies is:
E[max SR] ≈ √(2 ln N) / √n
where n is the number of return observations. Testing 1,000 strategies on 10 years of daily data: E[max SR] ≈ √(2 × 6.9) / √2520 ≈ 0.074 daily ≈ 1.18 annualized. A purely random strategy search will routinely produce annualized Sharpe ratios above 1.0.
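A Monte Carlo check of the √(2 ln N)/√n approximation (our simulation; pure-noise returns, so any "alpha" is luck — the approximation is an upper envelope, so the simulated maximum typically lands a bit below it):

```python
import numpy as np

rng = np.random.default_rng(7)
n_strategies, n_obs = 1000, 2520  # 1,000 strategies, 10 years of daily data

# Zero-alpha strategies: i.i.d. noise with no true edge
returns = rng.normal(0.0, 0.01, size=(n_strategies, n_obs))
daily_sr = returns.mean(axis=1) / returns.std(axis=1, ddof=1)
best_annual = daily_sr.max() * np.sqrt(252)

theory = np.sqrt(2 * np.log(n_strategies) / n_obs) * np.sqrt(252)
print(f"best observed: {best_annual:.2f}, theory ≈ {theory:.2f}")
```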
This is why reported backtest Sharpe ratios must be deflated by the number of strategies tested. A Sharpe of 2.0 sounds impressive until you learn it was the best of 10,000 variants. Harvey, Liu, and Zhu (2016) proposed the “haircut Sharpe ratio” that adjusts for this multiple testing bias.
5.2 Haircut Sharpe Ratios
The idea is simple: given the number of strategies tested (N), adjust the observed Sharpe ratio downward to account for selection bias. Two approaches:
Bonferroni Correction
Control the family-wise error rate (FWER) by requiring each strategy's p-value to fall below α/N. With N = 1,000 strategies at α = 0.05, a strategy is significant only if p < 0.00005. Simple and safe, but very conservative.
Benjamini-Hochberg (FDR Control)
Rather than controlling the probability of any false discovery (FWER), control the expected proportion of false discoveries among those declared significant. This is less conservative and more appropriate when you expect some strategies genuinely have alpha.
```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests


def haircut_sharpe(observed_sharpe, n_strategies, n_obs, method="bonferroni"):
    """
    Adjust Sharpe ratio significance for multiple testing.

    Parameters
    ----------
    observed_sharpe : float or array
        Annualized Sharpe ratio(s).
    n_strategies : int
        Total number of strategies tested.
    n_obs : int
        Number of daily return observations.
    method : str
        'bonferroni' or 'fdr_bh'.

    Returns
    -------
    dict
        Adjusted significance assessment.
    """
    sharpes = np.atleast_1d(observed_sharpe)

    # Convert annualized Sharpe to a t-statistic
    t_stats = sharpes * np.sqrt(n_obs / 252)

    # Compute p-values (two-sided)
    p_values = 2 * (1 - stats.t.cdf(np.abs(t_stats), df=n_obs - 1))

    if len(p_values) == n_strategies:
        # Full set of p-values available: apply the correction directly
        reject, adjusted_p, _, _ = multipletests(p_values, method=method)
    else:
        # Only the selected strategy's p-value is known. Bonferroni scales
        # it by the number of implicit tests; for the single smallest of N
        # p-values the BH adjustment is bounded by the same quantity, so
        # this is conservative for both methods.
        adjusted_p = np.minimum(p_values * n_strategies, 1.0)
        reject = adjusted_p < 0.05

    return {
        "observed_sharpe": observed_sharpe,
        "t_stat": t_stats,
        "raw_p": p_values,
        "adjusted_p": adjusted_p,
        "significant": reject,
    }


# Example: Test the best of 500 strategies
n_strategies = 500
n_obs = 252 * 10  # 10 years of daily data

# Suppose the best strategy has Sharpe = 1.8
result_bonf = haircut_sharpe(1.8, n_strategies, n_obs, "bonferroni")
result_bh = haircut_sharpe(1.8, n_strategies, n_obs, "fdr_bh")

print("Observed Sharpe: 1.8")
print(f"Raw p-value:           {result_bonf['raw_p'][0]:.6f}")
print(f"Bonferroni adjusted p: {result_bonf['adjusted_p'][0]:.6f}")
print(f"  Significant? {result_bonf['significant'][0]}")
print(f"BH (FDR) adjusted p:   {result_bh['adjusted_p'][0]:.6f}")
print(f"  Significant? {result_bh['significant'][0]}")

# Minimum Sharpe needed for significance
for n_strat in [1, 10, 100, 500, 1000, 10000]:
    adjusted_alpha = 0.05 / n_strat  # Bonferroni: need p < 0.05 / N
    min_t = stats.t.ppf(1 - adjusted_alpha / 2, df=n_obs - 1)
    min_sharpe = min_t / np.sqrt(n_obs / 252)
    print(f"N={n_strat:>5d} strategies -> min Sharpe: {min_sharpe:.2f}")
```
6. Backtest Overfitting
6.1 In-Sample vs Out-of-Sample Degradation
The hallmark of overfitting is a large gap between in-sample (IS) and out-of-sample (OOS) performance. A strategy that earns Sharpe 2.5 in-sample but Sharpe 0.3 out-of-sample has learned the noise in the training period.
This is the exact same phenomenon as training error vs test error in machine learning. The gap between IS and OOS Sharpe ratios is analogous to the generalization gap. Models with more degrees of freedom (more parameters, more rules, more tunable thresholds) will have larger gaps. The principle of parsimony (Occam’s Razor) applies with full force.
6.2 The Probability of Backtest Overfitting (PBO)
Bailey et al. (2014) introduced a formal method to estimate the probability that a backtested strategy is overfit. The procedure:
- Partition the data into S blocks.
- Generate all combinatorially possible train/test splits using S/2 blocks for training and S/2 for testing.
- For each split, find the optimal strategy in-sample and measure its OOS performance.
- PBO = fraction of splits where the IS-optimal strategy has negative OOS Sharpe.
A PBO above 50% means the backtest is more likely overfit than not.
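The steps above can be sketched compactly (our simplified implementation; the published method ranks OOS performance across strategies, while this sketch uses the negative-OOS-Sharpe criterion stated above):

```python
import itertools
import numpy as np

def pbo(strategy_returns, n_blocks=8):
    """Probability of backtest overfitting via combinatorially symmetric
    cross-validation (simplified sketch of Bailey et al., 2014).
    strategy_returns: (n_strategies, n_obs) array of returns."""
    n_obs = strategy_returns.shape[1]
    blocks = np.array_split(np.arange(n_obs), n_blocks)

    def sharpe(idx):
        r = strategy_returns[:, idx]
        return r.mean(axis=1) / r.std(axis=1, ddof=1)

    splits = list(itertools.combinations(range(n_blocks), n_blocks // 2))
    overfit = 0
    for train_ids in splits:
        test_ids = [b for b in range(n_blocks) if b not in train_ids]
        train = np.concatenate([blocks[b] for b in train_ids])
        test = np.concatenate([blocks[b] for b in test_ids])
        best = np.argmax(sharpe(train))   # IS-optimal strategy
        if sharpe(test)[best] <= 0:       # negative OOS Sharpe -> overfit
            overfit += 1
    return overfit / len(splits)

# 50 zero-alpha strategies: the IS winner is pure luck, so PBO hovers near 0.5
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.01, size=(50, 2000))
print(f"PBO on pure noise: {pbo(noise):.2f}")
```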
7. Walk-Forward Analysis: The Gold Standard
7.1 Procedure
- Divide the historical period into sequential windows.
- For each window, use the first portion for training (parameter fitting) and the remainder for testing.
- Advance the window by one step and repeat.
- Concatenate all OOS test periods to form a continuous OOS equity curve.
- Compute performance metrics only on the OOS portion.
Walk-forward analysis is the gold standard because every data point in the evaluation is truly out-of-sample: the model was trained only on data that preceded it. The resulting performance metrics are a far more honest estimate of what you would have earned in live trading. If the walk-forward Sharpe is 0.4 and the full-sample backtest Sharpe is 1.8, the strategy is overfit and the 0.4 figure is the honest one.
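The procedure can be sketched for a hypothetical momentum family in which the lookback is refit in-sample each window (all names and parameter values here are illustrative assumptions):

```python
import numpy as np
import pandas as pd

def walk_forward_sharpe(returns, lookbacks=(20, 50, 100),
                        train_len=504, test_len=126):
    """Walk-forward sketch: each window picks the best lookback in-sample,
    trades it on the next unseen window, and all OOS returns are
    concatenated into one equity curve."""
    def strat(r, lb):
        signal = r.rolling(lb).mean().shift(1)  # shift avoids look-ahead
        return (signal > 0).astype(float) * r

    def sharpe(r):
        return r.mean() / (r.std() + 1e-12)

    oos_parts, start = [], 0
    while start + train_len + test_len <= len(returns):
        train = returns.iloc[start:start + train_len]
        best_lb = max(lookbacks, key=lambda lb: sharpe(strat(train, lb)))
        # Apply the chosen lookback; its signal may use trailing train data
        # but never future data
        window = returns.iloc[start:start + train_len + test_len]
        oos_parts.append(strat(window, best_lb).iloc[-test_len:])
        start += test_len

    oos = pd.concat(oos_parts)
    return oos.mean() / oos.std() * np.sqrt(252)

rng = np.random.default_rng(1)
noise = pd.Series(rng.normal(0.0003, 0.01, 2520))
print(f"Walk-forward OOS Sharpe on noise: {walk_forward_sharpe(noise):.2f}")
```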
8. Common Backtesting Pitfalls
8.1 Look-Ahead Bias
Using information that was not available at the time of the trading decision. Examples:
- Using adjusted closing prices that incorporate future stock splits or dividends.
- Filtering the stock universe using future information (e.g., excluding stocks that later go bankrupt).
- Using fundamental data before its actual release date (earnings are reported weeks after quarter-end).
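A toy demonstration (ours) of how a single missing shift manufactures alpha from pure noise:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
r = pd.Series(rng.normal(0.0, 0.01, 5000))  # pure noise: no real alpha exists

signal = (r > 0).astype(float)

# Look-ahead bug: trading today's return on a signal computed from it
buggy = signal * r

# Correct: the signal is only known after the close, so lag it one day
fixed = signal.shift(1) * r

ann = lambda x: x.mean() / x.std() * np.sqrt(252)
print(f"buggy Sharpe: {ann(buggy):.2f}")   # absurdly high
print(f"fixed Sharpe: {ann(fixed):.2f}")   # indistinguishable from zero
```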
8.2 Survivorship Bias
Survivorship Bias — The error of analyzing only assets that survived to the present, ignoring those that were delisted, went bankrupt, or were acquired. This inflates historical performance because the worst performers are excluded.
Example: If you backtest a strategy on the current S&P 500 constituents going back 20 years, you are only testing on companies that survived 20 years. Companies that went bankrupt (Lehman Brothers, Enron, WorldCom) are excluded, removing the worst outcomes and biasing returns upward.
Survivorship bias is a form of selection bias — the sample is not representative of the population. In survival analysis, this is analogous to left truncation: you only observe individuals who survived long enough to enter the study. The solution is the same in both fields: use a survivorship-bias-free database that includes delisted securities.
8.3 Transaction Cost Neglect
A strategy that trades frequently looks great without costs but may be deeply unprofitable after commissions, slippage, and market impact. Key cost components:
| Cost Component | Typical Magnitude | Affected By |
|---|---|---|
| Commissions | $0.001 – $0.005 per share | Broker, volume |
| Bid-ask spread | 0.01% – 0.50% per trade | Liquidity, market cap |
| Market impact | 0.05% – 1%+ per trade | Order size relative to volume |
| Slippage | 0.01% – 0.10% per trade | Execution speed, volatility |
8.4 Overfitting to a Specific Period
A strategy optimized on 2010–2020 may have learned patterns specific to the post-GFC low-rate, low-volatility, QE-driven bull market. It will likely fail in a different regime (rising rates, high inflation, bear market). This is the non-stationarity problem from Module 16 applied to strategy design.
The most insidious form of overfitting is implicit: the researcher adjusts the strategy based on visual inspection of the equity curve, adds filters to avoid specific drawdowns, or changes parameters until the backtest “looks right.” Each adjustment is an implicit test that inflates the total number of strategies tried, even if only one is formally reported. Always count your degrees of freedom honestly.
9. Python: Complete Momentum Strategy Backtest
```python
import numpy as np
import pandas as pd
import yfinance as yf
from scipy import stats

# ──────────────────────────────────────────────
# Step 1: Download data
# ──────────────────────────────────────────────
# auto_adjust=False keeps the "Adj Close" column in recent yfinance versions
spy = yf.download("SPY", start="2005-01-01", end="2023-12-31", auto_adjust=False)
prices = spy["Adj Close"].squeeze()  # ensure a Series, not a 1-column frame
returns = prices.pct_change().dropna()

# ──────────────────────────────────────────────
# Step 2: Simple momentum strategy
# Long SPY when 50-day return > 0, else flat (cash)
# ──────────────────────────────────────────────
lookback = 50
momentum_signal = returns.rolling(lookback).mean().shift(1)  # shifted to avoid look-ahead
position = (momentum_signal > 0).astype(float)

# Strategy returns (assume 5 bps transaction cost per trade)
trades = position.diff().abs()
tc_per_trade = 0.0005
strategy_returns = (position * returns - trades * tc_per_trade).dropna()

# Benchmark: buy and hold
benchmark_returns = returns.loc[strategy_returns.index]

# ──────────────────────────────────────────────
# Step 3: Compute all performance metrics
# ──────────────────────────────────────────────
def compute_metrics(returns, name="Strategy", rf_annual=0.02):
    """Compute comprehensive backtest metrics."""
    rf_daily = rf_annual / 252
    excess = returns - rf_daily
    n = len(returns)

    # Basic stats
    ann_return = returns.mean() * 252
    ann_vol = returns.std() * np.sqrt(252)

    # Sharpe ratio
    sharpe = excess.mean() / excess.std() * np.sqrt(252)

    # Sortino ratio (downside deviation from negative excess returns only)
    downside_vol = np.sqrt((excess.clip(upper=0) ** 2).mean()) * np.sqrt(252)
    sortino = excess.mean() * 252 / downside_vol if downside_vol > 0 else np.nan

    # Maximum drawdown
    cum_returns = (1 + returns).cumprod()
    running_max = cum_returns.cummax()
    drawdown = (cum_returns - running_max) / running_max
    max_dd = drawdown.min()

    # Calmar ratio
    calmar = ann_return / abs(max_dd) if max_dd != 0 else np.nan

    # Win rate
    win_rate = (returns > 0).mean()

    # Statistical significance
    t_stat = excess.mean() / (excess.std() / np.sqrt(n))
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 1))

    # Skewness and kurtosis
    skew = returns.skew()
    kurt = returns.kurtosis()

    # Maximum drawdown duration
    dd_duration = 0
    max_dd_duration = 0
    for d in drawdown:
        if d < 0:
            dd_duration += 1
            max_dd_duration = max(max_dd_duration, dd_duration)
        else:
            dd_duration = 0

    return pd.Series({
        "Ann. Return": f"{ann_return:.2%}",
        "Ann. Volatility": f"{ann_vol:.2%}",
        "Sharpe Ratio": f"{sharpe:.3f}",
        "Sortino Ratio": f"{sortino:.3f}",
        "Max Drawdown": f"{max_dd:.2%}",
        "Calmar Ratio": f"{calmar:.3f}",
        "Win Rate": f"{win_rate:.2%}",
        "Skewness": f"{skew:.3f}",
        "Kurtosis": f"{kurt:.3f}",
        "t-statistic": f"{t_stat:.3f}",
        "p-value": f"{p_value:.4f}",
        "Max DD Duration": f"{max_dd_duration} days",
    }, name=name)


# Compare strategy vs benchmark
metrics = pd.DataFrame({
    "Momentum": compute_metrics(strategy_returns, "Momentum"),
    "Buy & Hold": compute_metrics(benchmark_returns, "Buy & Hold"),
})
print(metrics)

# ──────────────────────────────────────────────
# Step 4: Apply multiple testing correction
# Suppose we tested 20 lookback periods (10, 20, ..., 200)
# ──────────────────────────────────────────────
from statsmodels.stats.multitest import multipletests

lookbacks = range(10, 210, 10)
sharpes = []
pvals = []
for lb in lookbacks:
    sig = returns.rolling(lb).mean().shift(1)
    pos = (sig > 0).astype(float)
    tr = pos.diff().abs()
    strat_ret = (pos * returns - tr * tc_per_trade).dropna()
    excess = strat_ret - 0.02 / 252
    sr = excess.mean() / excess.std() * np.sqrt(252)
    t = excess.mean() / (excess.std() / np.sqrt(len(excess)))
    p = 2 * (1 - stats.t.cdf(abs(t), df=len(excess) - 1))
    sharpes.append(sr)
    pvals.append(p)

# Apply BH correction across all 20 tested variants
reject_bh, adjusted_p, _, _ = multipletests(pvals, method="fdr_bh")

results = pd.DataFrame({
    "Lookback": list(lookbacks),
    "Sharpe": sharpes,
    "Raw p-value": pvals,
    "BH adjusted p": adjusted_p,
    "Significant (BH)": reject_bh,
})
print("\n=== Multiple Testing Correction ===")
print(results.to_string(index=False))
```
10. Chapter Summary
| Statistics Concept | Backtesting Application | Key Warning |
|---|---|---|
| Signal-to-noise ratio | Sharpe ratio | Requires years to confirm statistically |
| Lower partial moment | Sortino ratio | Differentiates upside from downside risk |
| Path statistic / running max | Maximum drawdown | Captures investor experience, not just distribution |
| Multiple hypothesis testing | Strategy selection correction | Testing 1000 strategies guarantees false discoveries |
| Training vs test error gap | IS vs OOS degradation | Large gap = overfitting |
| Selection bias | Survivorship bias | Use survivorship-free databases |
| Temporal cross-validation | Walk-forward analysis | The only honest evaluation method |
The single most important question to ask about any backtest: how many strategies were tested to find this one? The answer determines whether the reported Sharpe ratio is genuine alpha or statistical noise. A Sharpe of 1.5 from the first strategy you tested is profoundly different from a Sharpe of 1.5 that was the best of 10,000 candidates. The statistics are identical; only the interpretation changes.