Learn Without Walls

Module 18: Backtesting & Strategy Evaluation

Judging investment strategies with the rigor of statistical hypothesis testing

Part 4 of 5 · Module 18 of 22

Introduction: Judging Strategies Like a Statistician

Backtesting is the process of evaluating a trading strategy on historical data. It is the financial equivalent of out-of-sample model validation — but with far more ways to go wrong. A beautifully backtested strategy that earned 40% per year with a Sharpe of 3.0 might be nothing more than an artifact of overfitting, data snooping, and biased simulation. This module equips you to evaluate strategies with the same rigor you would apply to a statistical study.

Stats Bridge

Backtesting is hypothesis testing applied to investment strategies. The null hypothesis is that the strategy has no alpha (excess risk-adjusted return). The test statistic is the Sharpe ratio. The critical question is always: is the observed performance statistically distinguishable from luck? Every concept in this module maps to a familiar statistical idea.

1. The Sharpe Ratio: Finance's Signal-to-Noise Ratio

1.1 Definition

SR = E[R_p − R_f] / σ(R_p − R_f) = μ_excess / σ_excess

where R_p is the portfolio return, R_f is the risk-free rate, μ_excess is the mean excess return, and σ_excess is the standard deviation of excess returns.

Stats Bridge

The Sharpe ratio is literally the signal-to-noise ratio (SNR) of the excess return process. It is also proportional to the t-statistic for testing whether the mean excess return is significantly different from zero: t = SR × √n, where n is the number of observations. A Sharpe ratio of 0.5 per year with 10 years of data gives t = 0.5 × √10 ≈ 1.58, which is not significant at the 5% level.
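The Sharpe-to-t conversion is worth checking directly; a minimal sketch of the example above, using annual observations:

```python
import numpy as np
from scipy import stats

def sharpe_t_stat(annual_sharpe, n_years):
    """t-statistic for H0: mean excess return = 0, from n_years annual observations."""
    return annual_sharpe * np.sqrt(n_years)

t = sharpe_t_stat(0.5, 10)          # the example from the text
p = 2 * (1 - stats.norm.cdf(t))     # two-sided p-value, normal approximation
print(f"t = {t:.2f}, p = {p:.3f}")  # t ≈ 1.58: not significant at the 5% level
```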

1.2 Interpreting the Sharpe Ratio

| Annualized Sharpe | Quality | Context |
| --- | --- | --- |
| < 0.5 | Poor | Not distinguishable from random noise |
| 0.5 – 1.0 | Acceptable | Typical for long-only equity strategies |
| 1.0 – 2.0 | Good | Achievable by well-designed quantitative strategies |
| 2.0 – 3.0 | Excellent | Top-tier hedge funds; warrants scrutiny for overfitting |
| > 3.0 | Suspicious | Very likely backtest artifact; almost never achieved live |

1.3 Annualizing the Sharpe Ratio

SR_annual = SR_daily × √252

This scaling assumes returns are independently and identically distributed (IID). The √252 factor comes from the fact that the variance of the sum of n IID variables scales as n, so the standard deviation scales as √n, while the mean scales as n. Thus the ratio of mean to standard deviation scales as √n.

Common Pitfall

The IID assumption behind √252 scaling is wrong for most strategies. If returns are positively autocorrelated (momentum strategies), the annualized Sharpe is overstated. If returns are negatively autocorrelated (mean-reversion strategies), it is understated. The correction (Lo, 2002) divides the naive scaling by a factor built from the autocorrelation structure: SR_annual = SR_daily × √252 / √(1 + 2∑_{k=1}^{251} (1 − k/252) ρ_k), where ρ_k is the lag-k autocorrelation of daily returns.
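A sketch of the autocorrelation-adjusted annualization, assuming the ρ_k are estimated from sample autocorrelations and truncated at 20 lags (higher lags are assumed negligible, which is an assumption, not part of the formula):

```python
import numpy as np

def annualize_sharpe(daily_returns, periods=252, max_lag=20):
    """Annualize a daily Sharpe ratio, with an autocorrelation adjustment (Lo, 2002)."""
    r = np.asarray(daily_returns, dtype=float)
    sr_daily = r.mean() / r.std(ddof=1)
    c = r - r.mean()
    var = c @ c / len(r)
    # Sample autocorrelations rho_1 .. rho_max_lag
    rho = np.array([(c[:-k] @ c[k:]) / len(r) / var for k in range(1, max_lag + 1)])
    weights = 1 - np.arange(1, max_lag + 1) / periods
    adj = np.sqrt(1 + 2 * np.sum(weights * rho))
    naive = sr_daily * np.sqrt(periods)
    return naive, naive / adj

# For IID returns the adjustment factor is close to 1
rng = np.random.default_rng(0)
naive, adjusted = annualize_sharpe(rng.normal(0.0005, 0.01, 2520))
print(f"naive: {naive:.2f}, autocorrelation-adjusted: {adjusted:.2f}")
```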

1.4 Statistical Significance of the Sharpe Ratio

The estimated Sharpe ratio is approximately normally distributed around its true value:

SR̂ ∼ N(SR, (1 + ½ SR²) / n)

Simplified: SE(SR̂) ≈ 1/√n    (for small SR, e.g. under the null SR = 0)

This means you need years of data to confirm a genuine Sharpe ratio. With daily data:

| True Annual Sharpe | Daily Sharpe | Years for t = 2 (p < 0.05) |
| --- | --- | --- |
| 0.5 | 0.031 | 16 years |
| 1.0 | 0.063 | 4 years |
| 2.0 | 0.126 | 1 year |
Key Insight

A strategy with an annualized Sharpe of 0.5 requires 16 years of daily data to be statistically significant. Most backtests span 5–10 years. This means that for many realistic Sharpe ratios, we literally do not have enough data to distinguish signal from noise. This sobering fact underlies the entire difficulty of strategy evaluation.
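The "years needed" figures follow directly from t = SR_annual × √years: setting t = 2 and solving gives years = (2/SR)². A quick check reproduces the table:

```python
def years_for_significance(annual_sharpe, t_target=2.0):
    """Years of data needed for the Sharpe ratio's t-stat to reach t_target."""
    return (t_target / annual_sharpe) ** 2

for sr in [0.5, 1.0, 2.0]:
    print(f"Sharpe {sr}: {years_for_significance(sr):.0f} years")
# Sharpe 0.5: 16 years, Sharpe 1.0: 4 years, Sharpe 2.0: 1 year
```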

2. The Sortino Ratio: Penalizing Only Downside

2.1 Motivation

The Sharpe ratio treats upside and downside volatility equally. But investors do not care about upside volatility — they only fear losses. The Sortino ratio replaces total standard deviation with downside deviation.

Sortino = E[R_p − R_f] / σ_downside

where σ_downside = √(E[min(R_p − R_f, 0)²])
Stats Bridge

The Sortino ratio is the SNR computed using only the lower partial moment of the distribution. If returns are symmetric (normal), Sortino ≈ Sharpe × √2. If returns are negatively skewed (common in finance), Sortino < Sharpe × √2, meaning the downside is worse than the symmetric case would suggest. A strategy that has Sharpe = 1.5 but Sortino = 1.0 is generating its returns with significant left-tail risk.
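The √2 relation for symmetric returns reduces to σ_downside = σ/√2 for zero-mean normal returns, which a quick simulation confirms:

```python
import numpy as np

rng = np.random.default_rng(42)
r = rng.normal(0.0, 0.01, 1_000_000)  # symmetric, zero-mean returns

total_vol = r.std()
downside_vol = np.sqrt(np.mean(np.minimum(r, 0.0) ** 2))  # lower partial moment
print(f"sigma / sigma_down = {total_vol / downside_vol:.3f}")  # ≈ 1.414 = sqrt(2)
```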

3. Maximum Drawdown: A Path-Dependent Risk Measure

3.1 Definition

Maximum drawdown (MDD) is the largest peak-to-trough decline in portfolio value over the entire backtest period.

MDD = max_t [ (max_{s≤t} V_s − V_t) / max_{s≤t} V_s ]
Finance Term

Drawdown — The decline from a previous peak in portfolio value. It measures how much an investor would have lost if they bought at the peak and sold at the trough. Maximum drawdown is the worst such experience over the entire backtest.
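The running-maximum definition translates directly into code; a minimal sketch on a toy series of portfolio values:

```python
import numpy as np

def max_drawdown(values):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    values = np.asarray(values, dtype=float)
    running_max = np.maximum.accumulate(values)
    drawdowns = (running_max - values) / running_max
    return drawdowns.max()

print(max_drawdown([100, 120, 90, 110, 80]))  # 0.333...: the 120 -> 80 decline
```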

3.2 Why Drawdown Matters More Than Volatility

Volatility is a statistical abstraction; drawdown is a lived experience. A strategy with 15% annualized volatility sounds moderate, but if it experiences a 45% drawdown, most investors will panic-sell at the bottom.

3.3 Calmar Ratio

Calmar = Annualized Return / Maximum Drawdown

The Calmar ratio rewards returns and penalizes drawdowns. A Calmar above 1.0 is considered good; above 2.0 is excellent. Like the Sharpe ratio, very high Calmar ratios in a backtest should trigger suspicion of overfitting.

Stats Bridge

Maximum drawdown is a path statistic — it depends on the entire trajectory, not just the marginal distribution. Formally, it is related to the running maximum of a stochastic process. For a Brownian motion with drift μ and volatility σ, the distribution of maximum drawdown can be computed analytically, providing a benchmark to assess whether an observed drawdown is unusually large or small.

4. Complete Performance Metrics Reference

| Metric | Formula | What It Measures | Statistical Analogue |
| --- | --- | --- | --- |
| Sharpe Ratio | μ / σ | Return per unit of total risk | Signal-to-noise ratio; t-stat / √n |
| Sortino Ratio | μ / σ_down | Return per unit of downside risk | SNR using lower partial moment |
| Calmar Ratio | Ann. Return / MDD | Return per unit of worst drawdown | Mean / path maximum statistic |
| Information Ratio | α / σ(α) | Active return per unit of tracking error | t-stat of regression intercept |
| Omega Ratio | ∫ gains / ∫ losses | Probability-weighted gain/loss | Ratio of upper to lower partial moments |
| Win Rate | P(R > 0) | Fraction of profitable days | Binomial parameter p |
| Profit Factor | ∑ wins / ∑ losses | Total profit relative to total loss | Ratio of sums of positive to negative returns |
| Skewness | E[(R − μ)³] / σ³ | Asymmetry of returns | Third standardized moment |
| Kurtosis | E[(R − μ)⁴] / σ⁴ | Tail heaviness | Fourth standardized moment |
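Most of these metrics are one-liners on a return series. A sketch of the less standard ones on toy data — Omega (with threshold 0, where it coincides with the profit factor), profit factor, and win rate:

```python
import numpy as np

def omega_ratio(r, threshold=0.0):
    """Probability-weighted gains over losses relative to a threshold."""
    excess = np.asarray(r) - threshold
    return excess[excess > 0].sum() / -excess[excess < 0].sum()

def profit_factor(r):
    """Sum of winning returns divided by sum of losing returns (absolute)."""
    r = np.asarray(r)
    return r[r > 0].sum() / -r[r < 0].sum()

r = np.array([0.02, -0.01, 0.03, -0.02, 0.01])
print(f"Omega:         {omega_ratio(r):.2f}")   # 0.06 / 0.03 = 2.00
print(f"Profit factor: {profit_factor(r):.2f}") # same as Omega at threshold 0
print(f"Win rate:      {(r > 0).mean():.0%}")   # 3 of 5 days positive
```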

5. The Multiple Testing Problem in Strategy Search

5.1 The Setup

A quantitative researcher tests N candidate strategies and reports the best one. Even if all N strategies have zero true alpha, the best one will have a positive in-sample Sharpe simply due to chance. The expected maximum Sharpe ratio of N zero-alpha strategies is:

E[max SR] ≈ √(2 ln(N)) / √n

where n is the number of return observations. Testing 1,000 strategies on 10 years of daily data: E[max SR] ≈ √(2 × 6.9) / √2520 ≈ 0.074 daily ≈ 1.18 annualized. A purely random strategy search will produce Sharpe ratios above 1.0.
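The claim can be verified by Monte Carlo: simulate N strategies with zero true alpha and look at the best in-sample Sharpe. The sizes below mirror the example in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 1000, 2520  # strategies, daily observations (10 years)

# Daily returns of N strategies with zero true alpha
returns = rng.normal(0.0, 0.01, size=(N, n))
daily_sharpes = returns.mean(axis=1) / returns.std(axis=1, ddof=1)

best = daily_sharpes.max()
theory = np.sqrt(2 * np.log(N)) / np.sqrt(n)
print(f"best daily Sharpe: {best:.4f}  (theory ≈ {theory:.4f})")
print(f"annualized: {best * np.sqrt(252):.2f}")  # near 1.0 despite zero alpha
```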

Key Insight

This is why reported backtest Sharpe ratios must be deflated by the number of strategies tested. A Sharpe of 2.0 sounds impressive until you learn it was the best of 10,000 variants. Harvey, Liu, and Zhu (2016) proposed the “haircut Sharpe ratio” that adjusts for this multiple testing bias.

5.2 Haircut Sharpe Ratios

The idea is simple: given the number of strategies tested (N), adjust the observed Sharpe ratio downward to account for selection bias. Two approaches:

Bonferroni Correction

SR_adjusted = SR_observed − Φ⁻¹(1 − α/(2N)) / √n

Benjamini-Hochberg (FDR Control)

Rather than controlling the probability of any false discovery (FWER), control the expected proportion of false discoveries among those declared significant. This is less conservative and more appropriate when you expect some strategies genuinely have alpha.

Python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def haircut_sharpe(observed_sharpe, n_strategies, n_obs, method="bonferroni"):
    """
    Adjust Sharpe ratio significance for multiple testing.

    Parameters
    ----------
    observed_sharpe : float or array, annualized Sharpe ratio(s)
    n_strategies : int, total number of strategies tested
    n_obs : int, number of daily return observations
    method : str, 'bonferroni' or 'fdr_bh' (used when all p-values are supplied)

    Returns
    -------
    dict with raw and multiple-testing-adjusted significance
    """
    sharpes = np.atleast_1d(observed_sharpe).astype(float)

    # Convert annualized Sharpe to a t-statistic
    t_stats = sharpes * np.sqrt(n_obs / 252)

    # Compute p-values (two-sided)
    p_values = 2 * (1 - stats.t.cdf(np.abs(t_stats), df=n_obs - 1))

    if len(p_values) == n_strategies:
        # Full set of candidate p-values available: apply the chosen correction
        reject, adjusted_p, _, _ = multipletests(p_values, method=method)
    else:
        # Only the best of n_strategies is reported: scale its p-value by the
        # number of strategies searched (Bonferroni; BH gives the same
        # adjustment for the single smallest p-value)
        adjusted_p = np.minimum(1.0, p_values * n_strategies)
        reject = adjusted_p < 0.05

    return {
        "observed_sharpe": sharpes,
        "t_stat": t_stats,
        "raw_p": p_values,
        "adjusted_p": adjusted_p,
        "significant": reject,
    }

# Example: Test the best of 500 strategies
n_strategies = 500
n_obs = 252 * 10  # 10 years of daily data

# Suppose the best of the 500 strategies has Sharpe = 1.8
result = haircut_sharpe(1.8, n_strategies, n_obs)

print("Observed Sharpe: 1.8")
print(f"Raw p-value: {result['raw_p'][0]:.2e}")
print(f"Adjusted p (x{n_strategies} strategies): {result['adjusted_p'][0]:.2e}")
print(f"  Significant? {result['significant'][0]}")

# Minimum Sharpe needed for significance
for n_strat in [1, 10, 100, 500, 1000, 10000]:
    # Bonferroni: need p < 0.05 / n_strat
    adjusted_alpha = 0.05 / n_strat
    min_t = stats.t.ppf(1 - adjusted_alpha / 2, df=n_obs - 1)
    min_sharpe = min_t / np.sqrt(n_obs / 252)
    print(f"N={n_strat:>5d} strategies -> min Sharpe: {min_sharpe:.2f}")

6. Backtest Overfitting

6.1 In-Sample vs Out-of-Sample Degradation

The hallmark of overfitting is a large gap between in-sample (IS) and out-of-sample (OOS) performance. A strategy that earns Sharpe 2.5 in-sample but Sharpe 0.3 out-of-sample has learned the noise in the training period.

Stats Bridge

This is the exact same phenomenon as training error vs test error in machine learning. The gap between IS and OOS Sharpe ratios is analogous to the generalization gap. Models with more degrees of freedom (more parameters, more rules, more tunable thresholds) will have larger gaps. The principle of parsimony (Occam’s Razor) applies with full force.

6.2 The Probability of Backtest Overfitting (PBO)

Bailey et al. (2014) introduced a formal method to estimate the probability that a backtested strategy is overfit. The procedure:

  1. Partition the data into S blocks.
  2. Generate all combinatorially possible train/test splits using S/2 blocks for training and S/2 for testing.
  3. For each split, find the optimal strategy in-sample and measure its OOS performance.
  4. PBO = fraction of splits where the IS-optimal strategy has negative OOS Sharpe.

A PBO above 50% means the backtest is more likely overfit than not.
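A minimal sketch of this procedure, using the negative-OOS-Sharpe criterion stated above (the block count and data sizes are illustrative assumptions):

```python
import numpy as np
from itertools import combinations

def pbo(strategy_returns, n_blocks=8):
    """Probability of Backtest Overfitting via combinatorial train/test splits.

    strategy_returns: (n_obs, n_strategies) array of candidate daily returns.
    """
    n_obs = strategy_returns.shape[0]
    blocks = np.array_split(np.arange(n_obs), n_blocks)
    half = n_blocks // 2

    def sharpe(x):
        return x.mean(axis=0) / x.std(axis=0, ddof=1)

    n_bad = n_splits = 0
    for train_ids in combinations(range(n_blocks), half):
        train = np.concatenate([blocks[i] for i in train_ids])
        test = np.concatenate([blocks[i] for i in range(n_blocks)
                               if i not in train_ids])
        best = np.argmax(sharpe(strategy_returns[train]))   # IS-optimal strategy
        n_bad += sharpe(strategy_returns[test])[best] < 0   # negative OOS Sharpe?
        n_splits += 1
    return n_bad / n_splits

# 50 pure-noise strategies: PBO should hover near 50%
rng = np.random.default_rng(1)
noise = rng.normal(0.0, 0.01, size=(2000, 50))
print(f"PBO: {pbo(noise):.2f}")
```

One strategy with genuine alpha drives PBO toward zero, since the in-sample winner keeps winning out-of-sample.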

7. Walk-Forward Analysis: The Gold Standard

7.1 Procedure

  1. Divide the historical period into sequential windows.
  2. For each window, use the first portion for training (parameter fitting) and the remainder for testing.
  3. Advance the window by one step and repeat.
  4. Concatenate all OOS test periods to form a continuous OOS equity curve.
  5. Compute performance metrics only on the OOS portion.
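The steps above can be sketched for a long/flat momentum rule like the one used elsewhere in this module (window sizes and the lookback grid are illustrative assumptions):

```python
import numpy as np

def walk_forward(returns, lookbacks, train_size=504, test_size=63):
    """Walk-forward sketch: refit the momentum lookback on each training
    window, then trade the fitted rule on the following test window."""
    returns = np.asarray(returns, dtype=float)
    oos = []
    start = 0
    while start + train_size + test_size <= len(returns):
        train = returns[start : start + train_size]

        # Step 2: fit the parameter (lookback) on the training window only
        def in_sample_sharpe(lb):
            pos = np.array([1.0 if train[max(0, t - lb):t].mean() > 0 else 0.0
                            for t in range(1, len(train))])
            r = pos * train[1:]
            return r.mean() / r.std() if r.std() > 0 else -np.inf
        best_lb = max(lookbacks, key=in_sample_sharpe)

        # Steps 3-4: trade the fitted rule on the unseen test window
        hist = returns[start : start + train_size + test_size]
        for t in range(train_size, train_size + test_size):
            signal = 1.0 if hist[t - best_lb:t].mean() > 0 else 0.0
            oos.append(signal * hist[t])
        start += test_size

    # Step 5: metrics on the concatenated OOS equity curve only
    oos = np.array(oos)
    return oos.mean() / oos.std() * np.sqrt(252)

# On pure noise the OOS Sharpe should be near zero
rng = np.random.default_rng(2)
print(f"OOS Sharpe on noise: {walk_forward(rng.normal(0, 0.01, 2520), [20, 50, 100]):.2f}")
```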
Key Insight

Walk-forward analysis is the gold standard because every data point in the evaluation is truly out-of-sample: the model was trained only on data that preceded it. The resulting performance metrics are a far more honest estimate of what you would have earned in live trading (though not perfectly unbiased, since the walk-forward design itself can be tuned). If the walk-forward Sharpe is 0.4 and the full-sample backtest Sharpe is 1.8, the strategy is overfit and the 0.4 figure is the honest one.

8. Common Backtesting Pitfalls

8.1 Look-Ahead Bias

Using information that was not available at the time of the trading decision. Classic examples: generating a signal from today's closing price and assuming execution at that same close; using restated financial statements rather than the figures originally reported; defining the trading universe from today's index constituents.

8.2 Survivorship Bias

Finance Term

Survivorship Bias — The error of analyzing only assets that survived to the present, ignoring those that were delisted, went bankrupt, or were acquired. This inflates historical performance because the worst performers are excluded.

Example: If you backtest a strategy on the current S&P 500 constituents going back 20 years, you are only testing on companies that survived 20 years. Companies that went bankrupt (Lehman Brothers, Enron, WorldCom) are excluded, removing the worst outcomes and biasing returns upward.

Stats Bridge

Survivorship bias is a form of selection bias — the sample is not representative of the population. In survival analysis, this is analogous to left truncation: you only observe individuals who survived long enough to enter the study. The solution is the same in both fields: use a survivorship-bias-free database that includes delisted securities.

8.3 Transaction Cost Neglect

A strategy that trades frequently looks great without costs but may be deeply unprofitable after commissions, slippage, and market impact. Key cost components:

| Cost Component | Typical Magnitude | Affected By |
| --- | --- | --- |
| Commissions | $0.001 – $0.005 per share | Broker, volume |
| Bid-ask spread | 0.01% – 0.50% per trade | Liquidity, market cap |
| Market impact | 0.05% – 1%+ per trade | Order size relative to volume |
| Slippage | 0.01% – 0.10% per trade | Execution speed, volatility |

8.4 Overfitting to a Specific Period

A strategy optimized on 2010–2020 may have learned patterns specific to the post-GFC low-rate, low-volatility, QE-driven bull market. It will likely fail in a different regime (rising rates, high inflation, bear market). This is the non-stationarity problem from Module 16 applied to strategy design.

Common Pitfall

The most insidious form of overfitting is implicit: the researcher adjusts the strategy based on visual inspection of the equity curve, adds filters to avoid specific drawdowns, or changes parameters until the backtest “looks right.” Each adjustment is an implicit test that inflates the total number of strategies tried, even if only one is formally reported. Always count your degrees of freedom honestly.

9. Python: Complete Momentum Strategy Backtest

Python
import numpy as np
import pandas as pd
import yfinance as yf
from scipy import stats

# ──────────────────────────────────────────────
# Step 1: Download data
# ──────────────────────────────────────────────
spy = yf.download("SPY", start="2005-01-01", end="2023-12-31", auto_adjust=False)
returns = spy["Adj Close"].squeeze().pct_change().dropna()  # squeeze: recent yfinance returns a 1-column frame

# ──────────────────────────────────────────────
# Step 2: Simple momentum strategy
#   Long SPY when 50-day return > 0, else flat (cash)
# ──────────────────────────────────────────────
lookback = 50
momentum_signal = returns.rolling(lookback).mean().shift(1)  # Shifted to avoid look-ahead
position = (momentum_signal > 0).astype(float)

# Strategy returns (assume 5bps transaction cost per trade)
trades = position.diff().abs()
tc_per_trade = 0.0005
strategy_returns = position * returns - trades * tc_per_trade
strategy_returns = strategy_returns.dropna()

# Benchmark: buy and hold
benchmark_returns = returns.loc[strategy_returns.index]

# ──────────────────────────────────────────────
# Step 3: Compute all performance metrics
# ──────────────────────────────────────────────
def compute_metrics(returns, name="Strategy", rf_annual=0.02):
    """Compute comprehensive backtest metrics."""
    rf_daily = rf_annual / 252
    excess = returns - rf_daily
    n = len(returns)

    # Basic stats
    ann_return = returns.mean() * 252
    ann_vol = returns.std() * np.sqrt(252)

    # Sharpe ratio
    sharpe = excess.mean() / excess.std() * np.sqrt(252)

    # Sortino ratio
    downside = excess[excess < 0]
    downside_vol = np.sqrt((excess.clip(upper=0)**2).mean()) * np.sqrt(252)
    sortino = excess.mean() * 252 / downside_vol if downside_vol > 0 else np.nan

    # Maximum drawdown
    cum_returns = (1 + returns).cumprod()
    running_max = cum_returns.cummax()
    drawdown = (cum_returns - running_max) / running_max
    max_dd = drawdown.min()

    # Calmar ratio
    calmar = ann_return / abs(max_dd) if max_dd != 0 else np.nan

    # Win rate
    win_rate = (returns > 0).mean()

    # Statistical significance
    t_stat = excess.mean() / (excess.std() / np.sqrt(n))
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n-1))

    # Skewness and kurtosis
    skew = returns.skew()
    kurt = returns.kurtosis()

    # Maximum drawdown duration
    dd_duration = 0
    max_dd_duration = 0
    for d in drawdown:
        if d < 0:
            dd_duration += 1
            max_dd_duration = max(max_dd_duration, dd_duration)
        else:
            dd_duration = 0

    return pd.Series({
        "Ann. Return": f"{ann_return:.2%}",
        "Ann. Volatility": f"{ann_vol:.2%}",
        "Sharpe Ratio": f"{sharpe:.3f}",
        "Sortino Ratio": f"{sortino:.3f}",
        "Max Drawdown": f"{max_dd:.2%}",
        "Calmar Ratio": f"{calmar:.3f}",
        "Win Rate": f"{win_rate:.2%}",
        "Skewness": f"{skew:.3f}",
        "Kurtosis": f"{kurt:.3f}",
        "t-statistic": f"{t_stat:.3f}",
        "p-value": f"{p_value:.4f}",
        "Max DD Duration": f"{max_dd_duration} days",
    }, name=name)

# Compare strategy vs benchmark
metrics = pd.DataFrame({
    "Momentum": compute_metrics(strategy_returns, "Momentum"),
    "Buy & Hold": compute_metrics(benchmark_returns, "Buy & Hold"),
})
print(metrics)

# ──────────────────────────────────────────────
# Step 4: Apply multiple testing correction
#   Suppose we tested 20 lookback periods (10, 20, ..., 200)
# ──────────────────────────────────────────────
from statsmodels.stats.multitest import multipletests

lookbacks = range(10, 210, 10)
sharpes = []
pvals = []

for lb in lookbacks:
    sig = returns.rolling(lb).mean().shift(1)
    pos = (sig > 0).astype(float)
    tr = pos.diff().abs()
    strat_ret = (pos * returns - tr * tc_per_trade).dropna()

    excess = strat_ret - 0.02/252
    sr = excess.mean() / excess.std() * np.sqrt(252)
    t = excess.mean() / (excess.std() / np.sqrt(len(excess)))
    p = 2 * (1 - stats.t.cdf(abs(t), df=len(excess)-1))
    sharpes.append(sr)
    pvals.append(p)

# Apply BH correction
reject_bh, adjusted_p, _, _ = multipletests(pvals, method="fdr_bh")

results = pd.DataFrame({
    "Lookback": list(lookbacks),
    "Sharpe": sharpes,
    "Raw p-value": pvals,
    "BH adjusted p": adjusted_p,
    "Significant (BH)": reject_bh,
})
print("\n=== Multiple Testing Correction ===")
print(results.to_string(index=False))

10. Chapter Summary

| Statistics Concept | Backtesting Application | Key Warning |
| --- | --- | --- |
| Signal-to-noise ratio | Sharpe ratio | Requires years of data to confirm statistically |
| Lower partial moment | Sortino ratio | Differentiates upside from downside risk |
| Path statistic / running max | Maximum drawdown | Captures investor experience, not just the marginal distribution |
| Multiple hypothesis testing | Strategy selection correction | Testing 1,000 strategies guarantees false discoveries |
| Training vs test error gap | IS vs OOS degradation | Large gap = overfitting |
| Selection bias | Survivorship bias | Use survivorship-bias-free databases |
| Temporal cross-validation | Walk-forward analysis | The most honest evaluation method |
Key Insight

The single most important question to ask about any backtest: how many strategies were tested to find this one? The answer determines whether the reported Sharpe ratio is genuine alpha or statistical noise. A Sharpe of 1.5 from the first strategy you tested is profoundly different from a Sharpe of 1.5 that was the best of 10,000 candidates. The statistics are identical; only the interpretation changes.