Module 05: Missing Data & Survivorship Bias
The data you do not see can bias everything you measure
1. Introduction: The Data You Do Not See
Every statistician knows that missing data can bias estimates. What makes financial data special is that observations go missing in ways directly correlated with the very quantity you are trying to measure. The companies that disappeared from the stock market are disproportionately the ones that failed. The hedge funds that stopped reporting are disproportionately the ones that performed poorly. And the trading strategies that never made it into textbooks are disproportionately the ones that did not work.
This module applies the statistical framework of missing data — MCAR, MAR, MNAR — to financial datasets, and quantifies how large the resulting biases can be.
2. Survivorship Bias: The Silent Killer of Backtests
2.1 What Is Survivorship Bias?
Imagine you want to evaluate the average performance of US stocks over the past 20 years. If you download today’s S&P 500 constituents and look up their historical returns, you are only looking at the 500 companies that are successful today. Companies that went bankrupt, were acquired at distressed prices, or shrank out of the index are excluded. Your sample is conditioned on the outcome variable — a textbook selection bias.
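The conditioning-on-the-outcome problem can be seen in a toy simulation (the numbers here are purely illustrative): draw final returns for a universe of firms, then keep only the firms whose value at least doubled, as if building a sample from today's index constituents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 20-year total returns for 10,000 firms (log scale)
total_log_returns = rng.normal(loc=0.5, scale=1.0, size=10_000)

# "Today's index" contains only firms whose value at least doubled,
# i.e. total log return above log(2) -- conditioning on the outcome
survivors = total_log_returns[total_log_returns > np.log(2)]

print(f"Full-universe mean log return: {total_log_returns.mean():.3f}")
print(f"Survivor-only mean log return: {survivors.mean():.3f}")
# The survivor-only mean is mechanically higher: we selected the sample
# on the very variable we are trying to estimate.
```

Nothing about the survivors' return process differed from the rest of the universe; the gap between the two means is pure selection bias.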
2.2 The Magnitude of the Bias
Academic research has quantified the survivorship bias in various contexts:
| Context | Estimated Bias | Source |
|---|---|---|
| US mutual fund average returns | +0.9% to +1.5% per year | Elton, Gruber, Blake (1996) |
| Hedge fund database returns | +1.4% to +3.6% per year | Malkiel, Saha (2005) |
| US stock universe average returns | +0.5% to +1.0% per year | Various studies using CRSP |
| International equity indices | +1.0% to +2.5% per year | Dimson, Marsh, Staunton (2002) |
2.3 A Concrete Example
Consider the Dow Jones Industrial Average. In 2005, the DJIA included General Motors (GM). By 2009, GM had filed for bankruptcy and was removed from the index. If you backtest a strategy on the “current DJIA constituents” going back to 2005, you would not include GM’s catastrophic decline, making the index appear to have performed better than it actually did.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulate the survivorship bias effect
# We will create a universe of 500 stocks and show what happens
# when we only analyze the survivors.
np.random.seed(42)

n_stocks = 500
n_years = 20
n_days = n_years * 252

# Each stock has a random drift and volatility
mu_annual = np.random.normal(0.07, 0.08, n_stocks)      # avg 7% with dispersion
sigma_annual = np.random.uniform(0.15, 0.50, n_stocks)  # 15-50% annual vol

# Convert to daily
mu_daily = mu_annual / 252
sigma_daily = sigma_annual / np.sqrt(252)

# Simulate price paths
all_log_returns = np.zeros((n_days, n_stocks))
for i in range(n_stocks):
    all_log_returns[:, i] = np.random.normal(mu_daily[i], sigma_daily[i], n_days)

# Cumulative returns (starting at $100)
log_prices = np.cumsum(all_log_returns, axis=0)
prices = 100 * np.exp(log_prices)

# Define "delisted" as price falling below $1 (penny stock threshold)
# Once delisted, the stock is gone forever
survived = np.ones(n_stocks, dtype=bool)
delisting_day = np.full(n_stocks, n_days)  # default: survived entire period

for i in range(n_stocks):
    below_threshold = np.where(prices[:, i] < 1.0)[0]
    if len(below_threshold) > 0:
        survived[i] = False
        delisting_day[i] = below_threshold[0]

n_survived = survived.sum()
n_failed = (~survived).sum()

print(f"Universe: {n_stocks} stocks over {n_years} years")
print(f"Survived: {n_survived} ({n_survived/n_stocks:.1%})")
print(f"Failed: {n_failed} ({n_failed/n_stocks:.1%})")
```
2.4 Quantifying the Bias
```python
# Compute annualized returns for each stock
# For failed stocks, use the return up to the delisting day
annualized_returns = np.zeros(n_stocks)
for i in range(n_stocks):
    end_day = delisting_day[i] if not survived[i] else n_days - 1
    if end_day > 0:
        total_log_ret = log_prices[end_day, i] - log_prices[0, i]
        years = end_day / 252
        annualized_returns[i] = total_log_ret / years
    else:
        annualized_returns[i] = -1.0  # total loss on day 1

# For failed stocks that we "can't see," assign -100% return
# (In reality, their final return is the loss at delisting)

# Compare survivor-only vs full universe
mean_survivor = annualized_returns[survived].mean()
mean_all = annualized_returns.mean()
bias = mean_survivor - mean_all

print(f"\n=== Survivorship Bias Quantification ===")
print(f"Mean annualized return (survivors only): {mean_survivor:.4f} ({mean_survivor*100:.2f}%)")
print(f"Mean annualized return (full universe): {mean_all:.4f} ({mean_all*100:.2f}%)")
print(f"Survivorship bias: {bias:.4f} ({bias*100:.2f}%/year)")
print(f"Over {n_years} years, this compounds to: {((1+mean_survivor)**n_years / (1+mean_all)**n_years - 1)*100:.1f}% difference")
```
2.5 Visualizing Survivorship Bias
```python
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Top-left: sample of surviving price paths
survivor_idx = np.where(survived)[0][:30]
for i in survivor_idx:
    axes[0, 0].plot(prices[:, i], alpha=0.4, linewidth=0.5, color='#38a169')
axes[0, 0].set_title('Survivors Only (What You See)')
axes[0, 0].set_ylabel('Price ($)')
axes[0, 0].set_yscale('log')

# Top-right: sample of ALL price paths (including failed)
all_idx = np.random.choice(n_stocks, 50, replace=False)
for i in all_idx:
    end = delisting_day[i] if not survived[i] else n_days
    color = '#38a169' if survived[i] else '#e53e3e'
    axes[0, 1].plot(prices[:end, i], alpha=0.4, linewidth=0.5, color=color)
axes[0, 1].set_title('Full Universe (What Actually Happened)')
axes[0, 1].set_ylabel('Price ($)')
axes[0, 1].set_yscale('log')

# Bottom-left: distribution of returns (survivors vs all)
axes[1, 0].hist(annualized_returns[survived] * 100, bins=40, alpha=0.6,
                color='#38a169', label='Survivors', density=True)
axes[1, 0].hist(annualized_returns * 100, bins=40, alpha=0.4,
                color='#e53e3e', label='All stocks', density=True)
axes[1, 0].axvline(x=mean_survivor * 100, color='#38a169', linestyle='--', linewidth=2)
axes[1, 0].axvline(x=mean_all * 100, color='#e53e3e', linestyle='--', linewidth=2)
axes[1, 0].set_xlabel('Annualized Return (%)')
axes[1, 0].set_title('Return Distributions')
axes[1, 0].legend()

# Bottom-right: cumulative average return over time
# Equal-weighted portfolio of survivors vs all
survivor_portfolio = np.zeros(n_days)
all_portfolio = np.zeros(n_days)
for t in range(n_days):
    active = [i for i in range(n_stocks) if delisting_day[i] > t]
    active_survivors = [i for i in range(n_stocks) if survived[i]]
    if len(active) > 0:
        all_portfolio[t] = all_log_returns[t, active].mean()
    if len(active_survivors) > 0:
        survivor_portfolio[t] = all_log_returns[t, active_survivors].mean()

cum_all = np.exp(np.cumsum(all_portfolio)) * 100
cum_survivor = np.exp(np.cumsum(survivor_portfolio)) * 100

axes[1, 1].plot(cum_survivor, color='#38a169', linewidth=1.5, label='Survivors only')
axes[1, 1].plot(cum_all, color='#e53e3e', linewidth=1.5, label='Full universe')
axes[1, 1].set_title('Cumulative Growth of $100')
axes[1, 1].set_ylabel('Portfolio Value ($)')
axes[1, 1].legend()

plt.tight_layout()
plt.show()
```
3. Hedge Fund Database Biases
3.1 Self-Selection Bias
Hedge funds are not required to report their returns to any database. Reporting is voluntary. This creates a self-selection problem: funds that choose to report may be systematically different from those that do not.
- Funds with strong recent performance are more likely to report (to attract new investors).
- Funds that are closing or performing poorly often stop reporting before they shut down.
- New funds may only begin reporting once they have established a good track record.
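The mechanism in these bullets can be sketched with a simulation (the reporting rule is an assumption for illustration, not an empirical model): every fund has an identical return process, but a fund publishes its month-t return only after observing it, and only if its trailing 12-month return is positive.

```python
import numpy as np

rng = np.random.default_rng(7)
n_funds, n_months = 2000, 60

# Every fund has the same true return process
true_rets = rng.normal(0.005, 0.03, size=(n_months, n_funds))

# Self-selection rule (assumed for illustration): a fund decides whether
# to report month t after seeing its return, reporting only if the
# trailing 12-month return (including month t) is positive
reported = np.full_like(true_rets, np.nan)
for t in range(11, n_months):
    trailing = true_rets[t-11:t+1].sum(axis=0)
    ok = trailing > 0
    reported[t, ok] = true_rets[t, ok]

true_mean = true_rets[11:].mean()
reported_mean = np.nanmean(reported[11:])
print(f"True mean monthly return:     {true_mean*100:.3f}%")
print(f"Reported mean monthly return: {reported_mean*100:.3f}%")
# The reported average is inflated even though every fund has the same
# expected return: missingness depends on the missing values (MNAR)
```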
3.2 Backfill (Instant History) Bias
Backfill bias arises when a fund joins a database and its entire prior track record is added (“backfilled”) retroactively. Because funds typically begin reporting only after a successful incubation period, the backfilled history is unrepresentatively good. Estimates suggest backfill bias inflates hedge fund database returns by approximately 1.4% per year on average. To mitigate this, some researchers discard the first 12–24 months of each fund’s history in the database.
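A minimal sketch of both the bias and the mitigation (all parameters here are illustrative assumptions): every fund has the same expected return, but only funds whose first 24 “incubation” months were profitable enter the database, bringing their backfilled history with them. Dropping the first 24 months of each fund’s database history removes the conditioning.

```python
import numpy as np

rng = np.random.default_rng(42)
n_funds, n_months = 1000, 72

# Identical true return process for every fund
rets = rng.normal(0.005, 0.03, size=(n_months, n_funds))

# Backfill rule (assumed for illustration): a fund joins the database
# after month 24 only if its incubation-period return was positive;
# its first 24 months are then backfilled into the database
incubation = rets[:24].sum(axis=0)
listed = incubation > 0

db_mean = rets[:, listed].mean()         # naive database average (with backfill)
trimmed_mean = rets[24:, listed].mean()  # after dropping backfilled history
true_mean = rets.mean()

print(f"True mean:                    {true_mean*100:.3f}%/month")
print(f"Database mean (backfilled):   {db_mean*100:.3f}%/month")
print(f"After dropping first 24 mo.:  {trimmed_mean*100:.3f}%/month")
```

The trimmed mean is unbiased because the post-incubation months are independent of the selection rule; the naive database mean is not.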
3.3 Look-Ahead Bias
Look-ahead bias occurs when your analysis uses information that was not available at the time. In finance, common examples include:
- Using revised GDP figures instead of the preliminary release that was actually available to traders
- Using restated earnings instead of the originally reported figures
- Selecting stocks based on future membership in an index (e.g., testing “S&P 500 stocks” using today’s constituents applied to historical data)
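A minimal illustration of the mechanism on simulated noise (not real data, and the trailing-return signal is a hypothetical example): a signal that accidentally includes the current day’s return looks spectacularly profitable, while the properly lagged version earns essentially nothing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
rets = pd.Series(rng.normal(0, 0.01, 2520))  # 10 years of pure-noise daily returns

# Signal: sign of the trailing 5-day return
signal = np.sign(rets.rolling(5).sum())

# WRONG: trade today on a signal whose window includes today's return,
# i.e. information not available at the time -- look-ahead bias
lookahead_pnl = (signal * rets).mean()

# RIGHT: lag the signal by one day so only past data is used
correct_pnl = (signal.shift(1) * rets).mean()

print(f"With look-ahead: {lookahead_pnl*25200:.2f}%/year")
print(f"Properly lagged: {correct_pnl*25200:.2f}%/year")
# The look-ahead version appears hugely profitable on returns that are,
# by construction, unpredictable
```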
3.4 Summary of Hedge Fund Biases
| Bias | Direction | Estimated Magnitude | Statistical Classification |
|---|---|---|---|
| Survivorship bias | Upward | +1.4% to +3.6%/year | MNAR (selection on outcome) |
| Backfill bias | Upward | +1.2% to +1.4%/year | MNAR (selection on past performance) |
| Self-selection bias | Upward | Varies | MNAR (voluntary reporting) |
| Look-ahead bias | Upward | Varies | Data leakage (not missing data per se) |
| Combined effect | Upward | +3% to +5%/year possible | — |
4. Missing Data Framework Applied to Finance
4.1 MCAR, MAR, and MNAR in Financial Context
Rubin (1976) classified missing data mechanisms into three categories. Here is how each manifests in financial data:
| Mechanism | Definition | Financial Example | Severity |
|---|---|---|---|
| MCAR (Missing Completely At Random) | P(missing) is independent of both observed and unobserved data | A data vendor’s server crashes randomly, causing occasional gaps in the feed | Benign: complete-case analysis is unbiased |
| MAR (Missing At Random) | P(missing) depends on observed data but not on the missing value itself | Small-cap stocks have more missing data because fewer analysts cover them (missingness depends on market cap, which we observe) | Moderate: can correct with proper imputation methods |
| MNAR (Missing Not At Random) | P(missing) depends on the missing value itself | A stock is missing because it went bankrupt (missingness depends on the return, which is −100%) | Severe: cannot be fully corrected without external information |
4.2 Testing the Missing Data Mechanism
```python
# Simulate and test whether missingness is related to the outcome
# This mimics a hedge fund database where poorly performing funds stop reporting
np.random.seed(123)

n_funds = 500
n_months = 120  # 10 years

# True monthly returns (normally distributed for simplicity)
true_returns = np.random.normal(0.005, 0.03, (n_months, n_funds))

# MNAR mechanism: funds with cumulative returns below -20% stop reporting
# (they shut down and are removed from the database)
observed_returns = true_returns.copy()
active = np.ones(n_funds, dtype=bool)
stop_month = np.full(n_funds, n_months)

for t in range(1, n_months):
    cum_ret = true_returns[:t+1].sum(axis=0)
    newly_dead = active & (cum_ret < -0.20)
    stop_month[newly_dead] = t
    active[newly_dead] = False
    observed_returns[t, ~active] = np.nan

# Compare complete-case vs true statistics
true_mean_monthly = np.nanmean(true_returns)
observed_mean_monthly = np.nanmean(observed_returns)
bias = observed_mean_monthly - true_mean_monthly

print(f"=== MNAR Simulation: Hedge Fund Database ===")
print(f"Funds that stopped reporting: {(~active).sum()} of {n_funds}")
print(f"True mean monthly return: {true_mean_monthly*100:.4f}%")
print(f"Observed mean monthly return: {observed_mean_monthly*100:.4f}%")
print(f"Bias: {bias*100:.4f}%/month = {bias*1200:.2f}%/year")

# Test whether missingness is related to returns (it is, by construction)
# For each fund, compute its average return before it stopped reporting
before_stop = []
for i in range(n_funds):
    if stop_month[i] < n_months:
        before_stop.append(true_returns[:stop_month[i], i].mean())

still_active = [true_returns[:, i].mean() for i in range(n_funds) if active[i]]

print(f"\nMean return of funds that stopped: {np.mean(before_stop)*100:.4f}%/month")
print(f"Mean return of surviving funds: {np.mean(still_active)*100:.4f}%/month")
print(f"This confirms MNAR: stopping is correlated with performance")
```
5. Weekend and Holiday Gaps: Not Really Missing Data
5.1 Structural vs Random Gaps
Financial time series have gaps on weekends and holidays. These are structural gaps — they are perfectly predictable and affect all stocks equally. They are not missing data in the Rubin sense because there is no underlying value that “should” have been observed.
Because these gaps are deterministic, the right approach is to index the data on a business-day calendar rather than to impute weekend values. pandas provides pd.bdate_range() and CustomBusinessDay for exactly this purpose.
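A short sketch of the pandas tools mentioned above (the two holiday dates are illustrative, not a full NYSE calendar):

```python
import pandas as pd

# Business days in January 2024 -- weekends are skipped automatically
bdays = pd.bdate_range("2024-01-01", "2024-01-31")
print(len(bdays))  # 23 weekdays

# A custom calendar that also skips market holidays (here, New Year's Day
# and Martin Luther King Jr. Day; a real calendar would list them all)
trading_days = pd.bdate_range(
    "2024-01-01", "2024-01-31", freq="C",
    holidays=[pd.Timestamp("2024-01-01"), pd.Timestamp("2024-01-15")],
)
print(len(trading_days))  # 21 trading days
```

Reindexing a price series onto such a calendar distinguishes a genuine data gap (a NaN on a trading day) from a structural one (a date that is simply absent from the index).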
5.2 Handling Holidays Across Markets
```python
import yfinance as yf
import pandas as pd

# US stock (NYSE) and Japanese stock (TSE) have different holidays
# auto_adjust=False keeps the "Adj Close" column used below
us_stock = yf.download("AAPL", start="2024-01-01", end="2024-03-01",
                       progress=False, auto_adjust=False)
jp_stock = yf.download("7203.T", start="2024-01-01", end="2024-03-01",
                       progress=False, auto_adjust=False)

print(f"AAPL trading days in Jan-Feb 2024: {len(us_stock)}")
print(f"Toyota trading days in Jan-Feb 2024: {len(jp_stock)}")

# Find dates where one market was open and the other closed
us_dates = set(us_stock.index)
jp_dates = set(jp_stock.index)
us_only = us_dates - jp_dates
jp_only = jp_dates - us_dates
print(f"\nDays only US was open: {len(us_only)}")
print(f"Days only Japan was open: {len(jp_only)}")

# When merging, you must decide: inner join (common days only)
# or outer join with NaN handling
inner = pd.merge(us_stock[["Adj Close"]], jp_stock[["Adj Close"]],
                 left_index=True, right_index=True,
                 suffixes=("_US", "_JP"), how="inner")
outer = pd.merge(us_stock[["Adj Close"]], jp_stock[["Adj Close"]],
                 left_index=True, right_index=True,
                 suffixes=("_US", "_JP"), how="outer")
print(f"\nInner join: {len(inner)} common trading days")
print(f"Outer join: {len(outer)} total days ({outer.isnull().sum().sum()} NaN values)")
```
5.3 Thin Trading and Stale Prices
Some assets — particularly small-cap stocks, corporate bonds, and illiquid securities — may not trade every day. When this happens, the reported “closing price” is typically the price of the last trade, which may have occurred hours or even days earlier. This is called a stale price.
Stale prices create artificial serial correlation in returns and artificial cross-correlation patterns. The Scholes-Williams and Dimson beta estimators were developed specifically to correct for this effect.
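The Dimson correction can be sketched on simulated data (the staleness mechanism below is a deliberately simple assumption: the observed return reflects the market move with a one-day delay half the time). The estimator regresses the stock’s return on the lead, contemporaneous, and lagged market returns and sums the slopes.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
market = rng.normal(0, 0.01, n)
true_beta = 1.2

# Thin trading (assumed mechanism): half of each market move shows up
# in the observed stock return only on the following day
noise = rng.normal(0, 0.005, n)
stock = true_beta * (0.5 * market + 0.5 * np.roll(market, 1)) + noise
stock[0] = 0.0  # discard the wrap-around artifact from np.roll

# Naive OLS beta is biased toward zero by the staleness
naive_beta = np.cov(stock, market)[0, 1] / np.var(market)

# Dimson beta: regress on market(t+1), market(t), market(t-1), sum the slopes
X = np.column_stack([
    np.ones(n - 2),
    market[2:],      # lead
    market[1:-1],    # contemporaneous
    market[:-2],     # lag
])
y = stock[1:-1]
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
dimson_beta = coefs[1:].sum()

print(f"True beta:   {true_beta:.2f}")
print(f"Naive beta:  {naive_beta:.2f}")
print(f"Dimson beta: {dimson_beta:.2f}")
```

The naive estimate recovers only the contemporaneous half of the exposure; summing the lead/lag slopes recovers the full beta.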
6. Delisted Stocks and What Happens to Their Data
6.1 Why Stocks Get Delisted
A stock is delisted when it is removed from an exchange. This can happen for several reasons, each with different implications for the data:
| Reason | Typical Final Return | Data Availability |
|---|---|---|
| Bankruptcy (Chapter 7) | −100% (total loss) | Often disappears from free databases |
| Bankruptcy (Chapter 11) | −80% to −100% | May have residual value; data often incomplete |
| Merger/Acquisition (premium) | +20% to +50% (takeover premium) | Data ends at acquisition date |
| Going private | +15% to +30% | Data ends at privatization |
| Price too low (exchange rules) | Varies (often negative) | May continue on OTC markets |
| Compliance violations | Varies | Data may be retroactively removed |
6.2 The Delisting Return Problem
When a stock is delisted, the delisting return — the return from the last traded price to the actual value received by shareholders — is often missing or difficult to determine. CRSP provides delisting returns for most US stocks, but many other databases do not.
```python
# Demonstrate the impact of missing delisting returns
np.random.seed(99)

n_sim = 10000
n_stocks_sim = 100

# Track the bias from ignoring vs including delisting returns
biases = []
for _ in range(n_sim):
    # Each stock has a 2% annual probability of delisting for negative reasons
    # and a 1% annual probability of delisting for positive reasons (acquisition)
    monthly_prob_neg = 1 - (1 - 0.02) ** (1/12)
    monthly_prob_pos = 1 - (1 - 0.01) ** (1/12)

    returns = np.random.normal(0.006, 0.04, n_stocks_sim)  # avg monthly return

    # Apply delisting
    delisted_neg = np.random.random(n_stocks_sim) < monthly_prob_neg
    delisted_pos = np.random.random(n_stocks_sim) < monthly_prob_pos

    # Negative delistings: replace return with -30% (typical bankruptcy return)
    returns[delisted_neg] = -0.30
    # Positive delistings: replace return with +25% (typical takeover premium)
    returns[delisted_pos & ~delisted_neg] = 0.25

    # True average includes delisting returns
    true_avg = returns.mean()

    # "Observed" average drops delisted stocks entirely
    observed = returns[~delisted_neg & ~delisted_pos]
    if len(observed) > 0:
        obs_avg = observed.mean()
        biases.append(obs_avg - true_avg)

biases = np.array(biases)
print(f"=== Delisting Return Bias ===")
print(f"Mean bias per month: {biases.mean()*100:.4f}%")
print(f"Mean bias per year: {biases.mean()*1200:.2f}%")
print(f"Note: positive bias because negative delistings are more harmful than")
print(f"positive delistings are beneficial, and both are excluded")
```
7. Practical Strategies for Mitigating These Biases
7.1 Use Survivorship-Bias-Free Databases
| Database | Coverage | Survivorship-Free? | Cost |
|---|---|---|---|
| CRSP | US stocks (NYSE, AMEX, NASDAQ) | Yes (includes all delisted stocks) | Academic subscription |
| Compustat | Fundamental data (income, balance sheet) | Mostly (some historical gaps) | Academic subscription |
| Datastream | Global stocks, bonds, macro | Partially (dead stocks available) | Commercial |
| Bloomberg | Global (everything) | Yes (delisted securities available) | $$$$ |
| yfinance (free) | Current listings | No (major survivorship bias) | Free |
7.2 Point-in-Time Constituents
When studying an index like the S&P 500, always use the point-in-time constituent list — the list of stocks that were in the index at each historical date — rather than today’s list applied retroactively. This eliminates the most common form of survivorship bias in index studies.
```python
# Pseudocode for point-in-time analysis
# (Actual constituent data requires CRSP or a similar database)

def backtest_with_point_in_time(strategy, constituent_history, price_data):
    """
    Correct backtesting approach using point-in-time index membership.

    Parameters:
        strategy: function that selects stocks from the available universe
        constituent_history: dict mapping dates to lists of tickers
        price_data: historical prices for ALL stocks (including delisted)

    Returns:
        portfolio returns without survivorship bias
    """
    portfolio_returns = []
    for date in sorted(constituent_history.keys()):
        # Only consider stocks that were ACTUALLY in the index on this date
        available_stocks = constituent_history[date]

        # Apply the strategy to the historically correct universe
        selected = strategy(available_stocks, price_data, date)

        # Compute returns including any delistings
        ret = compute_returns(selected, price_data, date)
        portfolio_returns.append(ret)

    return portfolio_returns

# WRONG approach (survivorship bias):
# current_sp500 = get_current_sp500_tickers()  # Today's list
# historical_returns = get_returns(current_sp500, "2005-01-01", "2025-01-01")
# This backfills today's "winners" into the historical universe!
```
7.3 Sensitivity Analysis
```python
# When you cannot get survivorship-free data, quantify the potential bias
# by running a sensitivity analysis

def sensitivity_analysis(observed_returns, delisting_rates, delisting_returns):
    """
    Estimate the range of true returns given assumptions about survivorship bias.

    Parameters:
        observed_returns: average return from the survivor-only sample
        delisting_rates: list of assumed annual delisting rates to test
        delisting_returns: list of assumed average delisting returns to test
    """
    print(f"Observed (survivor-only) annualized return: {observed_returns:.2%}")
    print(f"\nSensitivity to survivorship bias assumptions:")
    print(f"{'Delist Rate':>12} {'Delist Return':>14} {'True Return':>14} {'Bias':>10}")
    print("-" * 55)
    for rate in delisting_rates:
        for delist_ret in delisting_returns:
            # Approximate: true_return = (1 - rate) * observed + rate * delist_return
            true_return = (1 - rate) * observed_returns + rate * delist_ret
            bias = observed_returns - true_return
            print(f"{rate:>11.1%} {delist_ret:>13.1%} {true_return:>13.2%} {bias:>9.2%}")

# Example: observed return is 10% annualized
sensitivity_analysis(
    observed_returns=0.10,
    delisting_rates=[0.02, 0.05, 0.08, 0.10],
    delisting_returns=[-0.30, -0.50, -1.00]
)
```
8. Comprehensive Simulation: Measuring the Full Effect
```python
# Monte Carlo simulation: full survivorship bias analysis
# We simulate a realistic stock universe and measure the bias
# from analyzing only survivors.
np.random.seed(2024)

n_simulations = 100
n_stocks = 1000
n_years = 20
n_months = n_years * 12

survivor_means = []
true_means = []

for sim in range(n_simulations):
    # Each stock: random drift and volatility
    mu = np.random.normal(0.005, 0.005, n_stocks)     # monthly drift
    sigma = np.random.uniform(0.03, 0.12, n_stocks)   # monthly vol

    # Simulate monthly log returns
    log_ret = np.zeros((n_months, n_stocks))
    for i in range(n_stocks):
        log_ret[:, i] = np.random.normal(mu[i], sigma[i], n_months)

    # Cumulative log returns
    cum_log_ret = np.cumsum(log_ret, axis=0)

    # Delisting rule: if cumulative return falls below -80%
    survived = np.ones(n_stocks, dtype=bool)
    for i in range(n_stocks):
        if np.any(cum_log_ret[:, i] < np.log(0.20)):  # 80% loss
            survived[i] = False

    # Average annualized return
    total_ret_all = cum_log_ret[-1, :].mean() / n_years
    total_ret_survivors = cum_log_ret[-1, survived].mean() / n_years

    true_means.append(total_ret_all)
    survivor_means.append(total_ret_survivors)

true_means = np.array(true_means)
survivor_means = np.array(survivor_means)
bias_distribution = survivor_means - true_means

print(f"=== Monte Carlo: Survivorship Bias Distribution ===")
print(f"Number of simulations: {n_simulations}")
print(f"Universe: {n_stocks} stocks, {n_years} years each")
print(f"\nTrue annualized return: {true_means.mean()*100:.2f}% +/- {true_means.std()*100:.2f}%")
print(f"Survivor annualized return: {survivor_means.mean()*100:.2f}% +/- {survivor_means.std()*100:.2f}%")
print(f"Average bias: {bias_distribution.mean()*100:.2f}%/year")
print(f"95% CI for bias: [{np.percentile(bias_distribution, 2.5)*100:.2f}%, {np.percentile(bias_distribution, 97.5)*100:.2f}%]")

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(bias_distribution * 100, bins=30, color='#e53e3e', alpha=0.7, edgecolor='white')
axes[0].axvline(x=bias_distribution.mean() * 100, color='black', linestyle='--',
                linewidth=2, label=f'Mean = {bias_distribution.mean()*100:.2f}%')
axes[0].set_xlabel('Survivorship Bias (%/year)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Survivorship Bias Across Simulations')
axes[0].legend()

axes[1].scatter(true_means * 100, survivor_means * 100, alpha=0.6, color='#3182ce', s=20)
axes[1].plot([-5, 15], [-5, 15], 'k--', linewidth=1, label='No bias line')
axes[1].set_xlabel('True Annualized Return (%)')
axes[1].set_ylabel('Survivor-Only Annualized Return (%)')
axes[1].set_title('True vs Survivor-Only Returns')
axes[1].legend()

plt.tight_layout()
plt.show()
```
9. Chapter Summary
| Concept | Statistical Framework | Practical Recommendation |
|---|---|---|
| Survivorship bias | MNAR selection bias; conditioning on outcome | Use survivorship-free databases (CRSP, Bloomberg) |
| Backfill bias | MNAR; voluntary entry with backfilled history | Drop first 12–24 months of each fund’s database history |
| Look-ahead bias | Data leakage; using future info in historical analysis | Use point-in-time data; avoid revised figures |
| Weekend/holiday gaps | Structural (deterministic) missing; not random | Use business-day index; do not impute |
| Cross-market gaps | Differential structural missingness | Use inner join for correlation; outer join with care |
| Stale prices | Measurement error / errors-in-variables | Use Scholes-Williams or Dimson beta adjustment |
| Delisting returns | MNAR; missing outcomes for failed entities | Use CRSP delisting returns; run sensitivity analysis |
In the next module, we will shift from data quality issues to modeling: how to estimate and forecast the volatility that we have seen is such a dominant feature of financial return data.