Module 05: Missing Data & Survivorship Bias
The data you do not see can bias everything you measure
1. Introduction: The Data You Do Not See
Every statistician knows that missing data can bias estimates. What makes financial data special is that observations go missing in ways directly correlated with the very quantity you are trying to measure. The companies that disappeared from the stock market are disproportionately the ones that failed. The hedge funds that stopped reporting are disproportionately the ones that performed poorly. And the trading strategies that never made it into textbooks are disproportionately the ones that did not work.
This module applies the statistical framework of missing data — MCAR, MAR, MNAR — to financial datasets, and quantifies how large the resulting biases can be.
2. Survivorship Bias: The Silent Killer of Backtests
2.1 What Is Survivorship Bias?
Imagine you want to evaluate the average performance of US stocks over the past 20 years. If you download today’s S&P 500 constituents and look up their historical returns, you are only looking at the 500 companies that are successful today. Companies that went bankrupt, were acquired at distressed prices, or shrank out of the index are excluded. Your sample is conditioned on the outcome variable — a textbook selection bias.
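The conditioning-on-the-outcome problem can be seen in a toy simulation (the numbers here are purely illustrative): draw final returns for a universe of firms, then keep only the firms whose value at least doubled, as if building a sample from today's index constituents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 20-year total returns for 10,000 firms (log scale)
total_log_returns = rng.normal(loc=0.5, scale=1.0, size=10_000)

# "Today's index" contains only firms whose value at least doubled,
# i.e. total log return above log(2) -- conditioning on the outcome
survivors = total_log_returns[total_log_returns > np.log(2)]

print(f"Full-universe mean log return: {total_log_returns.mean():.3f}")
print(f"Survivor-only mean log return: {survivors.mean():.3f}")
# The survivor-only mean is mechanically higher: we selected the sample
# on the very variable we are trying to estimate.
```

Nothing about the survivors' return process differed from the rest of the universe; the gap between the two means is pure selection bias.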
2.2 The Magnitude of the Bias
Academic research has quantified the survivorship bias in various contexts:
| Context | Estimated Bias | Source |
|---|---|---|
| US mutual fund average returns | +0.9% to +1.5% per year | Elton, Gruber, Blake (1996) |
| Hedge fund database returns | +1.4% to +3.6% per year | Malkiel, Saha (2005) |
| US stock universe average returns | +0.5% to +1.0% per year | Various studies using CRSP |
| International equity indices | +1.0% to +2.5% per year | Dimson, Marsh, Staunton (2002) |
2.3 A Concrete Example
Consider the Dow Jones Industrial Average. In 2005, the DJIA included General Motors (GM). By 2009, GM had filed for bankruptcy and was removed from the index. If you backtest a strategy on the “current DJIA constituents” going back to 2005, you would not include GM’s catastrophic decline, making the index appear to have performed better than it actually did.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulate the survivorship bias effect
# We will create a universe of 500 stocks and show what happens
# when we only analyze the survivors.
np.random.seed(42)

n_stocks = 500
n_years = 20
n_days = n_years * 252

# Each stock has a random drift and volatility
mu_annual = np.random.normal(0.07, 0.08, n_stocks)      # avg 7% with dispersion
sigma_annual = np.random.uniform(0.15, 0.50, n_stocks)  # 15-50% annual vol

# Convert to daily
mu_daily = mu_annual / 252
sigma_daily = sigma_annual / np.sqrt(252)

# Simulate price paths
all_log_returns = np.zeros((n_days, n_stocks))
for i in range(n_stocks):
    all_log_returns[:, i] = np.random.normal(mu_daily[i], sigma_daily[i], n_days)

# Cumulative returns (starting at $100)
log_prices = np.cumsum(all_log_returns, axis=0)
prices = 100 * np.exp(log_prices)

# Define "delisted" as price falling below $1 (penny stock threshold)
# Once delisted, the stock is gone forever
survived = np.ones(n_stocks, dtype=bool)
delisting_day = np.full(n_stocks, n_days)  # default: survived entire period

for i in range(n_stocks):
    below_threshold = np.where(prices[:, i] < 1.0)[0]
    if len(below_threshold) > 0:
        survived[i] = False
        delisting_day[i] = below_threshold[0]

n_survived = survived.sum()
n_failed = (~survived).sum()

print(f"Universe: {n_stocks} stocks over {n_years} years")
print(f"Survived: {n_survived} ({n_survived/n_stocks:.1%})")
print(f"Failed: {n_failed} ({n_failed/n_stocks:.1%})")
```
2.4 Quantifying the Bias
```python
# Compute annualized returns for each stock
# For failed stocks, use the return up to the delisting day
annualized_returns = np.zeros(n_stocks)
for i in range(n_stocks):
    end_day = delisting_day[i] if not survived[i] else n_days - 1
    if end_day > 0:
        total_log_ret = log_prices[end_day, i] - log_prices[0, i]
        years = end_day / 252
        annualized_returns[i] = total_log_ret / years
    else:
        annualized_returns[i] = -1.0  # total loss on day 1

# For failed stocks that we "can't see," assign -100% return
# (In reality, their final return is the loss at delisting)

# Compare survivor-only vs full universe
mean_survivor = annualized_returns[survived].mean()
mean_all = annualized_returns.mean()
bias = mean_survivor - mean_all

print(f"\n=== Survivorship Bias Quantification ===")
print(f"Mean annualized return (survivors only): {mean_survivor:.4f} ({mean_survivor*100:.2f}%)")
print(f"Mean annualized return (full universe): {mean_all:.4f} ({mean_all*100:.2f}%)")
print(f"Survivorship bias: {bias:.4f} ({bias*100:.2f}%/year)")
print(f"Over {n_years} years, this compounds to: {((1+mean_survivor)**n_years / (1+mean_all)**n_years - 1)*100:.1f}% difference")
```
2.5 Visualizing Survivorship Bias
```python
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Top-left: sample of surviving price paths
survivor_idx = np.where(survived)[0][:30]
for i in survivor_idx:
    axes[0, 0].plot(prices[:, i], alpha=0.4, linewidth=0.5, color='#38a169')
axes[0, 0].set_title('Survivors Only (What You See)')
axes[0, 0].set_ylabel('Price ($)')
axes[0, 0].set_yscale('log')

# Top-right: sample of ALL price paths (including failed)
all_idx = np.random.choice(n_stocks, 50, replace=False)
for i in all_idx:
    end = delisting_day[i] if not survived[i] else n_days
    color = '#38a169' if survived[i] else '#e53e3e'
    axes[0, 1].plot(prices[:end, i], alpha=0.4, linewidth=0.5, color=color)
axes[0, 1].set_title('Full Universe (What Actually Happened)')
axes[0, 1].set_ylabel('Price ($)')
axes[0, 1].set_yscale('log')

# Bottom-left: distribution of returns (survivors vs all)
axes[1, 0].hist(annualized_returns[survived] * 100, bins=40, alpha=0.6,
                color='#38a169', label='Survivors', density=True)
axes[1, 0].hist(annualized_returns * 100, bins=40, alpha=0.4,
                color='#e53e3e', label='All stocks', density=True)
axes[1, 0].axvline(x=mean_survivor * 100, color='#38a169', linestyle='--', linewidth=2)
axes[1, 0].axvline(x=mean_all * 100, color='#e53e3e', linestyle='--', linewidth=2)
axes[1, 0].set_xlabel('Annualized Return (%)')
axes[1, 0].set_title('Return Distributions')
axes[1, 0].legend()

# Bottom-right: cumulative average return over time
# Equal-weighted portfolio of survivors vs all
survivor_portfolio = np.zeros(n_days)
all_portfolio = np.zeros(n_days)
for t in range(n_days):
    active = [i for i in range(n_stocks) if delisting_day[i] > t]
    active_survivors = [i for i in range(n_stocks) if survived[i]]
    if len(active) > 0:
        all_portfolio[t] = all_log_returns[t, active].mean()
    if len(active_survivors) > 0:
        survivor_portfolio[t] = all_log_returns[t, active_survivors].mean()

cum_all = np.exp(np.cumsum(all_portfolio)) * 100
cum_survivor = np.exp(np.cumsum(survivor_portfolio)) * 100

axes[1, 1].plot(cum_survivor, color='#38a169', linewidth=1.5, label='Survivors only')
axes[1, 1].plot(cum_all, color='#e53e3e', linewidth=1.5, label='Full universe')
axes[1, 1].set_title('Cumulative Growth of $100')
axes[1, 1].set_ylabel('Portfolio Value ($)')
axes[1, 1].legend()

plt.tight_layout()
plt.show()
```
3. Hedge Fund Database Biases
3.1 Self-Selection Bias
Hedge funds are not required to report their returns to any database. Reporting is voluntary. This creates a self-selection problem: funds that choose to report may be systematically different from those that do not.
- Funds with strong recent performance are more likely to report (to attract new investors).
- Funds that are closing or performing poorly often stop reporting before they shut down.
- New funds may only begin reporting once they have established a good track record.
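The mechanism in these bullets can be sketched with a simulation (the reporting rule is an assumption for illustration, not an empirical model): every fund has an identical return process, but a fund publishes its month-t return only after observing it, and only if its trailing 12-month return is positive.

```python
import numpy as np

rng = np.random.default_rng(7)
n_funds, n_months = 2000, 60

# Every fund has the same true return process
true_rets = rng.normal(0.005, 0.03, size=(n_months, n_funds))

# Self-selection rule (assumed for illustration): a fund decides whether
# to report month t after seeing its return, reporting only if the
# trailing 12-month return (including month t) is positive
reported = np.full_like(true_rets, np.nan)
for t in range(11, n_months):
    trailing = true_rets[t-11:t+1].sum(axis=0)
    ok = trailing > 0
    reported[t, ok] = true_rets[t, ok]

true_mean = true_rets[11:].mean()
reported_mean = np.nanmean(reported[11:])
print(f"True mean monthly return:     {true_mean*100:.3f}%")
print(f"Reported mean monthly return: {reported_mean*100:.3f}%")
# The reported average is inflated even though every fund has the same
# expected return: missingness depends on the missing values (MNAR)
```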
3.2 Backfill (Instant History) Bias
Backfill bias arises when a fund joins a database and its entire prior track record is added (“backfilled”) retroactively. Because funds typically begin reporting only after a successful incubation period, the backfilled history is unrepresentatively good. Estimates suggest backfill bias inflates hedge fund database returns by approximately 1.4% per year on average. To mitigate this, some researchers discard the first 12–24 months of each fund’s history in the database.
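A minimal sketch of both the bias and the mitigation (all parameters here are illustrative assumptions): every fund has the same expected return, but only funds whose first 24 “incubation” months were profitable enter the database, bringing their backfilled history with them. Dropping the first 24 months of each fund’s database history removes the conditioning.

```python
import numpy as np

rng = np.random.default_rng(42)
n_funds, n_months = 1000, 72

# Identical true return process for every fund
rets = rng.normal(0.005, 0.03, size=(n_months, n_funds))

# Backfill rule (assumed for illustration): a fund joins the database
# after month 24 only if its incubation-period return was positive;
# its first 24 months are then backfilled into the database
incubation = rets[:24].sum(axis=0)
listed = incubation > 0

db_mean = rets[:, listed].mean()         # naive database average (with backfill)
trimmed_mean = rets[24:, listed].mean()  # after dropping backfilled history
true_mean = rets.mean()

print(f"True mean:                    {true_mean*100:.3f}%/month")
print(f"Database mean (backfilled):   {db_mean*100:.3f}%/month")
print(f"After dropping first 24 mo.:  {trimmed_mean*100:.3f}%/month")
```

The trimmed mean is unbiased because the post-incubation months are independent of the selection rule; the naive database mean is not.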
3.3 Look-Ahead Bias
Look-ahead bias occurs when your analysis uses information that was not available at the time. In finance, common examples include:
- Using revised GDP figures instead of the preliminary release that was actually available to traders
- Using restated earnings instead of the originally reported figures
- Selecting stocks based on future membership in an index (e.g., testing “S&P 500 stocks” using today’s constituents applied to historical data)
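A minimal illustration of the mechanism on simulated noise (not real data, and the trailing-return signal is a hypothetical example): a signal that accidentally includes the current day’s return looks spectacularly profitable, while the properly lagged version earns essentially nothing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
rets = pd.Series(rng.normal(0, 0.01, 2520))  # 10 years of pure-noise daily returns

# Signal: sign of the trailing 5-day return
signal = np.sign(rets.rolling(5).sum())

# WRONG: trade today on a signal whose window includes today's return,
# i.e. information not available at the time -- look-ahead bias
lookahead_pnl = (signal * rets).mean()

# RIGHT: lag the signal by one day so only past data is used
correct_pnl = (signal.shift(1) * rets).mean()

print(f"With look-ahead: {lookahead_pnl*25200:.2f}%/year")
print(f"Properly lagged: {correct_pnl*25200:.2f}%/year")
# The look-ahead version appears hugely profitable on returns that are,
# by construction, unpredictable
```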
3.4 Summary of Hedge Fund Biases
| Bias | Direction | Estimated Magnitude | Statistical Classification |
|---|---|---|---|
| Survivorship bias | Upward | +1.4% to +3.6%/year | MNAR (selection on outcome) |
| Backfill bias | Upward | +1.2% to +1.4%/year | MNAR (selection on past performance) |
| Self-selection bias | Upward | Varies | MNAR (voluntary reporting) |
| Look-ahead bias | Upward | Varies | Data leakage (not missing data per se) |
| Combined effect | Upward | +3% to +5%/year possible | — |
4. Missing Data Framework Applied to Finance
4.1 MCAR, MAR, and MNAR in Financial Context
Rubin (1976) classified missing data mechanisms into three categories. Here is how each manifests in financial data:
| Mechanism | Definition | Financial Example | Severity |
|---|---|---|---|
| MCAR (Missing Completely At Random) | P(missing) is independent of both observed and unobserved data | A data vendor’s server crashes randomly, causing occasional gaps in the feed | Benign: complete-case analysis is unbiased |
| MAR (Missing At Random) | P(missing) depends on observed data but not on the missing value itself | Small-cap stocks have more missing data because fewer analysts cover them (missingness depends on market cap, which we observe) | Moderate: can correct with proper imputation methods |
| MNAR (Missing Not At Random) | P(missing) depends on the missing value itself | A stock is missing because it went bankrupt (missingness depends on the return, which is −100%) | Severe: cannot be fully corrected without external information |
4.2 Testing the Missing Data Mechanism
```python
# Simulate and test whether missingness is related to the outcome
# This mimics a hedge fund database where poorly performing funds stop reporting
np.random.seed(123)

n_funds = 500
n_months = 120  # 10 years

# True monthly returns (normally distributed for simplicity)
true_returns = np.random.normal(0.005, 0.03, (n_months, n_funds))

# MNAR mechanism: funds with cumulative returns below -20% stop reporting
# (they shut down and are removed from the database)
observed_returns = true_returns.copy()
active = np.ones(n_funds, dtype=bool)
stop_month = np.full(n_funds, n_months)

for t in range(1, n_months):
    cum_ret = true_returns[:t+1].sum(axis=0)
    newly_dead = active & (cum_ret < -0.20)
    stop_month[newly_dead] = t
    active[newly_dead] = False
    observed_returns[t, ~active] = np.nan

# Compare complete-case vs true statistics
true_mean_monthly = np.nanmean(true_returns)
observed_mean_monthly = np.nanmean(observed_returns)
bias = observed_mean_monthly - true_mean_monthly

print(f"=== MNAR Simulation: Hedge Fund Database ===")
print(f"Funds that stopped reporting: {(~active).sum()} of {n_funds}")
print(f"True mean monthly return: {true_mean_monthly*100:.4f}%")
print(f"Observed mean monthly return: {observed_mean_monthly*100:.4f}%")
print(f"Bias: {bias*100:.4f}%/month = {bias*1200:.2f}%/year")

# Test whether missingness is related to returns (it is, by construction)
# For each fund, compute its average return before it stopped reporting
before_stop = []
for i in range(n_funds):
    if stop_month[i] < n_months:
        before_stop.append(true_returns[:stop_month[i], i].mean())

still_active = [true_returns[:, i].mean() for i in range(n_funds) if active[i]]

print(f"\nMean return of funds that stopped: {np.mean(before_stop)*100:.4f}%/month")
print(f"Mean return of surviving funds: {np.mean(still_active)*100:.4f}%/month")
print(f"This confirms MNAR: stopping is correlated with performance")
```
5. Weekend and Holiday Gaps: Not Really Missing Data
5.1 Structural vs Random Gaps
Financial time series have gaps on weekends and holidays. These are structural gaps — they are perfectly predictable and affect all stocks equally. They are not missing data in the Rubin sense because there is no underlying value that “should” have been observed.
Because these gaps are deterministic, the right approach is to index the data on a business-day calendar rather than to impute weekend values. pandas provides pd.bdate_range() and CustomBusinessDay for exactly this purpose.
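A short sketch of the pandas tools mentioned above (the two holiday dates are illustrative, not a full NYSE calendar):

```python
import pandas as pd

# Business days in January 2024 -- weekends are skipped automatically
bdays = pd.bdate_range("2024-01-01", "2024-01-31")
print(len(bdays))  # 23 weekdays

# A custom calendar that also skips market holidays (here, New Year's Day
# and Martin Luther King Jr. Day; a real calendar would list them all)
trading_days = pd.bdate_range(
    "2024-01-01", "2024-01-31", freq="C",
    holidays=[pd.Timestamp("2024-01-01"), pd.Timestamp("2024-01-15")],
)
print(len(trading_days))  # 21 trading days
```

Reindexing a price series onto such a calendar distinguishes a genuine data gap (a NaN on a trading day) from a structural one (a date that is simply absent from the index).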
5.2 Handling Holidays Across Markets
```python
import yfinance as yf
import pandas as pd

# US stock (NYSE) and Japanese stock (TSE) have different holidays
# auto_adjust=False keeps the "Adj Close" column used below
us_stock = yf.download("AAPL", start="2024-01-01", end="2024-03-01",
                       progress=False, auto_adjust=False)
jp_stock = yf.download("7203.T", start="2024-01-01", end="2024-03-01",
                       progress=False, auto_adjust=False)

print(f"AAPL trading days in Jan-Feb 2024: {len(us_stock)}")
print(f"Toyota trading days in Jan-Feb 2024: {len(jp_stock)}")

# Find dates where one market was open and the other closed
us_dates = set(us_stock.index)
jp_dates = set(jp_stock.index)
us_only = us_dates - jp_dates
jp_only = jp_dates - us_dates
print(f"\nDays only US was open: {len(us_only)}")
print(f"Days only Japan was open: {len(jp_only)}")

# When merging, you must decide: inner join (common days only)
# or outer join with NaN handling
inner = pd.merge(us_stock[["Adj Close"]], jp_stock[["Adj Close"]],
                 left_index=True, right_index=True,
                 suffixes=("_US", "_JP"), how="inner")
outer = pd.merge(us_stock[["Adj Close"]], jp_stock[["Adj Close"]],
                 left_index=True, right_index=True,
                 suffixes=("_US", "_JP"), how="outer")
print(f"\nInner join: {len(inner)} common trading days")
print(f"Outer join: {len(outer)} total days ({outer.isnull().sum().sum()} NaN values)")
```
5.3 Thin Trading and Stale Prices
Some assets — particularly small-cap stocks, corporate bonds, and illiquid securities — may not trade every day. When this happens, the reported “closing price” is typically the price of the last trade, which may have occurred hours or even days earlier. This is called a stale price.
Stale prices create artificial serial correlation in returns and artificial cross-correlation patterns. The Scholes-Williams and Dimson beta estimators were developed specifically to correct for this effect.
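The Dimson correction can be sketched on simulated data (the staleness mechanism below is a deliberately simple assumption: the observed return reflects the market move with a one-day delay half the time). The estimator regresses the stock’s return on the lead, contemporaneous, and lagged market returns and sums the slopes.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
market = rng.normal(0, 0.01, n)
true_beta = 1.2

# Thin trading (assumed mechanism): half of each market move shows up
# in the observed stock return only on the following day
noise = rng.normal(0, 0.005, n)
stock = true_beta * (0.5 * market + 0.5 * np.roll(market, 1)) + noise
stock[0] = 0.0  # discard the wrap-around artifact from np.roll

# Naive OLS beta is biased toward zero by the staleness
naive_beta = np.cov(stock, market)[0, 1] / np.var(market)

# Dimson beta: regress on market(t+1), market(t), market(t-1), sum the slopes
X = np.column_stack([
    np.ones(n - 2),
    market[2:],      # lead
    market[1:-1],    # contemporaneous
    market[:-2],     # lag
])
y = stock[1:-1]
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
dimson_beta = coefs[1:].sum()

print(f"True beta:   {true_beta:.2f}")
print(f"Naive beta:  {naive_beta:.2f}")
print(f"Dimson beta: {dimson_beta:.2f}")
```

The naive estimate recovers only the contemporaneous half of the exposure; summing the lead/lag slopes recovers the full beta.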
6. Delisted Stocks and What Happens to Their Data
6.1 Why Stocks Get Delisted
A stock is delisted when it is removed from an exchange. This can happen for several reasons, each with different implications for the data:
| Reason | Typical Final Return | Data Availability |
|---|---|---|
| Bankruptcy (Chapter 7) | −100% (total loss) | Often disappears from free databases |
| Bankruptcy (Chapter 11) | −80% to −100% | May have residual value; data often incomplete |
| Merger/Acquisition (premium) | +20% to +50% (takeover premium) | Data ends at acquisition date |
| Going private | +15% to +30% | Data ends at privatization |
| Price too low (exchange rules) | Varies (often negative) | May continue on OTC markets |
| Compliance violations | Varies | Data may be retroactively removed |
6.2 The Delisting Return Problem
When a stock is delisted, the delisting return — the return from the last traded price to the actual value received by shareholders — is often missing or difficult to determine. CRSP provides delisting returns for most US stocks, but many other databases do not.
```python
# Demonstrate the impact of missing delisting returns
np.random.seed(99)

n_sim = 10000
n_stocks_sim = 100

# Track the bias from ignoring vs including delisting returns
biases = []
for _ in range(n_sim):
    # Each stock has a 2% annual probability of delisting for negative reasons
    # and a 1% annual probability of delisting for positive reasons (acquisition)
    monthly_prob_neg = 1 - (1 - 0.02) ** (1/12)
    monthly_prob_pos = 1 - (1 - 0.01) ** (1/12)

    returns = np.random.normal(0.006, 0.04, n_stocks_sim)  # avg monthly return

    # Apply delisting
    delisted_neg = np.random.random(n_stocks_sim) < monthly_prob_neg
    delisted_pos = np.random.random(n_stocks_sim) < monthly_prob_pos

    # Negative delistings: replace return with -30% (typical bankruptcy return)
    returns[delisted_neg] = -0.30
    # Positive delistings: replace return with +25% (typical takeover premium)
    returns[delisted_pos & ~delisted_neg] = 0.25

    # True average includes delisting returns
    true_avg = returns.mean()

    # "Observed" average drops delisted stocks entirely
    observed = returns[~delisted_neg & ~delisted_pos]
    if len(observed) > 0:
        obs_avg = observed.mean()
        biases.append(obs_avg - true_avg)

biases = np.array(biases)
print(f"=== Delisting Return Bias ===")
print(f"Mean bias per month: {biases.mean()*100:.4f}%")
print(f"Mean bias per year: {biases.mean()*1200:.2f}%")
print(f"Note: positive bias because negative delistings are more harmful than")
print(f"positive delistings are beneficial, and both are excluded")
```
7. Practical Strategies for Mitigating These Biases
7.1 Use Survivorship-Bias-Free Databases
| Database | Coverage | Survivorship-Free? | Cost |
|---|---|---|---|
| CRSP | US stocks (NYSE, AMEX, NASDAQ) | Yes (includes all delisted stocks) | Academic subscription |
| Compustat | Fundamental data (income, balance sheet) | Mostly (some historical gaps) | Academic subscription |
| Datastream | Global stocks, bonds, macro | Partially (dead stocks available) | Commercial |
| Bloomberg | Global (everything) | Yes (delisted securities available) | $$$$ |
| yfinance (free) | Current listings | No (major survivorship bias) | Free |
7.2 Point-in-Time Constituents
When studying an index like the S&P 500, always use the point-in-time constituent list — the list of stocks that were in the index at each historical date — rather than today’s list applied retroactively. This eliminates the most common form of survivorship bias in index studies.
```python
# Pseudocode for point-in-time analysis
# (Actual constituent data requires CRSP or a similar database)

def backtest_with_point_in_time(strategy, constituent_history, price_data):
    """
    Correct backtesting approach using point-in-time index membership.

    Parameters:
        strategy: function that selects stocks from the available universe
        constituent_history: dict mapping dates to lists of tickers
        price_data: historical prices for ALL stocks (including delisted)

    Returns:
        portfolio returns without survivorship bias
    """
    portfolio_returns = []
    for date in sorted(constituent_history.keys()):
        # Only consider stocks that were ACTUALLY in the index on this date
        available_stocks = constituent_history[date]

        # Apply the strategy to the historically correct universe
        selected = strategy(available_stocks, price_data, date)

        # Compute returns including any delistings
        ret = compute_returns(selected, price_data, date)
        portfolio_returns.append(ret)

    return portfolio_returns

# WRONG approach (survivorship bias):
# current_sp500 = get_current_sp500_tickers()  # Today's list
# historical_returns = get_returns(current_sp500, "2005-01-01", "2025-01-01")
# This backfills today's "winners" into the historical universe!
```
7.3 Sensitivity Analysis
```python
# When you cannot get survivorship-free data, quantify the potential bias
# by running a sensitivity analysis

def sensitivity_analysis(observed_returns, delisting_rates, delisting_returns):
    """
    Estimate the range of true returns given assumptions about survivorship bias.

    Parameters:
        observed_returns: average return from the survivor-only sample
        delisting_rates: list of assumed annual delisting rates to test
        delisting_returns: list of assumed average delisting returns to test
    """
    print(f"Observed (survivor-only) annualized return: {observed_returns:.2%}")
    print(f"\nSensitivity to survivorship bias assumptions:")
    print(f"{'Delist Rate':>12} {'Delist Return':>14} {'True Return':>14} {'Bias':>10}")
    print("-" * 55)
    for rate in delisting_rates:
        for delist_ret in delisting_returns:
            # Approximate: true_return = (1 - rate) * observed + rate * delist_return
            true_return = (1 - rate) * observed_returns + rate * delist_ret
            bias = observed_returns - true_return
            print(f"{rate:>11.1%} {delist_ret:>13.1%} {true_return:>13.2%} {bias:>9.2%}")

# Example: observed return is 10% annualized
sensitivity_analysis(
    observed_returns=0.10,
    delisting_rates=[0.02, 0.05, 0.08, 0.10],
    delisting_returns=[-0.30, -0.50, -1.00]
)
```
8. Comprehensive Simulation: Measuring the Full Effect
```python
# Monte Carlo simulation: full survivorship bias analysis
# We simulate a realistic stock universe and measure the bias
# from analyzing only survivors.
np.random.seed(2024)

n_simulations = 100
n_stocks = 1000
n_years = 20
n_months = n_years * 12

survivor_means = []
true_means = []

for sim in range(n_simulations):
    # Each stock: random drift and volatility
    mu = np.random.normal(0.005, 0.005, n_stocks)     # monthly drift
    sigma = np.random.uniform(0.03, 0.12, n_stocks)   # monthly vol

    # Simulate monthly log returns
    log_ret = np.zeros((n_months, n_stocks))
    for i in range(n_stocks):
        log_ret[:, i] = np.random.normal(mu[i], sigma[i], n_months)

    # Cumulative log returns
    cum_log_ret = np.cumsum(log_ret, axis=0)

    # Delisting rule: if cumulative return falls below -80%
    survived = np.ones(n_stocks, dtype=bool)
    for i in range(n_stocks):
        if np.any(cum_log_ret[:, i] < np.log(0.20)):  # 80% loss
            survived[i] = False

    # Average annualized return
    total_ret_all = cum_log_ret[-1, :].mean() / n_years
    total_ret_survivors = cum_log_ret[-1, survived].mean() / n_years

    true_means.append(total_ret_all)
    survivor_means.append(total_ret_survivors)

true_means = np.array(true_means)
survivor_means = np.array(survivor_means)
bias_distribution = survivor_means - true_means

print(f"=== Monte Carlo: Survivorship Bias Distribution ===")
print(f"Number of simulations: {n_simulations}")
print(f"Universe: {n_stocks} stocks, {n_years} years each")
print(f"\nTrue annualized return: {true_means.mean()*100:.2f}% +/- {true_means.std()*100:.2f}%")
print(f"Survivor annualized return: {survivor_means.mean()*100:.2f}% +/- {survivor_means.std()*100:.2f}%")
print(f"Average bias: {bias_distribution.mean()*100:.2f}%/year")
print(f"95% CI for bias: [{np.percentile(bias_distribution, 2.5)*100:.2f}%, {np.percentile(bias_distribution, 97.5)*100:.2f}%]")

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(bias_distribution * 100, bins=30, color='#e53e3e', alpha=0.7, edgecolor='white')
axes[0].axvline(x=bias_distribution.mean() * 100, color='black', linestyle='--',
                linewidth=2, label=f'Mean = {bias_distribution.mean()*100:.2f}%')
axes[0].set_xlabel('Survivorship Bias (%/year)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Survivorship Bias Across Simulations')
axes[0].legend()

axes[1].scatter(true_means * 100, survivor_means * 100, alpha=0.6, color='#3182ce', s=20)
axes[1].plot([-5, 15], [-5, 15], 'k--', linewidth=1, label='No bias line')
axes[1].set_xlabel('True Annualized Return (%)')
axes[1].set_ylabel('Survivor-Only Annualized Return (%)')
axes[1].set_title('True vs Survivor-Only Returns')
axes[1].legend()

plt.tight_layout()
plt.show()
```
9. Chapter Summary
| Concept | Statistical Framework | Practical Recommendation |
|---|---|---|
| Survivorship bias | MNAR selection bias; conditioning on outcome | Use survivorship-free databases (CRSP, Bloomberg) |
| Backfill bias | MNAR; voluntary entry with backfilled history | Drop first 12–24 months of each fund’s database history |
| Look-ahead bias | Data leakage; using future info in historical analysis | Use point-in-time data; avoid revised figures |
| Weekend/holiday gaps | Structural (deterministic) missing; not random | Use business-day index; do not impute |
| Cross-market gaps | Differential structural missingness | Use inner join for correlation; outer join with care |
| Stale prices | Measurement error / errors-in-variables | Use Scholes-Williams or Dimson beta adjustment |
| Delisting returns | MNAR; missing outcomes for failed entities | Use CRSP delisting returns; run sensitivity analysis |
In the next module, we will shift from data quality issues to modeling: how to estimate and forecast the volatility that we have seen is such a dominant feature of financial return data.