Learn Without Walls

Module 03: Fat Tails & Stylized Facts

Universal patterns in financial returns that every statistician should know


1. Introduction: Universal Patterns in Financial Returns

In Module 02 we established that returns, not prices, are the proper object of statistical analysis. We also hinted that returns are not normally distributed. This module dives deep into how returns deviate from normality and introduces the stylized facts — a set of empirical regularities that appear across nearly all financial assets, markets, and time periods.

Finance Term
Stylized Facts: Empirical regularities observed so consistently across different markets, asset classes, and historical periods that they are considered universal properties of financial returns. Any realistic model of financial data should reproduce these patterns. The term was popularized by Cont (2001) in a seminal review paper.

As a statistician, you will recognize these as distributional properties and serial dependence structures. What makes them remarkable is their universality: the same patterns appear in equities, currencies, commodities, and interest rates, across New York, London, Tokyo, and Mumbai, from the 1920s to today.

Stats Bridge
Think of stylized facts as the financial equivalent of the empirical laws you encounter in other fields: Zipf’s law in linguistics, Benford’s law in accounting, or Taylor’s law in ecology. They are not derived from first principles but are so robustly observed that they serve as constraints on any credible model.

2. Heavy Tails: Extreme Events Are Not Rare

2.1 Kurtosis: Measuring Tail Heaviness

The kurtosis of a distribution measures the weight of its tails relative to the normal distribution. The normal distribution has a kurtosis of 3 (or an excess kurtosis of 0). Financial returns typically have excess kurtosis between 5 and 50 for daily data.

Excess Kurtosis = E[(X − μ)⁴] / σ⁴ − 3

An excess kurtosis of 10 does not mean the distribution is “slightly” non-normal. It means the probability of a 5-sigma event is orders of magnitude larger than the normal distribution predicts.
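To make "orders of magnitude" concrete, here is a quick comparison of tail probabilities under a normal distribution and a unit-variance Student-t with ν = 4 (an illustrative value in the typical fitted range):

```python
from scipy import stats

nu = 4
# Rescale the t so both distributions have unit variance: Var(t_nu) = nu/(nu-2)
t_scale = (nu / (nu - 2)) ** -0.5

for k in [3, 4, 5]:
    p_norm = stats.norm.sf(k)             # P(X > k sigma) under the normal
    p_t = stats.t.sf(k / t_scale, df=nu)  # same event under the unit-variance t(4)
    print(f"{k}-sigma: normal={p_norm:.2e}, t(4)={p_t:.2e}, ratio={p_t / p_norm:,.0f}x")
```

The ratio grows without bound as you move further into the tail, which is the defining property of heavy tails.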

import yfinance as yf
import numpy as np
import pandas as pd
import scipy.stats as stats

# Download several assets to show universality
tickers = {
    "S&P 500": "^GSPC",
    "Apple": "AAPL",
    "Euro/USD": "EURUSD=X",
    "Gold": "GC=F",
    "10Y Treasury": "^TNX",
    "Bitcoin": "BTC-USD",
}

results = []
for name, ticker in tickers.items():
    data = yf.download(ticker, start="2015-01-01", end="2025-01-01",
                       auto_adjust=False, progress=False)
    # auto_adjust=False keeps the "Adj Close" column (newer yfinance versions adjust
    # in place by default); squeeze() guards against MultiIndex columns
    prices = data["Adj Close"].squeeze()
    ret = np.log(prices / prices.shift(1)).dropna()
    results.append({
        "Asset": name,
        "N": len(ret),
        "Mean (bps)": ret.mean() * 10000,
        "Std (%)": ret.std() * 100,
        "Skewness": ret.skew(),
        "Excess Kurtosis": ret.kurtosis(),
        "Min (%)": ret.min() * 100,
        "Max (%)": ret.max() * 100,
    })

df = pd.DataFrame(results)
print(df.to_string(index=False))
# Notice: EVERY asset has excess kurtosis >> 0

2.2 How Much More Likely Are Extreme Events?

Under a normal distribution, a daily return exceeding 4 standard deviations should occur roughly once every 126 years. In practice, it happens roughly once or twice per year. This table quantifies the discrepancy:

| Event Size | Normal Prob (one-tail) | Expected Freq (daily) | Actual Freq (S&P 500) | Ratio |
| 3σ move | 0.135% | Once per ~3 years | Several per year | ~5–10x |
| 4σ move | 0.003% | Once per ~126 years | Once per 1–2 years | ~100x |
| 5σ move | 0.00003% | Once per ~13,800 years | Once per 5–10 years | ~1,000x |
| 6σ move | ~10⁻⁹ | Once per ~4 million years | Several times in 100 years | ~10,000x+ |
Key Insight
The Black Monday crash of October 19, 1987 was a 22-sigma event under the normal distribution. The probability of a 22-sigma event is approximately 10⁻¹⁰⁷. There are only about 10¹⁷ seconds in the age of the universe. The normal distribution does not merely underestimate tail risk — it misses it by a factor that is cosmologically absurd.
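The 10⁻¹⁰⁷ figure can be checked with the log survival function, since the plain survival function underflows at this magnitude:

```python
import numpy as np
from scipy import stats

# log10 of the one-tailed normal probability of a 22-sigma move;
# logsf avoids the underflow that stats.norm.sf(22) would produce
log10_p = stats.norm.logsf(22) / np.log(10)
print(f"P(Z > 22) = 10^{log10_p:.1f}")
```

This prints an exponent near −107, confirming the figure above.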

2.3 Comparing to the Student-t Distribution

A much better model for the unconditional distribution of returns is the Student-t distribution with low degrees of freedom (ν ≈ 3–7). This naturally produces heavier tails.

import matplotlib.pyplot as plt

# Fit a t-distribution to AAPL log returns
# Fit a t-distribution to AAPL log returns
aapl = yf.download("AAPL", start="2015-01-01", end="2025-01-01",
                   auto_adjust=False, progress=False)  # keep the "Adj Close" column
prices = aapl["Adj Close"].squeeze()
log_ret = np.log(prices / prices.shift(1)).dropna()

# Fit Student-t (returns df, loc, scale)
t_params = stats.t.fit(log_ret)
print(f"Fitted t-distribution: df={t_params[0]:.2f}, loc={t_params[1]:.6f}, scale={t_params[2]:.6f}")

# Compare fits visually
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

x = np.linspace(log_ret.min(), log_ret.max(), 500)

# Linear scale
axes[0].hist(log_ret, bins=150, density=True, alpha=0.6, color='#3182ce', label='Data')
axes[0].plot(x, stats.norm.pdf(x, log_ret.mean(), log_ret.std()),
            'r-', lw=2, label='Normal')
axes[0].plot(x, stats.t.pdf(x, *t_params),
            'g-', lw=2, label=f'Student-t (df={t_params[0]:.1f})')
axes[0].set_title('Density Comparison (Linear Scale)')
axes[0].legend()

# Log scale — crucial for seeing tail behavior
axes[1].hist(log_ret, bins=200, density=True, alpha=0.6, color='#3182ce', label='Data')
axes[1].plot(x, stats.norm.pdf(x, log_ret.mean(), log_ret.std()),
            'r-', lw=2, label='Normal')
axes[1].plot(x, stats.t.pdf(x, *t_params),
            'g-', lw=2, label=f'Student-t (df={t_params[0]:.1f})')
axes[1].set_yscale('log')
axes[1].set_title('Density Comparison (Log Scale — Shows Tails)')
axes[1].legend()

plt.tight_layout()
plt.show()
Stats Bridge
The Student-t distribution is a natural choice because you already know it from hypothesis testing with small samples. When ν → ∞, the t-distribution converges to the normal. When ν is small (3–7), the tails are much heavier. A t-distribution with ν ≤ 4 has an infinite fourth moment, so the sample kurtosis does not converge to any finite value — another reason why moment-based tests can behave oddly with financial data.
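A quick Monte Carlo illustrates the point: for t-distributed data with ν = 4, the sample excess kurtosis never settles down, no matter how large the sample. The repetition count and sample size below are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Sample excess kurtosis of t(4) draws across independent repetitions:
# because the population fourth moment is infinite, the estimates do not
# concentrate around any finite value
estimates = []
for _ in range(4):
    sample = rng.standard_t(df=4, size=100_000)
    k = stats.kurtosis(sample)
    estimates.append(k)
    print(f"n=100,000 sample excess kurtosis: {k:8.2f}")
```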

3. QQ Plots: The Visual Signature of Fat Tails

3.1 Reading a QQ Plot

The quantile-quantile (QQ) plot is the single most informative diagnostic for distributional assumptions. It plots the quantiles of your data against the quantiles of a theoretical distribution. Points on the 45-degree line mean perfect agreement.
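Mechanically, a QQ plot is nothing more than sorted data plotted against theoretical quantiles evaluated at plotting positions. A minimal hand-rolled construction, run on simulated t-distributed data purely to show the mechanics:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.sort(rng.standard_t(df=4, size=2000))   # heavy-tailed sample, sorted

# Plotting positions (i - 0.5)/n, then the theoretical normal quantiles there
p = (np.arange(1, len(x) + 1) - 0.5) / len(x)
q_norm = stats.norm.ppf(p)

# On the 45-degree line, empirical and theoretical quantiles match; for
# heavy-tailed data the extreme empirical quantiles run far outside
print("largest data point:", x[-1].round(2), " vs normal quantile:", q_norm[-1].round(2))
```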

3.2 QQ Plots Against Different Reference Distributions

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# 1. QQ plot against Normal
stats.probplot(log_ret, dist="norm", plot=axes[0])
axes[0].set_title('QQ Plot: Data vs Normal')
axes[0].get_lines()[0].set_markerfacecolor('#3182ce')
axes[0].get_lines()[0].set_markersize(2)

# 2. QQ plot against t-distribution with fitted df
stats.probplot(log_ret, dist=stats.t, sparams=(t_params[0],), plot=axes[1])
axes[1].set_title(f'QQ Plot: Data vs t(df={t_params[0]:.1f})')
axes[1].get_lines()[0].set_markerfacecolor('#38a169')
axes[1].get_lines()[0].set_markersize(2)

# 3. QQ plot against Laplace (double exponential)
stats.probplot(log_ret, dist=stats.laplace, plot=axes[2])
axes[2].set_title('QQ Plot: Data vs Laplace')
axes[2].get_lines()[0].set_markerfacecolor('#d69e2e')
axes[2].get_lines()[0].set_markersize(2)

plt.tight_layout()
plt.show()

# The normal QQ will show clear S-shaped deviation
# The t-distribution QQ will be much closer to the diagonal
# The Laplace may overshoot the tails slightly
Key Insight
The QQ plot against the normal distribution is the “classic financial plot.” It is included in virtually every empirical finance paper and every risk management report. The S-shaped deviation in the tails is so characteristic that an experienced analyst can identify financial return data from the QQ plot alone. If your QQ plot does not show this pattern, double-check your data.

4. Volatility Clustering: Large Moves Follow Large Moves

4.1 The Phenomenon

If you plot the time series of daily returns, you will notice something striking: large moves (positive or negative) tend to cluster together, and calm periods tend to cluster together. This is called volatility clustering.

Mandelbrot described it: large changes tend to be followed by large changes — of either sign — and small changes tend to be followed by small changes.

Stats Bridge
Volatility clustering means the returns process is conditionally heteroscedastic. While the unconditional variance may be constant (stationary returns), the conditional variance — the variance at time t given information up to t−1 — varies over time. This is exactly the kind of structure that ARCH/GARCH models are designed to capture. In statistical terms, the series is a martingale difference sequence but not an i.i.d. sequence: E[r_t | F_{t−1}] ≈ 0, but Var(r_t | F_{t−1}) is time-varying.
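A small simulation makes this concrete. The model below is a toy stochastic-volatility process, r_t = σ_t·z_t with log σ_t following a persistent AR(1); all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

# Toy stochastic-volatility model: log sigma follows a persistent AR(1)
log_sigma = np.zeros(n)
for t in range(1, n):
    log_sigma[t] = 0.98 * log_sigma[t - 1] + 0.1 * rng.standard_normal()
r = np.exp(log_sigma) * rng.standard_normal(n)   # r_t = sigma_t * z_t

def acf1(x):
    # Lag-1 sample autocorrelation
    x = x - x.mean()
    return (x[:-1] * x[1:]).mean() / x.var()

print(f"lag-1 ACF of r:   {acf1(r):+.3f}")        # near zero
print(f"lag-1 ACF of |r|: {acf1(np.abs(r)):+.3f}")  # clearly positive
```

The returns are serially uncorrelated even though they are far from independent: the dependence lives entirely in the volatility.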

4.2 Visualizing Volatility Clustering

fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)

# Panel 1: Returns
axes[0].plot(log_ret.index, log_ret.values, color='#1a365d', linewidth=0.4)
axes[0].set_ylabel('Log Return')
axes[0].set_title('AAPL: Returns and Volatility Clustering')
axes[0].axhline(y=0, color='gray', linestyle='--', linewidth=0.5)

# Panel 2: Absolute returns (proxy for volatility)
axes[1].plot(log_ret.index, log_ret.abs().values, color='#e53e3e', linewidth=0.4)
axes[1].set_ylabel('|Log Return|')
axes[1].set_title('Absolute Returns (Volatility Proxy)')

# Panel 3: Squared returns (another volatility proxy)
axes[2].plot(log_ret.index, (log_ret ** 2).values, color='#d69e2e', linewidth=0.4)
axes[2].set_ylabel('Return²')
axes[2].set_title('Squared Returns (Variance Proxy)')

plt.tight_layout()
plt.show()

# The clustering in panels 2 and 3 is visually obvious

4.3 ARCH Effects: Testing for Volatility Clustering

The Engle ARCH test formally tests whether squared residuals exhibit serial correlation — the statistical signature of volatility clustering.

from statsmodels.stats.diagnostic import het_arch

# Engle's ARCH test on residuals (or demeaned returns)
demeaned = log_ret - log_ret.mean()
arch_test = het_arch(demeaned, nlags=10)

print("=== Engle's ARCH Test ===")
print(f"LM statistic: {arch_test[0]:.4f}")
print(f"p-value:      {arch_test[1]:.2e}")
print(f"F-statistic:  {arch_test[2]:.4f}")
print(f"F p-value:    {arch_test[3]:.2e}")
print(f"Conclusion:   {'ARCH effects present' if arch_test[1] < 0.05 else 'No ARCH effects'}")
# Will almost certainly show significant ARCH effects
Common Pitfall
Volatility clustering means that standard OLS standard errors in financial return regressions are wrong, even if the returns themselves are stationary. The conditional heteroscedasticity violates the homoscedasticity assumption. You need heteroscedasticity-robust (White) or HAC (Newey-West) standard errors, or explicit GARCH modeling, to get correct inference.

5. Autocorrelation: Returns vs Absolute Returns

5.1 The Key Stylized Fact

Returns themselves show little or no serial correlation. This is consistent with weak-form market efficiency. However, absolute returns and squared returns show strong, persistent positive autocorrelation that decays slowly. This is the autocorrelation signature of volatility clustering.

Stats Bridge
This dual autocorrelation structure is the hallmark of a white noise process viewed through a stochastic volatility filter. If r_t = σ_t · z_t where z_t is i.i.d. and σ_t is a slowly varying process, then r_t itself is serially uncorrelated, while |r_t| and r_t² inherit the persistence of σ_t. This is precisely the structure that GARCH(1,1) and stochastic volatility models are designed to capture.

5.2 Autocorrelation Plots

from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.stattools import acf

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# ACF of returns
plot_acf(log_ret, lags=50, ax=axes[0, 0], title='ACF of Returns',
         color='#1a365d', alpha=0.05)

# ACF of absolute returns
plot_acf(log_ret.abs(), lags=50, ax=axes[0, 1], title='ACF of |Returns|',
         color='#e53e3e', alpha=0.05)

# ACF of squared returns
plot_acf(log_ret ** 2, lags=50, ax=axes[1, 0], title='ACF of Returns²',
         color='#d69e2e', alpha=0.05)

# Compare ACF decay: absolute vs squared
acf_abs = acf(log_ret.abs(), nlags=100)
acf_sq = acf(log_ret ** 2, nlags=100)
axes[1, 1].plot(acf_abs, label='|Returns|', color='#e53e3e')
axes[1, 1].plot(acf_sq, label='Returns²', color='#d69e2e')
axes[1, 1].set_title('ACF Comparison: |r| vs r²')
axes[1, 1].set_xlabel('Lag')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

# Key observation: |returns| has HIGHER autocorrelation than returns²
# This is called the "Taylor effect" — another stylized fact

5.3 The Ljung-Box Test for Serial Correlation

from statsmodels.stats.diagnostic import acorr_ljungbox

# Test serial correlation in returns
lb_returns = acorr_ljungbox(log_ret, lags=[5, 10, 20], return_df=True)
print("=== Ljung-Box Test: Returns ===")
print(lb_returns)

print()

# Test serial correlation in squared returns
lb_squared = acorr_ljungbox(log_ret ** 2, lags=[5, 10, 20], return_df=True)
print("=== Ljung-Box Test: Squared Returns ===")
print(lb_squared)

# Returns: may or may not show significance
# Squared returns: will almost certainly show HIGHLY significant autocorrelation

5.4 Long Memory in Volatility

The autocorrelation of absolute returns decays very slowly — much slower than the exponential decay of a GARCH(1,1) model. This is evidence of long memory in volatility, sometimes modeled with fractional integration (FIGARCH) or hyperbolic decay functions.
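The practical difference between the two decay shapes is stark at long lags; the parameter values below are purely illustrative:

```python
import numpy as np

lags = np.array([1, 10, 50, 100])

acf_exp = 0.3 * 0.95 ** lags   # GARCH-style exponential decay (illustrative)
acf_hyp = 0.3 * lags ** -0.3   # hyperbolic (power-law) decay, as seen in |returns|

for k, e, h in zip(lags, acf_exp, acf_hyp):
    print(f"lag {k:3d}: exponential {e:.4f}   hyperbolic {h:.4f}")
# At lag 100 the exponential ACF is essentially zero while the
# hyperbolic ACF is still around 0.08
```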

Key Insight
The autocorrelation of |rt| remains statistically significant at lags of 100 or more trading days — that is nearly half a year. This persistence means that a period of high volatility reliably predicts continued high volatility for months, not just days. This has profound implications for risk management: if you are in a high-volatility regime, you should expect it to persist.

6. Formal Normality Tests for Financial Data

6.1 Overview of Tests

| Test | Tests What | Best For | Limitations |
| Jarque-Bera | Skewness + kurtosis jointly | Large samples, financial data | Only uses 3rd and 4th moments |
| Shapiro-Wilk | Overall distributional shape | Small to medium samples (n < 5000) | Not valid for n > 5000 |
| Anderson-Darling | Tail behavior (more weight on tails) | Detecting tail deviations | More conservative than other tests |
| Lilliefors | Overall CDF comparison | When parameters are estimated | Less powerful than AD for tails |
| D’Agostino-Pearson | Skewness + kurtosis omnibus | General purpose | Requires n > 20 |

6.2 Running All Tests

# Comprehensive normality testing
print("=== Normality Tests on AAPL Log Returns ===")
print(f"Sample size: {len(log_ret)}\n")

# 1. Jarque-Bera (ideal for financial data)
jb_stat, jb_p = stats.jarque_bera(log_ret)
print(f"Jarque-Bera:      stat={jb_stat:>10.2f}, p={jb_p:.2e}")

# 2. D'Agostino-Pearson omnibus
dp_stat, dp_p = stats.normaltest(log_ret)
print(f"D'Agostino-K2:    stat={dp_stat:>10.2f}, p={dp_p:.2e}")

# 3. Anderson-Darling
ad_result = stats.anderson(log_ret, dist='norm')
print(f"Anderson-Darling: stat={ad_result.statistic:>10.4f}")
for sl, cv in zip(ad_result.significance_level, ad_result.critical_values):
    reject = "REJECT" if ad_result.statistic > cv else "fail to reject"
    print(f"  At {sl}% level: critical={cv:.4f} -> {reject}")

# 4. Shapiro-Wilk (subsample for large n)
sw_sample = log_ret.sample(min(5000, len(log_ret)), random_state=42)
sw_stat, sw_p = stats.shapiro(sw_sample)
print(f"Shapiro-Wilk:     stat={sw_stat:>10.6f}, p={sw_p:.2e} (n={len(sw_sample)})")

# 5. Lilliefors (KS test with estimated parameters)
from statsmodels.stats.diagnostic import lilliefors
lf_stat, lf_p = lilliefors(log_ret, dist='norm')
print(f"Lilliefors:       stat={lf_stat:>10.6f}, p={lf_p:.4f}")

print("\nAll tests reject normality. This is expected: for daily returns, non-normality is itself a stylized fact.")
Common Pitfall
Do not use the standard Kolmogorov-Smirnov test (stats.kstest) for normality testing when you estimate μ and σ from the data. The KS test assumes the parameters are known a priori; estimating them from the same sample makes the test overly conservative (reduced power). Use the Lilliefors modification instead, which corrects for parameter estimation.
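A small simulation demonstrates the pitfall. With truly normal data and parameters estimated from the sample, the naive KS test almost never rejects at the nominal 5% level, while Lilliefors rejects at roughly the advertised rate. The repetition count and sample size are arbitrary.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(0)

# Simulate truly normal data, then test with parameters ESTIMATED from the sample.
# Under H0, p-values should be uniform; KS with plugged-in estimates is not.
ks_p, lf_p = [], []
for _ in range(200):
    x = rng.normal(size=500)
    ks_p.append(stats.kstest(x, 'norm', args=(x.mean(), x.std())).pvalue)
    lf_p.append(lilliefors(x, dist='norm')[1])

ks_rate = np.mean(np.array(ks_p) < 0.05)
lf_rate = np.mean(np.array(lf_p) < 0.05)
print(f"KS rejection rate at 5%:         {ks_rate:.3f}")   # far below 0.05
print(f"Lilliefors rejection rate at 5%: {lf_rate:.3f}")   # close to 0.05
```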

7. The Leverage Effect: Asymmetric Volatility

7.1 What Is the Leverage Effect?

The leverage effect is the empirical observation that negative returns tend to increase future volatility more than positive returns of the same magnitude. When a stock drops 5%, subsequent volatility tends to be higher than after a 5% gain.

Finance Term
Leverage Effect: The asymmetric relationship between returns and volatility. Named because a decline in stock price increases the firm’s financial leverage (debt-to-equity ratio), theoretically making the stock riskier. Whether this financial explanation is correct remains debated, but the empirical pattern is robust.

7.2 Measuring the Leverage Effect

# Cross-correlation between returns and future squared returns
# Negative returns at time t should predict higher squared returns at t+1, t+2, ...

max_lag = 20
leverage_corr = []
for k in range(1, max_lag + 1):
    corr = log_ret.corr(log_ret.shift(-k) ** 2)
    leverage_corr.append(corr)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: cross-correlation (leverage effect)
axes[0].bar(range(1, max_lag + 1), leverage_corr, color='#e53e3e')
axes[0].set_xlabel('Lag k (days ahead)')
axes[0].set_ylabel('Corr(r_t, r²_{t+k})')
axes[0].set_title('Leverage Effect: Return vs Future Volatility')
axes[0].axhline(y=0, color='gray', linestyle='--')

# Right: scatter of returns vs next-day absolute returns
axes[1].scatter(log_ret.iloc[:-1].values, log_ret.abs().iloc[1:].values,
               alpha=0.15, s=5, color='#1a365d')
axes[1].set_xlabel('Return today')
axes[1].set_ylabel('|Return| tomorrow')
axes[1].set_title('Asymmetric Volatility Response')

# Add binned means to show the asymmetry
bins = pd.qcut(log_ret.iloc[:-1], 20)
binned_mean = log_ret.abs().iloc[1:].groupby(bins.values).mean()
bin_centers = [interval.mid for interval in binned_mean.index]
axes[1].plot(bin_centers, binned_mean.values, 'r-o', linewidth=2, markersize=4)

plt.tight_layout()
plt.show()
# The cross-correlations should be negative (negative return -> higher future vol)
# The scatter should show a "smirk" — higher abs returns following negative returns
Stats Bridge
The leverage effect introduces asymmetry in the volatility dynamics. In GARCH terms, a standard GARCH(1,1) model treats positive and negative shocks symmetrically. The GJR-GARCH and EGARCH models extend this to allow different volatility responses to positive vs negative returns. This is analogous to modeling heterogeneous treatment effects in clinical trials: the “treatment” (a price shock) has different effects depending on its sign.

8. Summary: The Complete List of Stylized Facts

Here is the canonical list of stylized facts, each mapped to its statistical interpretation:

| # | Stylized Fact | Statistical Description | Implication |
| 1 | Heavy tails | Excess kurtosis >> 0; power-law tails | Normal distribution underestimates extreme events |
| 2 | Absence of return autocorrelation | ACF(r_t) ≈ 0 for lags k ≥ 1 | Returns are approximately unpredictable |
| 3 | Volatility clustering | ACF(|r_t|) > 0, slowly decaying | Conditional heteroscedasticity; need GARCH |
| 4 | Leverage effect | Corr(r_t, σ²_{t+1}) < 0 | Negative returns increase volatility more |
| 5 | Volume-volatility correlation | Corr(V_t, |r_t|) > 0 | High volume accompanies large price moves |
| 6 | Aggregational Gaussianity | Kurtosis decreases at longer horizons | CLT applies; monthly returns are more normal than daily |
| 7 | Taylor effect | ACF(|r_t|) > ACF(r_t²) | Absolute returns are more predictable than squared |
| 8 | Gain/loss asymmetry | Negative skewness in stock returns | Large drops more likely than equivalent large gains |
| 9 | Long memory in volatility | ACF of |r_t| decays hyperbolically, not exponentially | GARCH(1,1) may be insufficient; consider FIGARCH |
Key Insight
Any model you build for financial returns should reproduce at least stylized facts 1–4. A model that assumes i.i.d. normal returns violates all of them. The GARCH(1,1) model with t-distributed errors captures facts 1–3. Adding an asymmetric component (GJR-GARCH or EGARCH) captures fact 4 as well. This is why GARCH models are the workhorses of financial econometrics.
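Stylized fact 6 (aggregational Gaussianity) is easy to reproduce in simulation: aggregate heavy-tailed daily returns over longer horizons and watch the excess kurtosis shrink toward zero. The t(5) choice and horizon lengths below are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
daily = rng.standard_t(df=5, size=252 * 200)   # ~200 years of simulated daily returns

kurts = {}
for horizon in [1, 5, 21, 63]:
    n = len(daily) // horizon
    agg = daily[: n * horizon].reshape(n, horizon).sum(axis=1)   # h-day returns
    kurts[horizon] = stats.kurtosis(agg)
    print(f"{horizon:3d}-day returns: excess kurtosis = {kurts[horizon]:6.2f}")
# Kurtosis shrinks toward 0 (normality) as the horizon grows: the CLT at work
```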

In the next module, we will turn to the multivariate structure of financial data: how correlations between assets behave, why they break down during crises, and why Pearson correlation is often the wrong tool for the job.