Module 01: From Datasets to Markets
Mapping the financial data landscape onto the statistical concepts you already know
1. Introduction: Your First Financial Dataset
As a statistician, you have spent years working with datasets from clinical trials, surveys, experiments, and simulations. Financial data shares many of the same structures you already know — tabular observations, time-indexed measurements, missing values — but it comes with its own vocabulary, conventions, and quirks that can trip up even experienced analysts.
This module maps the financial data landscape onto the statistical concepts you already understand. By the end, you will be able to pull live market data in Python, understand what every column means, and recognize the structural differences between financial time series and the kinds of data you have analyzed before.
Consider AAPL: it has daily price observations stretching back decades. The “study” is the market itself, an observational study with no control group.
We will move through six major topics: what tickers are and how they work, the OHLCV data format, the peculiar time structure of market data, where to get financial data programmatically, the different types of financial datasets, and a hands-on walkthrough of downloading and exploring real stock data.
2. Tickers: The Naming Convention of Finance
2.1 What Is a Ticker?
A ticker symbol (or just “ticker”) is a short alphanumeric code assigned to a publicly traded security. It is the financial world’s equivalent of a variable name or a primary key in a database.
Examples: AAPL (Apple Inc.), MSFT (Microsoft), TSLA (Tesla). Think of a ticker as the subject_id column in your panel data.
2.2 Ticker Naming Conventions
Different exchanges and asset classes follow different naming patterns. Here is a reference table:
| Asset Class | Example Ticker | Meaning | Statistical Analogue |
|---|---|---|---|
| US Stock | AAPL | Apple Inc. on NASDAQ | A single subject in a panel study |
| US Index | ^GSPC | S&P 500 Index | A population-level aggregate statistic |
| Futures Contract | GC=F | Gold futures (front month) | A derived variable with an expiration date |
| Currency Pair | EURUSD=X | Euro to US Dollar exchange rate | A ratio of two measurements |
| Cryptocurrency | BTC-USD | Bitcoin priced in USD | A 24/7 observed process (no gaps) |
| ETF | SPY | S&P 500 ETF Trust | A portfolio (weighted composite variable) |
| Mutual Fund | VFIAX | Vanguard 500 Index Fund | End-of-day only (one observation per day) |
2.3 Special Ticker Prefixes and Suffixes
The ^ prefix typically indicates an index (a computed aggregate, not a
directly tradeable security). The =F suffix marks a futures contract.
The =X suffix marks a currency exchange rate. These conventions vary
by data provider, so always check the documentation.
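These conventions are easy to encode. The helper below (classify_ticker is our own name, and the rules are heuristics for Yahoo-style symbols only, not a guarantee) sketches the mapping:

```python
def classify_ticker(symbol: str) -> str:
    """Rough asset-class guess from Yahoo Finance ticker conventions.

    These rules mirror the table above; always confirm against your
    data provider's documentation.
    """
    if symbol.startswith("^"):
        return "index"
    if symbol.endswith("=F"):
        return "futures"
    if symbol.endswith("=X"):
        return "currency"
    if "-" in symbol:          # e.g. BTC-USD
        return "crypto pair"
    return "stock/ETF/fund"

for t in ["AAPL", "^GSPC", "GC=F", "EURUSD=X", "BTC-USD"]:
    print(t, "->", classify_ticker(t))
```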
An index like ^GSPC (the S&P 500) is a summary statistic of the market. You cannot buy it directly; instead you buy an ETF like SPY that tracks it. The distinction matters: ETFs have tracking error, management fees, and their own supply/demand dynamics.
2.4 Ticker Pitfalls
Tickers are not permanent. Companies change tickers after mergers (e.g., Facebook
changed from FB to META). Delisted companies lose their
tickers, and new companies may reuse old ones. This creates an identification problem
analogous to changing subject IDs mid-study.
For example, TWTR belonged to Twitter until it was delisted in 2022. If a new company is assigned that ticker years later, historical lookups will mix two completely different entities. Always cross-reference with a permanent identifier like CUSIP or ISIN for longitudinal studies.
3. OHLCV: The Standard Observation Format
3.1 What OHLCV Means
Most financial datasets are delivered in OHLCV format. Each row represents a single time period (usually one trading day), and the columns are:
| Column | Definition | Statistical Interpretation |
|---|---|---|
| Open | Price at the start of the period | First observation in an intra-period sample |
| High | Maximum price during the period | Sample maximum — an order statistic |
| Low | Minimum price during the period | Sample minimum — an order statistic |
| Close | Price at the end of the period | Last observation in an intra-period sample |
| Volume | Number of shares/contracts traded | Sample size for that period (activity level) |
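To see OHLCV as summary statistics in action, here is a sketch that aggregates simulated event-level trades into hourly bars with pandas (the trade data is entirely synthetic; real bars come from a data vendor):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated event-level trades: one price and size per second
times = pd.date_range("2024-01-02 09:30", "2024-01-02 16:00", freq="s")
trades = pd.DataFrame(
    {
        "price": 100 + rng.standard_normal(len(times)).cumsum() * 0.01,
        "size": rng.integers(1, 500, len(times)),
    },
    index=times,
)

# Aggregate to 1-hour OHLCV bars: Open/High/Low/Close are order statistics
# of the intra-period prices, Volume is the sum of traded size
bars = trades["price"].resample("1h").ohlc()
bars["volume"] = trades["size"].resample("1h").sum()
print(bars)
```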
3.2 Adjusted Close
Many data sources include a sixth column: Adjusted Close (or
Adj Close). This is the closing price retroactively modified to account
for stock splits and dividend payments.
When computing returns, always use Adjusted Close, not the raw Close price. Using raw Close will introduce artificial jumps at every split and dividend date, corrupting your analysis.
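A toy illustration with made-up prices around a hypothetical 2-for-1 split shows why:

```python
import pandas as pd

# A 2-for-1 split on the third day halves the raw Close, but the Adjusted
# Close back-series is already rescaled, so adjusted returns show no jump.
df = pd.DataFrame(
    {
        "Close":     [100.0, 102.0, 51.5, 52.0],   # raw price halves at the split
        "Adj Close": [50.0, 51.0, 51.5, 52.0],     # pre-split history rescaled
    },
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]),
)

raw_ret = df["Close"].pct_change()
adj_ret = df["Adj Close"].pct_change()

print(raw_ret)   # a spurious ~-49.5% "return" on the split date
print(adj_ret)   # the true ~+0.98% return
```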
3.3 The High-Low Range as a Volatility Estimator
The difference between High and Low within a single period underlies a classical family of volatility estimators. The Parkinson estimator uses this range:

σ̂² = (1 / (4 ln 2)) · (ln(High / Low))², averaged across periods.

This is more efficient than the standard close-to-close estimator because it uses more information from each period: High and Low are order statistics of the entire intra-period price path, whereas the close-to-close estimator sees only one endpoint.
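A sketch comparing the two estimators on a simulated price path (the simulation setup is our own choice: 390 intraday steps per day, 2% true daily volatility):

```python
import numpy as np
import pandas as pd

def parkinson_vol(high: pd.Series, low: pd.Series) -> float:
    """Parkinson range-based estimate of per-period volatility."""
    log_range_sq = np.log(high / low) ** 2
    return float(np.sqrt(log_range_sq.mean() / (4 * np.log(2))))

def close_to_close_vol(close: pd.Series) -> float:
    """Standard deviation of close-to-close log returns."""
    return float(np.log(close / close.shift(1)).dropna().std(ddof=1))

# Simulate a continuous log-price path with known daily volatility of 2%,
# then carve it into days to get High, Low, and Close
rng = np.random.default_rng(42)
days, steps = 500, 390
sigma_daily = 0.02
increments = rng.normal(0, sigma_daily / np.sqrt(steps), days * steps)
log_price = (np.log(100) + increments.cumsum()).reshape(days, steps)

high = pd.Series(np.exp(log_price.max(axis=1)))
low = pd.Series(np.exp(log_price.min(axis=1)))
close = pd.Series(np.exp(log_price[:, -1]))

print(f"True daily vol: {sigma_daily:.4f}")
print(f"Parkinson:      {parkinson_vol(high, low):.4f}")
print(f"Close-to-close: {close_to_close_vol(close):.4f}")
```

On simulated data like this, the Parkinson estimate tends to sit slightly below the truth because a discretely sampled path never quite reaches the continuous high and low.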
3.4 Volume: The Overlooked Dimension
Volume tells you how many shares were traded. High volume with a price change signals conviction; low volume with a price change may be noise. Think of volume as the sample size for each observation: a mean computed from 10 million trades is more reliable than one computed from 10 trades.
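This intuition is exactly the volume-weighted average price (VWAP): a weighted mean in which volume plays the role of observation weight. A minimal sketch with made-up trades:

```python
import numpy as np

# Prices and volumes for a handful of trades (made-up numbers)
prices = np.array([100.0, 100.5, 99.8, 100.2])
volumes = np.array([10_000_000, 200_000, 50_000, 5_000])

# Volume-weighted average price: a weighted mean with volume as the weight
vwap = np.average(prices, weights=volumes)
plain_mean = prices.mean()

print(f"Plain mean: {plain_mean:.3f}")
print(f"VWAP:       {vwap:.3f}")
```

The 10-million-share trade pulls the VWAP toward 100.0, while the plain mean treats all four trades equally.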
4. Time Series Structure: Trading Days, Gaps, and Hours
4.1 Trading Days vs Calendar Days
US stock markets are open Monday through Friday, roughly 9:30 AM to 4:00 PM Eastern Time. They are closed on weekends and about ten federal holidays per year. This means a “daily” financial time series has approximately 252 trading days per year, not 365.
4.2 Weekend and Holiday Gaps
The gap between Friday close and Monday open is roughly 65.5 hours (Friday 4:00 PM to Monday 9:30 AM Eastern) compressed into a single observation. Weekend news (earnings announcements, geopolitical events) accumulates and is released as a “jump” at Monday’s open.
| Calendar Situation | Gap Duration | Statistical Impact |
|---|---|---|
| Weekday to weekday | ~17.5 hours | Standard overnight gap |
| Friday to Monday | ~65.5 hours | ~3.7x the standard gap; higher variance in returns |
| Before a 3-day holiday | ~89.5 hours | ~5x the standard gap; potentially extreme jumps |
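The gap durations above are simple clock arithmetic, which pandas Timedelta handles directly (the dates below are arbitrary examples):

```python
import pandas as pd

# Regular close and next-session open (Eastern time, tz omitted for brevity)
fri_close = pd.Timestamp("2024-01-05 16:00")   # a Friday
mon_open = pd.Timestamp("2024-01-08 09:30")    # the following Monday

weekday_gap_h = 17.5                           # 16:00 -> 09:30 the next day
weekend_gap_h = (mon_open - fri_close) / pd.Timedelta(hours=1)

print(f"Weekend gap: {weekend_gap_h:.1f} hours")
print(f"Ratio vs weekday gap: {weekend_gap_h / weekday_gap_h:.2f}x")
```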
4.3 Intraday Data and Market Microstructure
If you work with intraday data (1-minute or 5-minute bars), you must handle the overnight gap explicitly. The return from the last bar of one day to the first bar of the next day is not the same as a within-day 5-minute return — it spans a closed-market period where no trading occurred.
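One robust way to flag such boundary-spanning returns is to compare consecutive timestamps against the bar interval. A sketch on synthetic 5-minute bars (the price process and session times are our own toy choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

def day_bars(date: str, start_price: float) -> pd.Series:
    """Synthetic 5-minute closing prices for one trading session."""
    idx = pd.date_range(f"{date} 09:30", f"{date} 16:00", freq="5min")
    steps = rng.normal(0, 0.0005, len(idx)).cumsum()
    return pd.Series(start_price * np.exp(steps), index=idx)

bars = pd.concat([day_bars("2024-01-08", 100.0), day_bars("2024-01-09", 101.2)])

# 5-minute log returns, including one that spans the overnight gap
ret = np.log(bars / bars.shift(1)).dropna()

# Flag returns whose timestamp gap exceeds the bar interval: those span a
# closed-market period and should be modeled separately
overnight = ret.index.to_series().diff().gt(pd.Timedelta("5min"))

print(f"Within-day returns: {(~overnight).sum()}")
print(f"Overnight returns:  {overnight.sum()}")
```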
4.4 Time Zones
US markets operate on Eastern Time. If you pull data for stocks on multiple exchanges (e.g., London, Tokyo, New York), you must align time zones. The London Stock Exchange closes at 4:30 PM GMT, which is 11:30 AM Eastern — while the US market is still open. Failing to account for this introduces lead-lag artifacts in cross-market correlations.
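pandas handles the conversion cleanly if your timestamps are timezone-aware. Note that the London/New York offset is not constant, because the two daylight-saving calendars differ (the dates below are arbitrary examples):

```python
import pandas as pd

# LSE close on a winter day: 16:30 London time is 11:30 in New York
lse_close = pd.Timestamp("2024-01-17 16:30", tz="Europe/London")
ny_time = lse_close.tz_convert("America/New_York")
print(ny_time.strftime("%H:%M"))   # 11:30 -- the US market is still open

# In late March 2024 the US was already on daylight-saving time while the
# UK was not yet, so the offset temporarily shrinks to four hours
summer_edge = pd.Timestamp("2024-03-20 16:30", tz="Europe/London")
print(summer_edge.tz_convert("America/New_York").strftime("%H:%M"))   # 12:30
```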
5. Data Sources: Where to Get Financial Data in Python
5.1 yfinance: Free Yahoo Finance Data
The yfinance library is the most accessible way to pull financial data
in Python. It wraps Yahoo Finance’s API and provides OHLCV data, fundamental
data, options chains, and more.
```python
import yfinance as yf
import pandas as pd

# Download Apple stock data for the past 5 years.
# auto_adjust=False keeps the raw Close alongside Adj Close
# (newer yfinance versions adjust prices by default).
aapl = yf.download("AAPL", start="2020-01-01", end="2025-01-01", auto_adjust=False)

# Recent yfinance versions return a (field, ticker) MultiIndex even for a
# single ticker; flatten it so the columns are plain field names
if isinstance(aapl.columns, pd.MultiIndex):
    aapl.columns = aapl.columns.get_level_values(0)

# Inspect the structure
print(aapl.shape)              # (n_trading_days, 6)
print(aapl.columns.tolist())   # six fields: Open, High, Low, Close, Adj Close, Volume
print(type(aapl.index))        # DatetimeIndex

# First few rows
print(aapl.head(10))
```
5.2 Downloading Multiple Tickers
```python
# Download multiple tickers simultaneously
tickers = ["AAPL", "MSFT", "GOOG", "AMZN", "TSLA"]
data = yf.download(tickers, start="2020-01-01", end="2025-01-01", auto_adjust=False)

# Result is a MultiIndex DataFrame
# Level 0: Price type (Open, High, Low, Close, Adj Close, Volume)
# Level 1: Ticker symbol
print(data.columns[:6])

# Extract just the Adjusted Close for all tickers
adj_close = data["Adj Close"]
print(adj_close.head())
```
If you have used R’s plm package or Python’s linearmodels, you will recognize this as a panel with entity = ticker and time = date.
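Concretely, converting a wide adjusted-close table into the long (entity, time) layout that panel estimators work with is a one-liner with stack. A sketch on made-up prices:

```python
import pandas as pd

# Wide layout, as returned by yf.download(...)["Adj Close"]:
# one column per ticker (prices here are made up)
wide = pd.DataFrame(
    {"AAPL": [185.0, 186.2], "MSFT": [390.0, 392.5]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03"]),
)

# Long (panel) layout: one row per (entity, time) pair, the shape that
# linearmodels' PanelOLS (after setting a MultiIndex) and R's plm work with
long = (
    wide.stack()
    .rename("adj_close")
    .rename_axis(["date", "ticker"])
    .reset_index()
)
print(long)
```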
5.3 FRED: Federal Reserve Economic Data
The Federal Reserve Bank of St. Louis maintains FRED (Federal Reserve Economic Data), one of the most important sources of macroeconomic data. It covers interest rates, GDP, inflation, unemployment, and thousands of other series.
```python
import pandas as pd
from pandas_datareader import data as pdr

# Pull the 10-Year Treasury yield from FRED
treasury_10y = pdr.get_data_fred("GS10", start="2000-01-01")

# Pull the Federal Funds Rate
fed_funds = pdr.get_data_fred("FEDFUNDS", start="2000-01-01")

# Pull the Consumer Price Index
cpi = pdr.get_data_fred("CPIAUCSL", start="2000-01-01")

# Note: FRED series come at different frequencies. GS10, FEDFUNDS, and
# CPIAUCSL all happen to be monthly, but many series are not; align
# frequencies before merging.
print(treasury_10y.head())
print(f"Frequency: {pd.infer_freq(treasury_10y.index)}")
```
5.4 Quandl / Nasdaq Data Link
Quandl (now Nasdaq Data Link) provides both free and premium datasets covering commodities, futures, economic indicators, and alternative data.
```python
import nasdaqdatalink

# Set your API key (free registration required)
nasdaqdatalink.ApiConfig.api_key = "YOUR_API_KEY"

# Pull crude oil futures prices (continuous front-month contract)
oil = nasdaqdatalink.get("CHRIS/CME_CL1", start_date="2020-01-01")

# Pull Treasury yield curve data
yields = nasdaqdatalink.get("USTREASURY/YIELD", start_date="2020-01-01")

print(oil.columns.tolist())
print(yields.head())
```
5.5 Comparison of Data Sources
| Source | Cost | Coverage | Frequency | Best For |
|---|---|---|---|---|
| yfinance | Free | Stocks, ETFs, indices, crypto, options | 1m to daily | Quick analysis, prototyping |
| FRED | Free | Macroeconomic indicators | Daily to annual | Interest rates, GDP, inflation |
| Nasdaq Data Link | Free + Premium | Commodities, futures, alternative data | Varies | Research-grade data |
| Alpha Vantage | Free (limited) | Stocks, forex, crypto | 1m to daily | Free API with rate limits |
| Bloomberg | $$$$ (terminal) | Everything | Tick to daily | Professional finance (gold standard) |
6. Types of Financial Data
6.1 Cross-Sectional Data
A cross-section in finance is a snapshot of many assets at one point in time. For example: the closing prices of all 500 stocks in the S&P 500 on a single day. This is exactly like a cross-sectional survey — many subjects, one time point.
Cross-sectional analysis answers questions like: “Which stocks have the highest price-to-earnings ratio today?” or “Is there a cross-sectional relationship between market capitalization and average return?”
6.2 Time Series Data
A time series in finance tracks a single asset over many time periods. The daily closing prices of Apple from 2010 to 2025 form a time series. This is equivalent to a repeated-measures study on a single subject.
6.3 Panel Data
Panel data (or longitudinal data) combines both dimensions: multiple assets over multiple time periods. Most serious financial research uses panel data.
- Cross-section: N subjects, T = 1 — one snapshot
- Time series: N = 1 subject, T periods — one stock over time
- Panel: N subjects, T periods — many stocks over time
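A toy panel makes the three shapes concrete (prices are illustrative):

```python
import pandas as pd

# A tiny panel: N = 3 tickers, T = 2 dates
panel = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-02"] * 3 + ["2024-01-03"] * 3),
        "ticker": ["AAPL", "MSFT", "GOOG"] * 2,
        "close": [185.0, 390.0, 140.0, 186.2, 392.5, 141.1],
    }
)

# Cross-section: every entity at one time point
xsec = panel[panel["date"] == "2024-01-02"]

# Time series: one entity at every time point
tseries = panel[panel["ticker"] == "AAPL"]

print(f"Panel:         N x T = {panel['ticker'].nunique()} x {panel['date'].nunique()}")
print(f"Cross-section: {len(xsec)} rows (T = 1)")
print(f"Time series:   {len(tseries)} rows (N = 1)")
```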
6.4 Tick Data
Tick data records every individual trade or quote change. A single stock can generate millions of ticks per day. This is the raw, uncompressed version of OHLCV — analogous to having individual patient readings every second versus a daily summary.
| Data Type | Granularity | Typical Size (1 stock, 1 year) | Statistical Analogue |
|---|---|---|---|
| Daily OHLCV | 1 row per trading day | ~252 rows | Daily summary statistics |
| 1-Minute Bars | 1 row per minute | ~98,000 rows | Minute-level aggregates |
| Tick Data | 1 row per trade | ~50,000,000 rows | The raw event stream |
6.5 Alternative Data
Modern quantitative finance increasingly uses alternative data: satellite imagery (to count cars in parking lots), credit card transactions, social media sentiment, shipping container movements, and more. These are covariates — additional predictor variables that might explain asset returns beyond what price and volume alone can capture.
7. Hands-On: Exploring Apple Stock Data
7.1 Download and Inspect
```python
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Download Apple stock data (auto_adjust=False keeps the 'Adj Close' column)
aapl = yf.download("AAPL", start="2015-01-01", end="2025-01-01", auto_adjust=False)

# Flatten the (field, ticker) MultiIndex that newer yfinance versions return
if isinstance(aapl.columns, pd.MultiIndex):
    aapl.columns = aapl.columns.get_level_values(0)

# Basic shape and info
print(f"Shape: {aapl.shape}")
print(f"Date range: {aapl.index[0]} to {aapl.index[-1]}")
print(f"Trading days: {len(aapl)}")
print(f"Calendar days: {(aapl.index[-1] - aapl.index[0]).days}")
print(f"Ratio: {len(aapl) / (aapl.index[-1] - aapl.index[0]).days:.3f}")
# You'll see roughly 0.69 -- about 252/365
```
7.2 Summary Statistics
```python
# Statistical summary of each column
print(aapl.describe())

# Additional statistics a statistician would want
print("\n--- Additional Statistics ---")
print(f"Skewness of Close: {aapl['Close'].skew():.4f}")
print(f"Kurtosis of Close: {aapl['Close'].kurtosis():.4f}")
print(f"Coefficient of Variation: {aapl['Close'].std() / aapl['Close'].mean():.4f}")
print(f"Median Volume: {aapl['Volume'].median():,.0f}")
print(f"IQR of Volume: {aapl['Volume'].quantile(0.75) - aapl['Volume'].quantile(0.25):,.0f}")
```
7.3 Check for Missing Values
```python
# Missing values check
print("Missing values per column:")
print(aapl.isnull().sum())

# Check for gaps in trading days
date_diffs = aapl.index.to_series().diff().dt.days
print("\nDate gap statistics:")
print(date_diffs.describe())

# Find the longest gaps (holidays + weekends)
long_gaps = date_diffs[date_diffs > 3].sort_values(ascending=False)
print("\nGaps longer than 3 days (holiday weekends):")
print(long_gaps.head(10))
```
7.4 Visualize the Data
```python
fig, axes = plt.subplots(3, 1, figsize=(12, 10), sharex=True)

# Price chart
axes[0].plot(aapl.index, aapl['Adj Close'], color='#1a365d', linewidth=0.8)
axes[0].set_ylabel('Adjusted Close ($)')
axes[0].set_title('AAPL: Price, Daily Range, and Volume')

# Daily range (High - Low) as a volatility proxy
daily_range = (aapl['High'] - aapl['Low']) / aapl['Close'] * 100
axes[1].bar(aapl.index, daily_range, color='#e53e3e', alpha=0.6, width=1)
axes[1].set_ylabel('Daily Range (%)')

# Volume
axes[2].bar(aapl.index, aapl['Volume'] / 1e6, color='#3182ce', alpha=0.6, width=1)
axes[2].set_ylabel('Volume (millions)')

plt.tight_layout()
plt.savefig('aapl_exploration.png', dpi=150, bbox_inches='tight')
plt.show()
```
7.5 Understanding the Date Index
```python
# The index is a DatetimeIndex -- verify its properties
print(f"Index type: {type(aapl.index)}")
print(f"Timezone: {aapl.index.tz}")   # Usually None (timezone-naive)
print(f"Freq: {aapl.index.freq}")     # Usually None (irregular calendar spacing)

# Distribution of day-of-week
dow_counts = aapl.index.dayofweek.value_counts().sort_index()
dow_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
print("\nObservations by day of week:")
for i, count in dow_counts.items():
    print(f"  {dow_labels[i]}: {count}")
# You should see ~0 for Sat and Sun
```
7.6 Quick Correlation Check
```python
# Correlation matrix of OHLCV columns
corr = aapl[['Open', 'High', 'Low', 'Close', 'Volume']].corr()
print("Correlation matrix:")
print(corr.round(3))

# Notice: Open, High, Low, Close are nearly perfectly correlated.
# This makes sense -- they are all prices on the same day.
# Volume has a different (often negative) correlation with the price columns,
# partly because OHLC are nonstationary (trending) while volume is more stable.
```
8. Chapter Summary
Here is what we covered and the key mappings to your statistical knowledge:
| Financial Concept | Statistical Analogue |
|---|---|
| Ticker symbol | Subject/entity identifier |
| OHLCV row | Summary statistics of intra-period observations |
| Adjusted Close | Scale-corrected measurement (unit normalization) |
| Trading days | Irregularly spaced time series on calendar scale |
| Volume | Sample size / observation weight |
| Index (e.g., S&P 500) | Population aggregate / summary statistic |
| Panel of stocks | Longitudinal / panel data |
| Tick data | Raw event-level microdata |
In the next module, we will tackle the most fundamental transformation in financial analysis: converting prices into returns. You will learn why raw prices are nearly useless for statistical analysis and how the simple act of differencing (or log-differencing) transforms a nonstationary series into something you can actually model.