Module 01: From Datasets to Markets
Mapping the financial data landscape onto the statistical concepts you already know
1. Introduction: Your First Financial Dataset
As a statistician, you have spent years working with datasets from clinical trials, surveys, experiments, and simulations. Financial data shares many of the same structures you already know — tabular observations, time-indexed measurements, missing values — but it comes with its own vocabulary, conventions, and quirks that can trip up even experienced analysts.
This module maps the financial data landscape onto the statistical concepts you already understand. By the end, you will be able to pull live market data in Python, understand what every column means, and recognize the structural differences between financial time series and the kinds of data you have analyzed before.
Consider AAPL: it has daily price observations stretching back decades. The “study” is the market itself, an observational study with no control group.
We will move through six major topics: what tickers are and how they work, the OHLCV data format, the peculiar time structure of market data, where to get financial data programmatically, the different types of financial datasets, and a hands-on walkthrough of downloading and exploring real stock data.
2. Tickers: The Naming Convention of Finance
2.1 What Is a Ticker?
A ticker symbol (or just “ticker”) is a short alphanumeric code assigned to a publicly traded security. It is the financial world’s equivalent of a variable name or a primary key in a database.
Examples: AAPL (Apple Inc.), MSFT (Microsoft), TSLA (Tesla). Think of a ticker as the subject_id column in your panel data.
2.2 Ticker Naming Conventions
Different exchanges and asset classes follow different naming patterns. Here is a reference table:
| Asset Class | Example Ticker | Meaning | Statistical Analogue |
|---|---|---|---|
| US Stock | AAPL | Apple Inc. on NASDAQ | A single subject in a panel study |
| US Index | ^GSPC | S&P 500 Index | A population-level aggregate statistic |
| Futures Contract | GC=F | Gold futures (front month) | A derived variable with an expiration date |
| Currency Pair | EURUSD=X | Euro to US Dollar exchange rate | A ratio of two measurements |
| Cryptocurrency | BTC-USD | Bitcoin priced in USD | A 24/7 observed process (no gaps) |
| ETF | SPY | S&P 500 ETF Trust | A portfolio (weighted composite variable) |
| Mutual Fund | VFIAX | Vanguard 500 Index Fund | End-of-day only (one observation per day) |
2.3 Special Ticker Prefixes and Suffixes
The ^ prefix typically indicates an index (a computed aggregate, not a
directly tradeable security). The =F suffix marks a futures contract.
The =X suffix marks a currency exchange rate. These conventions vary
by data provider, so always check the documentation.
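These conventions are easy to encode. The helper below (classify_ticker is our own name, and the rules are heuristics for Yahoo-style symbols only, not a guarantee) sketches the mapping:

```python
def classify_ticker(symbol: str) -> str:
    """Rough asset-class guess from Yahoo Finance ticker conventions.

    These rules mirror the table above; always confirm against your
    data provider's documentation.
    """
    if symbol.startswith("^"):
        return "index"
    if symbol.endswith("=F"):
        return "futures"
    if symbol.endswith("=X"):
        return "currency"
    if "-" in symbol:          # e.g. BTC-USD
        return "crypto pair"
    return "stock/ETF/fund"

for t in ["AAPL", "^GSPC", "GC=F", "EURUSD=X", "BTC-USD"]:
    print(t, "->", classify_ticker(t))
```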
An index like ^GSPC (the S&P 500) is a summary statistic of the market. You cannot buy it directly; instead you buy an ETF like SPY that tracks it. The distinction matters: ETFs have tracking error, management fees, and their own supply/demand dynamics.
2.4 Ticker Pitfalls
Tickers are not permanent. Companies change tickers after mergers (e.g., Facebook
changed from FB to META). Delisted companies lose their
tickers, and new companies may reuse old ones. This creates an identification problem
analogous to changing subject IDs mid-study.
For example, TWTR belonged to Twitter until it was delisted in 2022. If a new company is assigned that ticker years later, historical lookups will mix two completely different entities. Always cross-reference with a permanent identifier like CUSIP or ISIN for longitudinal studies.
3. OHLCV: The Standard Observation Format
3.1 What OHLCV Means
Most financial datasets are delivered in OHLCV format. Each row represents a single time period (usually one trading day), and the columns are:
| Column | Definition | Statistical Interpretation |
|---|---|---|
| Open | Price at the start of the period | First observation in an intra-period sample |
| High | Maximum price during the period | Sample maximum — an order statistic |
| Low | Minimum price during the period | Sample minimum — an order statistic |
| Close | Price at the end of the period | Last observation in an intra-period sample |
| Volume | Number of shares/contracts traded | Sample size for that period (activity level) |
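To see OHLCV as summary statistics in action, here is a sketch that aggregates simulated event-level trades into hourly bars with pandas (the trade data is entirely synthetic; real bars come from a data vendor):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated event-level trades: one price and size per second
times = pd.date_range("2024-01-02 09:30", "2024-01-02 16:00", freq="s")
trades = pd.DataFrame(
    {
        "price": 100 + rng.standard_normal(len(times)).cumsum() * 0.01,
        "size": rng.integers(1, 500, len(times)),
    },
    index=times,
)

# Aggregate to 1-hour OHLCV bars: Open/High/Low/Close are order statistics
# of the intra-period prices, Volume is the sum of traded size
bars = trades["price"].resample("1h").ohlc()
bars["volume"] = trades["size"].resample("1h").sum()
print(bars)
```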
3.2 Adjusted Close
Many data sources include a sixth column: Adjusted Close (or
Adj Close). This is the closing price retroactively modified to account
for stock splits and dividend payments.
When computing returns, always use Adjusted Close, not the raw Close price. Using raw Close will introduce artificial jumps at every split and dividend date, corrupting your analysis.
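A toy illustration with made-up prices around a hypothetical 2-for-1 split shows why:

```python
import pandas as pd

# A 2-for-1 split on the third day halves the raw Close, but the Adjusted
# Close back-series is already rescaled, so adjusted returns show no jump.
df = pd.DataFrame(
    {
        "Close":     [100.0, 102.0, 51.5, 52.0],   # raw price halves at the split
        "Adj Close": [50.0, 51.0, 51.5, 52.0],     # pre-split history rescaled
    },
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]),
)

raw_ret = df["Close"].pct_change()
adj_ret = df["Adj Close"].pct_change()

print(raw_ret)   # a spurious ~-49.5% "return" on the split date
print(adj_ret)   # the true ~+0.98% return
```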
3.3 The High-Low Range as a Volatility Estimator
The difference between High and Low within a single period underlies a classical family of volatility estimators. The Parkinson estimator uses this range:

σ̂² = (1 / (4 ln 2)) · (ln(High / Low))², averaged across periods.

This is more efficient than the standard close-to-close estimator because it uses more information from each period: High and Low are order statistics of the entire intra-period price path, whereas the close-to-close estimator sees only one endpoint.
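A sketch comparing the two estimators on a simulated price path (the simulation setup is our own choice: 390 intraday steps per day, 2% true daily volatility):

```python
import numpy as np
import pandas as pd

def parkinson_vol(high: pd.Series, low: pd.Series) -> float:
    """Parkinson range-based estimate of per-period volatility."""
    log_range_sq = np.log(high / low) ** 2
    return float(np.sqrt(log_range_sq.mean() / (4 * np.log(2))))

def close_to_close_vol(close: pd.Series) -> float:
    """Standard deviation of close-to-close log returns."""
    return float(np.log(close / close.shift(1)).dropna().std(ddof=1))

# Simulate a continuous log-price path with known daily volatility of 2%,
# then carve it into days to get High, Low, and Close
rng = np.random.default_rng(42)
days, steps = 500, 390
sigma_daily = 0.02
increments = rng.normal(0, sigma_daily / np.sqrt(steps), days * steps)
log_price = (np.log(100) + increments.cumsum()).reshape(days, steps)

high = pd.Series(np.exp(log_price.max(axis=1)))
low = pd.Series(np.exp(log_price.min(axis=1)))
close = pd.Series(np.exp(log_price[:, -1]))

print(f"True daily vol: {sigma_daily:.4f}")
print(f"Parkinson:      {parkinson_vol(high, low):.4f}")
print(f"Close-to-close: {close_to_close_vol(close):.4f}")
```

On simulated data like this, the Parkinson estimate tends to sit slightly below the truth because a discretely sampled path never quite reaches the continuous high and low.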
3.4 Volume: The Overlooked Dimension
Volume tells you how many shares were traded. High volume with a price change signals conviction; low volume with a price change may be noise. Think of volume as the sample size for each observation: a mean computed from 10 million trades is more reliable than one computed from 10 trades.
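This intuition is exactly the volume-weighted average price (VWAP): a weighted mean in which volume plays the role of observation weight. A minimal sketch with made-up trades:

```python
import numpy as np

# Prices and volumes for a handful of trades (made-up numbers)
prices = np.array([100.0, 100.5, 99.8, 100.2])
volumes = np.array([10_000_000, 200_000, 50_000, 5_000])

# Volume-weighted average price: a weighted mean with volume as the weight
vwap = np.average(prices, weights=volumes)
plain_mean = prices.mean()

print(f"Plain mean: {plain_mean:.3f}")
print(f"VWAP:       {vwap:.3f}")
```

The 10-million-share trade pulls the VWAP toward 100.0, while the plain mean treats all four trades equally.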
4. Time Series Structure: Trading Days, Gaps, and Hours
4.1 Trading Days vs Calendar Days
US stock markets are open Monday through Friday, roughly 9:30 AM to 4:00 PM Eastern Time. They are closed on weekends and about ten federal holidays per year. This means a “daily” financial time series has approximately 252 trading days per year, not 365.
4.2 Weekend and Holiday Gaps
The gap between Friday close and Monday open is roughly 65.5 hours (Friday 4:00 PM to Monday 9:30 AM Eastern) compressed into a single observation. Weekend news (earnings announcements, geopolitical events) accumulates and is released as a “jump” at Monday’s open.
| Calendar Situation | Gap Duration | Statistical Impact |
|---|---|---|
| Weekday to weekday | ~17.5 hours | Standard overnight gap |
| Friday to Monday | ~65.5 hours | ~3.7x the standard gap; higher variance in returns |
| Before a 3-day holiday | ~89.5 hours | ~5x the standard gap; potentially extreme jumps |
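The gap durations above are simple clock arithmetic, which pandas Timedelta handles directly (the dates below are arbitrary examples):

```python
import pandas as pd

# Regular close and next-session open (Eastern time, tz omitted for brevity)
fri_close = pd.Timestamp("2024-01-05 16:00")   # a Friday
mon_open = pd.Timestamp("2024-01-08 09:30")    # the following Monday

weekday_gap_h = 17.5                           # 16:00 -> 09:30 the next day
weekend_gap_h = (mon_open - fri_close) / pd.Timedelta(hours=1)

print(f"Weekend gap: {weekend_gap_h:.1f} hours")
print(f"Ratio vs weekday gap: {weekend_gap_h / weekday_gap_h:.2f}x")
```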
4.3 Intraday Data and Market Microstructure
If you work with intraday data (1-minute or 5-minute bars), you must handle the overnight gap explicitly. The return from the last bar of one day to the first bar of the next day is not the same as a within-day 5-minute return — it spans a closed-market period where no trading occurred.
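One robust way to flag such boundary-spanning returns is to compare consecutive timestamps against the bar interval. A sketch on synthetic 5-minute bars (the price process and session times are our own toy choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

def day_bars(date: str, start_price: float) -> pd.Series:
    """Synthetic 5-minute closing prices for one trading session."""
    idx = pd.date_range(f"{date} 09:30", f"{date} 16:00", freq="5min")
    steps = rng.normal(0, 0.0005, len(idx)).cumsum()
    return pd.Series(start_price * np.exp(steps), index=idx)

bars = pd.concat([day_bars("2024-01-08", 100.0), day_bars("2024-01-09", 101.2)])

# 5-minute log returns, including one that spans the overnight gap
ret = np.log(bars / bars.shift(1)).dropna()

# Flag returns whose timestamp gap exceeds the bar interval: those span a
# closed-market period and should be modeled separately
overnight = ret.index.to_series().diff().gt(pd.Timedelta("5min"))

print(f"Within-day returns: {(~overnight).sum()}")
print(f"Overnight returns:  {overnight.sum()}")
```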
4.4 Time Zones
US markets operate on Eastern Time. If you pull data for stocks on multiple exchanges (e.g., London, Tokyo, New York), you must align time zones. The London Stock Exchange closes at 4:30 PM GMT, which is 11:30 AM Eastern — while the US market is still open. Failing to account for this introduces lead-lag artifacts in cross-market correlations.
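pandas handles the conversion cleanly if your timestamps are timezone-aware. Note that the London/New York offset is not constant, because the two daylight-saving calendars differ (the dates below are arbitrary examples):

```python
import pandas as pd

# LSE close on a winter day: 16:30 London time is 11:30 in New York
lse_close = pd.Timestamp("2024-01-17 16:30", tz="Europe/London")
ny_time = lse_close.tz_convert("America/New_York")
print(ny_time.strftime("%H:%M"))   # 11:30 -- the US market is still open

# In late March 2024 the US was already on daylight-saving time while the
# UK was not yet, so the offset temporarily shrinks to four hours
summer_edge = pd.Timestamp("2024-03-20 16:30", tz="Europe/London")
print(summer_edge.tz_convert("America/New_York").strftime("%H:%M"))   # 12:30
```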
5. Data Sources: Where to Get Financial Data in Python
5.1 yfinance: Free Yahoo Finance Data
The yfinance library is the most accessible way to pull financial data
in Python. It wraps Yahoo Finance’s API and provides OHLCV data, fundamental
data, options chains, and more.
```python
import yfinance as yf
import pandas as pd

# Download Apple stock data for the past 5 years.
# auto_adjust=False keeps the raw Close alongside Adj Close
# (newer yfinance versions adjust prices by default).
aapl = yf.download("AAPL", start="2020-01-01", end="2025-01-01", auto_adjust=False)

# Recent yfinance versions return a (field, ticker) MultiIndex even for a
# single ticker; flatten it so the columns are plain field names
if isinstance(aapl.columns, pd.MultiIndex):
    aapl.columns = aapl.columns.get_level_values(0)

# Inspect the structure
print(aapl.shape)              # (n_trading_days, 6)
print(aapl.columns.tolist())   # six fields: Open, High, Low, Close, Adj Close, Volume
print(type(aapl.index))        # DatetimeIndex

# First few rows
print(aapl.head(10))
```
5.2 Downloading Multiple Tickers
```python
# Download multiple tickers simultaneously
tickers = ["AAPL", "MSFT", "GOOG", "AMZN", "TSLA"]
data = yf.download(tickers, start="2020-01-01", end="2025-01-01", auto_adjust=False)

# Result is a MultiIndex DataFrame
# Level 0: Price type (Open, High, Low, Close, Adj Close, Volume)
# Level 1: Ticker symbol
print(data.columns[:6])

# Extract just the Adjusted Close for all tickers
adj_close = data["Adj Close"]
print(adj_close.head())
```
If you have used R’s plm package or Python’s linearmodels, you will recognize this as a panel with entity = ticker and time = date.
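Concretely, converting a wide adjusted-close table into the long (entity, time) layout that panel estimators work with is a one-liner with stack. A sketch on made-up prices:

```python
import pandas as pd

# Wide layout, as returned by yf.download(...)["Adj Close"]:
# one column per ticker (prices here are made up)
wide = pd.DataFrame(
    {"AAPL": [185.0, 186.2], "MSFT": [390.0, 392.5]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03"]),
)

# Long (panel) layout: one row per (entity, time) pair, the shape that
# linearmodels' PanelOLS (after setting a MultiIndex) and R's plm work with
long = (
    wide.stack()
    .rename("adj_close")
    .rename_axis(["date", "ticker"])
    .reset_index()
)
print(long)
```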
5.3 FRED: Federal Reserve Economic Data
The Federal Reserve Bank of St. Louis maintains FRED (Federal Reserve Economic Data), one of the most important sources of macroeconomic data. It covers interest rates, GDP, inflation, unemployment, and thousands of other series.
```python
import pandas as pd
from pandas_datareader import data as pdr

# Pull the 10-Year Treasury yield from FRED
treasury_10y = pdr.get_data_fred("GS10", start="2000-01-01")

# Pull the Federal Funds Rate
fed_funds = pdr.get_data_fred("FEDFUNDS", start="2000-01-01")

# Pull the Consumer Price Index
cpi = pdr.get_data_fred("CPIAUCSL", start="2000-01-01")

# Note: FRED series come at different frequencies. GS10, FEDFUNDS, and
# CPIAUCSL all happen to be monthly, but many series are not; align
# frequencies before merging.
print(treasury_10y.head())
print(f"Frequency: {pd.infer_freq(treasury_10y.index)}")
```
5.4 Quandl / Nasdaq Data Link
Quandl (now Nasdaq Data Link) provides both free and premium datasets covering commodities, futures, economic indicators, and alternative data.
```python
import nasdaqdatalink

# Set your API key (free registration required)
nasdaqdatalink.ApiConfig.api_key = "YOUR_API_KEY"

# Pull crude oil futures prices (continuous front-month contract)
oil = nasdaqdatalink.get("CHRIS/CME_CL1", start_date="2020-01-01")

# Pull Treasury yield curve data
yields = nasdaqdatalink.get("USTREASURY/YIELD", start_date="2020-01-01")

print(oil.columns.tolist())
print(yields.head())
```
5.5 Comparison of Data Sources
| Source | Cost | Coverage | Frequency | Best For |
|---|---|---|---|---|
| yfinance | Free | Stocks, ETFs, indices, crypto, options | 1m to daily | Quick analysis, prototyping |
| FRED | Free | Macroeconomic indicators | Daily to annual | Interest rates, GDP, inflation |
| Nasdaq Data Link | Free + Premium | Commodities, futures, alternative data | Varies | Research-grade data |
| Alpha Vantage | Free (limited) | Stocks, forex, crypto | 1m to daily | Free API with rate limits |
| Bloomberg | $$$$ (terminal) | Everything | Tick to daily | Professional finance (gold standard) |
6. Types of Financial Data
6.1 Cross-Sectional Data
A cross-section in finance is a snapshot of many assets at one point in time. For example: the closing prices of all 500 stocks in the S&P 500 on a single day. This is exactly like a cross-sectional survey — many subjects, one time point.
Cross-sectional analysis answers questions like: “Which stocks have the highest price-to-earnings ratio today?” or “Is there a cross-sectional relationship between market capitalization and average return?”
6.2 Time Series Data
A time series in finance tracks a single asset over many time periods. The daily closing prices of Apple from 2010 to 2025 form a time series. This is equivalent to a repeated-measures study on a single subject.
6.3 Panel Data
Panel data (or longitudinal data) combines both dimensions: multiple assets over multiple time periods. Most serious financial research uses panel data.
- Cross-section: N subjects, T = 1 — one snapshot
- Time series: N = 1 subject, T periods — one stock over time
- Panel: N subjects, T periods — many stocks over time
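A toy panel makes the three shapes concrete (prices are illustrative):

```python
import pandas as pd

# A tiny panel: N = 3 tickers, T = 2 dates
panel = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-02"] * 3 + ["2024-01-03"] * 3),
        "ticker": ["AAPL", "MSFT", "GOOG"] * 2,
        "close": [185.0, 390.0, 140.0, 186.2, 392.5, 141.1],
    }
)

# Cross-section: every entity at one time point
xsec = panel[panel["date"] == "2024-01-02"]

# Time series: one entity at every time point
tseries = panel[panel["ticker"] == "AAPL"]

print(f"Panel:         N x T = {panel['ticker'].nunique()} x {panel['date'].nunique()}")
print(f"Cross-section: {len(xsec)} rows (T = 1)")
print(f"Time series:   {len(tseries)} rows (N = 1)")
```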
6.4 Tick Data
Tick data records every individual trade or quote change. A single stock can generate millions of ticks per day. This is the raw, uncompressed version of OHLCV — analogous to having individual patient readings every second versus a daily summary.
| Data Type | Granularity | Typical Size (1 stock, 1 year) | Statistical Analogue |
|---|---|---|---|
| Daily OHLCV | 1 row per trading day | ~252 rows | Daily summary statistics |
| 1-Minute Bars | 1 row per minute | ~98,000 rows | Minute-level aggregates |
| Tick Data | 1 row per trade | ~50,000,000 rows | The raw event stream |
6.5 Alternative Data
Modern quantitative finance increasingly uses alternative data: satellite imagery (to count cars in parking lots), credit card transactions, social media sentiment, shipping container movements, and more. These are covariates — additional predictor variables that might explain asset returns beyond what price and volume alone can capture.
7. Hands-On: Exploring Apple Stock Data
7.1 Download and Inspect
```python
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Download Apple stock data (auto_adjust=False keeps the 'Adj Close' column)
aapl = yf.download("AAPL", start="2015-01-01", end="2025-01-01", auto_adjust=False)

# Flatten the (field, ticker) MultiIndex that newer yfinance versions return
if isinstance(aapl.columns, pd.MultiIndex):
    aapl.columns = aapl.columns.get_level_values(0)

# Basic shape and info
print(f"Shape: {aapl.shape}")
print(f"Date range: {aapl.index[0]} to {aapl.index[-1]}")
print(f"Trading days: {len(aapl)}")
print(f"Calendar days: {(aapl.index[-1] - aapl.index[0]).days}")
print(f"Ratio: {len(aapl) / (aapl.index[-1] - aapl.index[0]).days:.3f}")
# You'll see roughly 0.69 -- about 252/365
```
7.2 Summary Statistics
```python
# Statistical summary of each column
print(aapl.describe())

# Additional statistics a statistician would want
print("\n--- Additional Statistics ---")
print(f"Skewness of Close: {aapl['Close'].skew():.4f}")
print(f"Kurtosis of Close: {aapl['Close'].kurtosis():.4f}")
print(f"Coefficient of Variation: {aapl['Close'].std() / aapl['Close'].mean():.4f}")
print(f"Median Volume: {aapl['Volume'].median():,.0f}")
print(f"IQR of Volume: {aapl['Volume'].quantile(0.75) - aapl['Volume'].quantile(0.25):,.0f}")
```
7.3 Check for Missing Values
```python
# Missing values check
print("Missing values per column:")
print(aapl.isnull().sum())

# Check for gaps in trading days
date_diffs = aapl.index.to_series().diff().dt.days
print("\nDate gap statistics:")
print(date_diffs.describe())

# Find the longest gaps (holidays + weekends)
long_gaps = date_diffs[date_diffs > 3].sort_values(ascending=False)
print("\nGaps longer than 3 days (holiday weekends):")
print(long_gaps.head(10))
```
7.4 Visualize the Data
```python
fig, axes = plt.subplots(3, 1, figsize=(12, 10), sharex=True)

# Price chart
axes[0].plot(aapl.index, aapl['Adj Close'], color='#1a365d', linewidth=0.8)
axes[0].set_ylabel('Adjusted Close ($)')
axes[0].set_title('AAPL: Price, Daily Range, and Volume')

# Daily range (High - Low) as a volatility proxy
daily_range = (aapl['High'] - aapl['Low']) / aapl['Close'] * 100
axes[1].bar(aapl.index, daily_range, color='#e53e3e', alpha=0.6, width=1)
axes[1].set_ylabel('Daily Range (%)')

# Volume
axes[2].bar(aapl.index, aapl['Volume'] / 1e6, color='#3182ce', alpha=0.6, width=1)
axes[2].set_ylabel('Volume (millions)')

plt.tight_layout()
plt.savefig('aapl_exploration.png', dpi=150, bbox_inches='tight')
plt.show()
```
7.5 Understanding the Date Index
```python
# The index is a DatetimeIndex -- verify its properties
print(f"Index type: {type(aapl.index)}")
print(f"Timezone: {aapl.index.tz}")   # Usually None (timezone-naive)
print(f"Freq: {aapl.index.freq}")     # Usually None (irregular calendar spacing)

# Distribution of day-of-week
dow_counts = aapl.index.dayofweek.value_counts().sort_index()
dow_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
print("\nObservations by day of week:")
for i, count in dow_counts.items():
    print(f"  {dow_labels[i]}: {count}")
# You should see ~0 for Sat and Sun
```
7.6 Quick Correlation Check
```python
# Correlation matrix of OHLCV columns
corr = aapl[['Open', 'High', 'Low', 'Close', 'Volume']].corr()
print("Correlation matrix:")
print(corr.round(3))

# Notice: Open, High, Low, Close are nearly perfectly correlated.
# This makes sense -- they are all prices on the same day.
# Volume has a different (often negative) correlation with the price columns,
# partly because OHLC are nonstationary (trending) while volume is more stable.
```
8. Chapter Summary
Here is what we covered and the key mappings to your statistical knowledge:
| Financial Concept | Statistical Analogue |
|---|---|
| Ticker symbol | Subject/entity identifier |
| OHLCV row | Summary statistics of intra-period observations |
| Adjusted Close | Scale-corrected measurement (unit normalization) |
| Trading days | Irregularly spaced time series on calendar scale |
| Volume | Sample size / observation weight |
| Index (e.g., S&P 500) | Population aggregate / summary statistic |
| Panel of stocks | Longitudinal / panel data |
| Tick data | Raw event-level microdata |
In the next module, we will tackle the most fundamental transformation in financial analysis: converting prices into returns. You will learn why raw prices are nearly useless for statistical analysis and how the simple act of differencing (or log-differencing) transforms a nonstationary series into something you can actually model.