Lesson 1: Introduction to Linear Regression

Understanding bivariate data, scatter plots, and correlation

Learning Objectives

By the end of this lesson, you will be able to:

Understand bivariate quantitative data
Create and interpret scatter plots
Identify patterns in scatter plots (direction, form, strength)
Calculate and interpret the correlation coefficient (r)
Understand the properties and limitations of correlation
Distinguish between correlation and causation

1. Bivariate Quantitative Data

So far in this course, we've primarily analyzed one variable at a time (univariate data): the distribution of test scores, the average height of students, or the proportion who prefer option A. But many interesting questions involve two variables:

Does studying more hours lead to higher test scores?
Is there a relationship between height and weight?
Do older cars tend to have lower resale values?
Is there a connection between temperature and ice cream sales?

Bivariate Data

Bivariate quantitative data consists of pairs of observations: two quantitative variables measured on each individual or case.

Example: (height, weight) for each person: (65 inches, 140 lbs), (72 inches, 180 lbs), etc.

When we have bivariate data, we often want to know:

Is there a relationship between the two variables?
How strong is that relationship?
What direction is the relationship (positive or negative)?
Can we predict one variable from the other?

Explanatory vs Response Variable

When analyzing relationships, we often have:

Variable Type	Also Called	Symbol	Description
Explanatory	Independent, Predictor	x	The variable we use to explain or predict
Response	Dependent, Outcome	y	The variable we're trying to explain or predict

Example: If we want to predict test scores (y) from study hours (x):

x = study hours (explanatory - we use this to predict)
y = test score (response - this is what we're predicting)

2. Scatter Plots

The first step in analyzing bivariate data is to visualize it with a scatter plot.

Scatter Plot

A scatter plot is a graph that displays bivariate data as points in a coordinate system:

x-axis (horizontal): Explanatory variable
y-axis (vertical): Response variable
Each point represents one individual/case: (x, y)
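To make the definition concrete, here is a minimal sketch that draws a plain-text scatter plot using only the Python standard library. The function name text_scatter and the grid size are our own choices for illustration, and the data are the five (study hours, test score) pairs from the worked example later in this lesson. In practice you would more likely use a plotting library or a graphing calculator.

```python
def text_scatter(xs, ys, width=21, height=7):
    """Return a plain-text scatter plot of the (x, y) pairs.

    Assumes xs and ys have the same length and that each variable
    takes at least two different values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column (x) or row (y) of the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        # Row 0 of the grid is the top line, so flip the y direction.
        grid[height - 1 - row][col] = "*"
    lines = ["|" + "".join(cells) for cells in grid]
    lines.append("+" + "-" * width)
    return "\n".join(lines)


# Explanatory variable on the horizontal axis, response on the vertical axis.
study_hours = [2, 4, 6, 8, 10]
test_scores = [65, 75, 80, 90, 95]

plot = text_scatter(study_hours, test_scores)
print(plot)
```

Each * is one student. The points rise from the lower left to the upper right, which is what a positive association looks like.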

[Figure: Example scatter plot of study hours (x) vs test scores (y)]

Describing Patterns in Scatter Plots

When examining a scatter plot, we look for three characteristics:

Characteristic	What to Look For	Examples
1. Direction	Positive, Negative, or No Association	Positive: As x increases, y tends to increase; Negative: As x increases, y tends to decrease; None: No clear pattern
2. Form	Linear or Non-linear	Linear: Points follow a straight-line pattern; Curved: Points follow a curved pattern; No form: Random scatter
3. Strength	How closely points follow the pattern	Strong: Points cluster tightly around the pattern; Moderate: Points somewhat scattered; Weak: Points widely scattered

[Figure: Correlation Patterns]

Example: Describing Scatter Plots

Scenario 1: Height (x) vs Weight (y)

Description: The scatter plot shows a positive, linear, moderately strong relationship. As height increases, weight tends to increase. Points follow a roughly straight-line pattern with some scatter.

Scenario 2: Age of Car (x) vs Resale Value (y)

Description: The scatter plot shows a negative, linear, strong relationship. As the age of the car increases, its resale value tends to decrease. Points cluster tightly around a downward-sloping pattern.

Scenario 3: Shoe Size (x) vs IQ (y)

Description: The scatter plot shows no association. There is no clear pattern or relationship between shoe size and IQ. Points are randomly scattered.

3. The Correlation Coefficient (r)

While scatter plots let us see relationships, the correlation coefficient gives us a numerical measure of the strength and direction of a linear relationship.

Correlation Coefficient (r)

The correlation coefficient, denoted r, measures the strength and direction of the linear relationship between two quantitative variables.

Properties:

Range: -1 ≤ r ≤ 1 (always between -1 and 1)
r = 1: Perfect positive linear relationship (all points exactly on an upward line)
r = -1: Perfect negative linear relationship (all points exactly on a downward line)
r = 0: No linear relationship
Sign: Positive r = positive association; Negative r = negative association
Magnitude: |r| closer to 1 = stronger relationship; |r| closer to 0 = weaker relationship

Interpreting r Values

r Value	Interpretation	Example
r = 1.0	Perfect positive linear relationship	Celsius and Fahrenheit temperature
0.7 ≤ r < 1.0	Strong positive linear relationship	Height and weight (r ≈ 0.80)
0.3 ≤ r < 0.7	Moderate positive linear relationship	Study hours and test scores (r ≈ 0.50)
0 < r < 0.3	Weak positive linear relationship	Age and height in adults (r ≈ 0.10)
r = 0	No linear relationship	Shoe size and IQ
-0.3 < r < 0	Weak negative linear relationship	Age and reflexes (r ≈ -0.20)
-0.7 < r ≤ -0.3	Moderate negative linear relationship	Distance from equator and temperature (r ≈ -0.60)
-1.0 < r ≤ -0.7	Strong negative linear relationship	Car age and resale value (r ≈ -0.85)
r = -1.0	Perfect negative linear relationship	Amount spent and amount remaining from a fixed budget
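The table above can be read as a lookup rule. Here is a minimal sketch of that rule in Python. The function name describe_r is our own, and the cutoffs 0.3 and 0.7 are simply the conventions used in this lesson's table, not universal standards.

```python
def describe_r(r):
    """Describe r in words, using the cutoffs from this lesson's table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and 1")
    if r == 0:
        return "no linear relationship"

    # The sign of r gives the direction.
    direction = "positive" if r > 0 else "negative"

    # The magnitude |r| gives the strength.
    size = abs(r)
    if size == 1.0:
        strength = "perfect"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.3:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} linear relationship"


print(describe_r(0.80))
print(describe_r(-0.60))
```

Because the rule uses |r| for strength and the sign for direction, the same cutoffs handle the positive and negative rows of the table.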

Formula for Correlation Coefficient

The correlation coefficient is calculated as:

r = Σ[(x - x̄)(y - ȳ)] / √[Σ(x - x̄)² × Σ(y - ȳ)²]

Where:

x̄ = mean of x values
ȳ = mean of y values
Σ = sum across all data points

Note: In practice, we use technology (calculators, software) to compute r. The formula helps us understand what r measures: how x and y vary together relative to their individual variations.
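If you are curious what the technology is doing, the formula translates almost line for line into code. This is a minimal sketch using only the Python standard library. The function name pearson_r and the sample temperatures are our own illustrative choices.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient r, computed directly from the formula above."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)

    # Numerator: how x and y vary together.
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator pieces: how x and y each vary on their own.
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)

    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("r is undefined when a variable does not vary")
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Celsius and Fahrenheit are related by an exact upward-sloping line,
# so r should be 1 (up to floating-point rounding).
celsius = [0, 10, 20, 30, 40]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]
r_temp = pearson_r(celsius, fahrenheit)
print(r_temp)
```

The check at the end guards against a case the formula cannot handle: if every x (or every y) is the same, the denominator is 0 and r is undefined.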

Example: Calculating Correlation

Let's calculate the correlation between study hours and test scores for 5 students:

Student	Study Hours (x)	Test Score (y)
A	2	65
B	4	75
C	6	80
D	8	90
E	10	95

Step 1: Calculate means

x̄ = (2 + 4 + 6 + 8 + 10) / 5 = 30 / 5 = 6 hours
ȳ = (65 + 75 + 80 + 90 + 95) / 5 = 405 / 5 = 81 points

Step 2: Calculate deviations and products

x	y	(x - x̄)	(y - ȳ)	(x - x̄)(y - ȳ)	(x - x̄)²	(y - ȳ)²
2	65	-4	-16	64	16	256
4	75	-2	-6	12	4	36
6	80	0	-1	0	0	1
8	90	2	9	18	4	81
10	95	4	14	56	16	196
Sums:				150	40	570

Step 3: Apply the formula

r = 150 / √(40 × 570) = 150 / √22,800 ≈ 150 / 151.0 ≈ 0.993

Interpretation: r ≈ 0.993 indicates a very strong positive linear relationship between study hours and test scores. Among these five students, more study hours are strongly associated with higher test scores.
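The three steps above can be checked with a short script. This is a sketch that rebuilds the columns of the Step 2 table from the five students' data; the variable names are our own.

```python
import math

hours = [2, 4, 6, 8, 10]         # x: study hours
scores = [65, 75, 80, 90, 95]    # y: test scores

# Step 1: means
x_bar = sum(hours) / len(hours)
y_bar = sum(scores) / len(scores)

# Step 2: deviations, products, and squared deviations (one row per student)
rows = []
for x, y in zip(hours, scores):
    dx = x - x_bar
    dy = y - y_bar
    rows.append((dx * dy, dx ** 2, dy ** 2))

sum_xy = sum(row[0] for row in rows)
sum_xx = sum(row[1] for row in rows)
sum_yy = sum(row[2] for row in rows)

# Step 3: apply the formula
r = sum_xy / math.sqrt(sum_xx * sum_yy)

print("means:", x_bar, y_bar)
print("sums:", sum_xy, sum_xx, sum_yy)
print("r rounded to 3 places:", round(r, 3))
```

The sums should match the bottom row of the table (150, 40, and 570), and r should round to 0.993.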

4. Properties and Limitations of Correlation

Important Properties of r

Properties of the Correlation Coefficient

Unitless: r has no units (not hours, pounds, etc.). This allows us to compare correlations across different contexts.
Symmetric: The correlation of x with y equals the correlation of y with x. r(x,y) = r(y,x).
Measures LINEAR relationships only: r only detects straight-line patterns. A curved relationship might have r ≈ 0 even if there's a strong non-linear relationship.
Not resistant to outliers: A single extreme point can dramatically inflate or deflate r, giving a misleading picture of the relationship.

Correlation Measures LINEAR Relationships Only

A correlation close to 0 means no LINEAR relationship, but there could still be a strong non-linear (curved) relationship!

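Example: Height plotted against age for children (ages 0-18) follows a curve that levels off in the late teens. Here r is still clearly positive, so r alone hides the curvature. With a symmetric pattern such as a U shape, r can be near 0 even though the relationship is strong.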

Correlation Does NOT Imply Causation

This is one of the most important ideas in statistics:

CORRELATION ≠ CAUSATION

Just because two variables are correlated does NOT mean one causes the other!

Example 1: Ice cream sales and drowning deaths are positively correlated. Does ice cream cause drowning? No! Both increase in summer due to a third variable: hot weather (confounding variable).

Example 2: Number of firefighters at a fire and amount of fire damage are positively correlated. Do firefighters cause damage? No! Bigger fires require more firefighters AND cause more damage.

Example 3: Shoe size and reading ability are positively correlated in children. Do bigger feet cause better reading? No! Both increase with age.

Three possible explanations for correlation:

x causes y: Study hours → higher test scores (plausible)
y causes x: Higher test scores → more studying (reverse causation)
Third variable (confounding): Both x and y are caused by something else (intelligence → more studying AND higher scores)

To establish causation, you need:

A well-designed randomized controlled experiment
Random assignment to treatment and control groups
Controlled conditions to eliminate confounding
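Random assignment is simple to carry out with software. The sketch below shuffles a list of participants and splits it in half. The function name randomly_assign, the participant labels, and the seed value are all our own illustrative choices.

```python
import random


def randomly_assign(participants, seed=None):
    """Shuffle the participants and split them into two groups."""
    rng = random.Random(seed)       # a seed makes the shuffle repeatable
    shuffled = list(participants)   # copy so the original list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# HYPOTHETICAL participants for a study-time experiment.
students = [f"student_{i}" for i in range(1, 21)]
treatment, control = randomly_assign(students, seed=42)

print("treatment:", treatment)
print("control:  ", control)
```

Because chance alone decides who lands in each group, confounding variables such as prior ability tend to balance out between the groups, especially when the groups are large.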

Check Your Understanding

Question 1: A researcher finds r = -0.75 between hours of TV watched per week and GPA. What does this tell us?

Answer: This indicates a strong negative linear relationship. Students who watch more TV tend to have lower GPAs, and vice versa. The relationship is fairly strong (|r| = 0.75 is close to 1).

Important: This does NOT mean watching TV causes lower GPAs! There could be confounding variables (time management skills, interest in academics, etc.).

Question 2: If r = 0.40 between height and salary, can we conclude that being taller causes higher salary?

Answer: No! Correlation does not imply causation. While there's a moderate positive correlation, this could be due to:

Confounding variables: Gender (men tend to be taller and historically have earned more)
Confounding variables: Occupation (certain high-paying jobs may favor taller individuals, like professional basketball)
Coincidence: Even a moderate correlation such as r = 0.40 can arise by chance, especially in a small sample

To establish causation, we'd need experimental evidence, not just correlation.

Question 3: A scatter plot shows a strong curved pattern, but r = 0.05. Is there a relationship between the variables?

Answer: Yes, there IS a relationship! The correlation coefficient r only measures LINEAR relationships. A value near 0 means no linear relationship, but there can still be a strong non-linear (curved) relationship.

Lesson: Always look at the scatter plot! Don't rely on r alone.
