Learn Without Walls

Lesson 1: Introduction to Linear Regression

Understanding bivariate data, scatter plots, and correlation

Learning Objectives

By the end of this lesson, you will be able to:

1. Bivariate Quantitative Data

So far in this course, we have mostly analyzed one variable at a time (univariate data): the distribution of test scores, the average height of students, or the proportion who prefer option A. Many interesting questions, however, involve two variables measured on the same individuals.

Bivariate Data

Bivariate data consists of pairs of observations on two quantitative variables for each individual or case.

Example: (height, weight) for each person: (65 inches, 140 lbs), (72 inches, 180 lbs), etc.

When we have bivariate data, we often want to know:

  1. Is there a relationship between the two variables?
  2. How strong is that relationship?
  3. What is the direction of the relationship (positive or negative)?
  4. Can we predict one variable from the other?

Explanatory vs Response Variable

When analyzing relationships, we often have:

Variable Type | Also Called | Symbol | Description
Explanatory | Independent, Predictor | x | The variable we use to explain or predict
Response | Dependent, Outcome | y | The variable we're trying to explain or predict

Example: If we want to predict test scores from study hours, study hours is the explanatory variable (x) and test score is the response variable (y).
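A minimal sketch of how bivariate data might be stored and split into its two variables. The five (study hours, test score) pairs are the ones used in the worked example later in this lesson; the variable names are illustrative choices, not part of the lesson.

```python
# Each tuple is one student: (study hours, test score).
pairs = [(2, 65), (4, 75), (6, 80), (8, 90), (10, 95)]

# Split the pairs into the explanatory variable (x) and the response variable (y).
x = [hours for hours, _ in pairs]
y = [score for _, score in pairs]

print("x (explanatory):", x)
print("y (response):   ", y)
```

Keeping the data as pairs makes it hard to accidentally match one student's hours with another student's score.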

2. Scatter Plots

The first step in analyzing bivariate data is to visualize it with a scatter plot.

Scatter Plot

A scatter plot is a graph that displays bivariate data as points in a coordinate system:

  • x-axis (horizontal): Explanatory variable
  • y-axis (vertical): Response variable
  • Each point represents one individual/case: (x, y)
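As a rough illustration of the layout just described, the sketch below draws a text-only scatter plot using nothing but the Python standard library. In practice a plotting library would normally be used; the function name `text_scatter` and the grid size are assumptions made for this example, and the data are the study-hours values from the worked example later in the lesson.

```python
def text_scatter(x, y, width=40, height=12):
    """Return a rough text scatter plot: x runs left to right, y bottom to top.

    Assumes x and y each contain at least two different values.
    """
    x_min, x_max = min(x), max(x)
    y_min, y_max = min(y), max(y)
    grid = [[" "] * width for _ in range(height)]
    for xi, yi in zip(x, y):
        # Scale each value to a column/row index inside the grid.
        col = round((xi - x_min) / (x_max - x_min) * (width - 1))
        row = round((yi - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width


hours = [2, 4, 6, 8, 10]        # explanatory variable on the horizontal axis
scores = [65, 75, 80, 90, 95]   # response variable on the vertical axis
plot = text_scatter(hours, scores)
print(plot)
```

Each `*` is one student. The points rise from the lower left to the upper right, which is what a positive association looks like.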

Example: Study Hours vs Test Scores

Describing Patterns in Scatter Plots

When examining a scatter plot, we look for three characteristics:

Characteristic | What to Look For | Examples
1. Direction | Positive, negative, or no association | Positive: As x increases, y tends to increase. Negative: As x increases, y tends to decrease. None: No clear pattern.
2. Form | Linear or non-linear | Linear: Points follow a straight-line pattern. Curved: Points follow a curved pattern. No form: Random scatter.
3. Strength | How closely points follow the pattern | Strong: Points cluster tightly around the pattern. Moderate: Points are somewhat scattered. Weak: Points are widely scattered.

Correlation Patterns

Example: Describing Scatter Plots

Scenario 1: Height (x) vs Weight (y)

Description: The scatter plot shows a positive, linear, moderately strong relationship. As height increases, weight tends to increase. Points follow a roughly straight-line pattern with some scatter.

Scenario 2: Age of Car (x) vs Resale Value (y)

Description: The scatter plot shows a negative, linear, strong relationship. As the age of the car increases, its resale value tends to decrease. Points cluster tightly around a downward-sloping pattern.

Scenario 3: Shoe Size (x) vs IQ (y)

Description: The scatter plot shows no association. There is no clear pattern or relationship between shoe size and IQ. Points are randomly scattered.

3. The Correlation Coefficient (r)

While scatter plots let us see relationships, the correlation coefficient gives us a numerical measure of the strength and direction of a linear relationship.

Correlation Coefficient (r)

The correlation coefficient, denoted r, measures the strength and direction of the linear relationship between two quantitative variables.

Properties:

  • Range: -1 ≤ r ≤ 1 (always between -1 and 1)
  • r = 1: Perfect positive linear relationship (all points exactly on an upward line)
  • r = -1: Perfect negative linear relationship (all points exactly on a downward line)
  • r = 0: No linear relationship
  • Sign: Positive r = positive association; Negative r = negative association
  • Magnitude: |r| closer to 1 = stronger relationship; |r| closer to 0 = weaker relationship

Interpreting r Values

r Value | Interpretation | Example
r = 1.0 | Perfect positive linear relationship | Celsius and Fahrenheit temperature
0.7 ≤ r < 1.0 | Strong positive linear relationship | Height and weight (r ≈ 0.80)
0.3 ≤ r < 0.7 | Moderate positive linear relationship | Study hours and test scores (r ≈ 0.50)
0 < r < 0.3 | Weak positive linear relationship | Age and height in adults (r ≈ 0.10)
r = 0 | No linear relationship | Shoe size and IQ
-0.3 < r < 0 | Weak negative linear relationship | Age and reflexes (r ≈ -0.20)
-0.7 < r ≤ -0.3 | Moderate negative linear relationship | Distance from equator and temperature (r ≈ -0.60)
-1.0 < r ≤ -0.7 | Strong negative linear relationship | Car age and resale value (r ≈ -0.85)
r = -1.0 | Perfect negative linear relationship | Questions answered correctly and questions missed on the same test

Note: Speed and travel time over a fixed distance are not an example of r = -1. Time = distance / speed is a curved (inverse) relationship, so the correlation is strongly negative but not exactly -1.
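The cutoffs in the table above can be turned into a small helper. This is only a sketch: the function name `describe_r` is made up for this example, and the 0.3 and 0.7 boundaries are this lesson's convention rather than a universal rule.

```python
def describe_r(r):
    """Label r using the lesson's cutoffs (0.3 and 0.7 on the absolute value)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and 1")
    if r == 0:
        return "no linear relationship"
    size = abs(r)
    if size == 1.0:
        strength = "perfect"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.3:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear relationship"


for value in (0.80, 0.50, -0.20, -0.85):
    print(value, "->", describe_r(value))
```

The sign of r decides the direction and the absolute value decides the strength, so the two are handled separately.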

Formula for Correlation Coefficient

The correlation coefficient is calculated as:

r = Σ[(x - x̄)(y - ȳ)] / √[Σ(x - x̄)² × Σ(y - ȳ)²]

Where:

  • x̄ = mean of x values
  • ȳ = mean of y values
  • Σ = sum across all data points

Note: In practice, we use technology (calculators, software) to compute r. The formula helps us understand what r measures: how x and y vary together relative to their individual variations.
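For readers who want to see the formula as code, here is a minimal sketch in plain Python. The function name `pearson_r` and the temperature readings are assumptions made for illustration; statistical software provides equivalent built-in functions.

```python
from math import sqrt


def pearson_r(x, y):
    """Correlation coefficient r, computed directly from the formula above."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    # Numerator: how x and y vary together.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: how much x and y each vary on their own.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)  # undefined if either variable is constant


# Made-up Celsius readings and their exact Fahrenheit conversions.
celsius = [0, 10, 20, 30, 40]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]
print(pearson_r(celsius, fahrenheit))  # 1.0, up to floating-point rounding
```

Because Fahrenheit is an exact increasing linear function of Celsius, every point lies on one upward line and r comes out as 1.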

Example: Calculating Correlation

Let's calculate the correlation between study hours and test scores for 5 students:

Student | Study Hours (x) | Test Score (y)
A | 2 | 65
B | 4 | 75
C | 6 | 80
D | 8 | 90
E | 10 | 95

Step 1: Calculate means

  • x̄ = (2 + 4 + 6 + 8 + 10) / 5 = 30 / 5 = 6 hours
  • ȳ = (65 + 75 + 80 + 90 + 95) / 5 = 405 / 5 = 81 points

Step 2: Calculate deviations and products

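x | y | (x - x̄) | (y - ȳ) | (x - x̄)(y - ȳ) | (x - x̄)² | (y - ȳ)²
2 | 65 | -4 | -16 | 64 | 16 | 256
4 | 75 | -2 | -6 | 12 | 4 | 36
6 | 80 | 0 | -1 | 0 | 0 | 1
8 | 90 | 2 | 9 | 18 | 4 | 81
10 | 95 | 4 | 14 | 56 | 16 | 196
Sums: | | | | 150 | 40 | 570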

Step 3: Apply the formula

r = 150 / √(40 × 570) = 150 / √22,800 ≈ 150 / 151.0 ≈ 0.993

Interpretation: r = 0.993 indicates a very strong positive linear relationship between study hours and test scores. More study hours are strongly associated with higher test scores.
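The three steps of this example can be checked with a short script. It is a sketch that mirrors the hand calculation; the variable names are chosen for this illustration.

```python
from math import sqrt

hours = [2, 4, 6, 8, 10]
scores = [65, 75, 80, 90, 95]

# Step 1: means.
x_bar = sum(hours) / len(hours)    # 6.0
y_bar = sum(scores) / len(scores)  # 81.0

# Step 2: one row per student with the deviations, their product, and their squares.
rows = [
    (x - x_bar, y - y_bar, (x - x_bar) * (y - y_bar), (x - x_bar) ** 2, (y - y_bar) ** 2)
    for x, y in zip(hours, scores)
]
sum_products = sum(row[2] for row in rows)  # 150.0
sum_sq_x = sum(row[3] for row in rows)      # 40.0
sum_sq_y = sum(row[4] for row in rows)      # 570.0

# Step 3: apply the formula.
r = sum_products / sqrt(sum_sq_x * sum_sq_y)
print(round(r, 3))  # 0.993
```

Printing `rows` shows the same deviations and products as the table in Step 2.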

4. Properties and Limitations of Correlation

Important Properties of r

Properties of the Correlation Coefficient

  1. Unitless: r has no units (not hours, pounds, etc.). This allows us to compare correlations across different contexts.
  2. Symmetric: The correlation of x with y equals the correlation of y with x. r(x,y) = r(y,x).
  3. Measures LINEAR relationships only: r only detects straight-line patterns. A curved relationship might have r ≈ 0 even if there's a strong non-linear relationship.
  4. Sensitive to outliers: A single extreme point can dramatically change r, in either direction.
  5. Not resistant: Because r is computed from means and squared deviations, it has no protection against extreme values. Check the scatter plot for outliers before interpreting r.
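The properties listed above can be checked numerically. The sketch below reuses the study-hours data from the worked example; the helper `pearson_r` and the extra outlier point (20 hours, score 50) are made up for this illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Correlation coefficient computed from the lesson's formula."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


hours = [2, 4, 6, 8, 10]
scores = [65, 75, 80, 90, 95]
r = pearson_r(hours, scores)

# Unitless: measuring study time in minutes instead of hours leaves r unchanged.
minutes = [h * 60 for h in hours]
r_minutes = pearson_r(minutes, scores)

# Symmetric: swapping the roles of x and y leaves r unchanged.
r_swapped = pearson_r(scores, hours)

# Not resistant: one hypothetical outlier (20 hours, score 50) changes r drastically.
r_outlier = pearson_r(hours + [20], scores + [50])

print(round(r, 3), round(r_minutes, 3), round(r_swapped, 3), round(r_outlier, 3))
```

With the single added point, r drops from about 0.99 to a negative value, even though the other five students still follow the same upward pattern.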

Correlation Measures LINEAR Relationships Only

A correlation close to 0 means no LINEAR relationship, but there could still be a strong non-linear (curved) relationship!

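Example: Height versus age in children (ages 0-18) follows a curve, because growth slows in the late teens. The linear correlation is still positive, but r alone does not reveal the curve, and a straight line describes the pattern poorly.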
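To see how a strong curved relationship can produce a correlation of zero, consider the sketch below. The data are constructed for illustration (y is exactly x squared), and `pearson_r` is a helper written for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Correlation coefficient computed from the lesson's formula."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]  # y is completely determined by x, but not linearly

print(pearson_r(x, y))  # 0.0: no LINEAR relationship despite a perfect curved one
```

The positive and negative products in the numerator cancel exactly because the U-shaped pattern is symmetric around the mean of x.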

Correlation Does NOT Imply Causation

This is perhaps the most important concept in all of statistics:

CORRELATION ≠ CAUSATION

Just because two variables are correlated does NOT mean one causes the other!

Example 1: Ice cream sales and drowning deaths are positively correlated. Does ice cream cause drowning? No! Both increase in summer due to a third variable: hot weather (confounding variable).

Example 2: Number of firefighters at a fire and amount of fire damage are positively correlated. Do firefighters cause damage? No! Bigger fires require more firefighters AND cause more damage.

Example 3: Shoe size and reading ability are positively correlated in children. Do bigger feet cause better reading? No! Both increase with age.

Three possible explanations for correlation:

  1. x causes y: Study hours → higher test scores (plausible)
  2. y causes x: Higher test scores → more studying (reverse causation)
  3. Third variable (confounding): Both x and y are caused by something else (intelligence → more studying AND higher scores)
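The third explanation can be illustrated with a small simulation. Everything here is synthetic: the temperature range, the coefficients, and the noise levels are assumptions chosen only to show the effect, not real sales or drowning figures.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Correlation coefficient computed from the lesson's formula."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


random.seed(0)  # fixed seed so the made-up data is reproducible

# Daily temperature drives BOTH variables; neither one affects the other.
temperature = [random.uniform(10, 35) for _ in range(200)]
ice_cream_sales = [20 * t + random.gauss(0, 60) for t in temperature]
drownings = [0.2 * t + random.gauss(0, 1) for t in temperature]

r_confounded = pearson_r(ice_cream_sales, drownings)
print(round(r_confounded, 2))
```

The two simulated variables are strongly positively correlated even though the code never lets one influence the other. The only link between them is the shared dependence on temperature.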

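To establish causation, you need experimental evidence, such as a well-designed experiment with random assignment, not just an observed correlation.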

Check Your Understanding

Question 1: A researcher finds r = -0.75 between hours of TV watched per week and GPA. What does this tell us?

Answer: This indicates a strong negative linear relationship. Students who watch more TV tend to have lower GPAs, and vice versa. The relationship is fairly strong (|r| = 0.75 is close to 1).

Important: This does NOT mean watching TV causes lower GPAs! There could be confounding variables (time management skills, interest in academics, etc.).

Question 2: If r = 0.40 between height and salary, can we conclude that being taller causes higher salary?

Answer: No! Correlation does not imply causation. While there's a moderate positive correlation, this could be due to:

  • Confounding variables: Gender (men tend to be taller and historically have earned more)
  • Confounding variables: Occupation (certain high-paying jobs may favor taller individuals, like professional basketball)
  • Coincidence: In a small sample, even a moderate correlation such as r = 0.40 could arise by chance

To establish causation, we'd need experimental evidence, not just correlation.

Question 3: A scatter plot shows a strong curved pattern, but r = 0.05. Is there a relationship between the variables?

Answer: Yes, there IS a relationship! The correlation coefficient r only measures LINEAR relationships. A value near 0 means no linear relationship, but there can still be a strong non-linear (curved) relationship.

Lesson: Always look at the scatter plot! Don't rely on r alone.
