Learn Without Walls

Practice Problems: Simple Linear Regression

20 comprehensive problems covering correlation, regression, hypothesis testing, and intervals

Part 1: Correlation and Scatter Plots (Problems 1-5)

Covers Lesson 1 content: bivariate data, correlation, scatter plot interpretation

Problem 1: Interpreting Correlation

A study of 50 adults found the following correlations:

  • Age and blood pressure: r = 0.72
  • Age and reaction time: r = -0.68

a) Describe the relationship between age and blood pressure.

b) Describe the relationship between age and reaction time.

c) Which relationship is strongest? How do you know?

d) Can we conclude that aging causes higher blood pressure? Explain.

Solution:

a) Age and blood pressure (r = 0.72):

There is a strong positive linear relationship: as age increases, blood pressure tends to increase. With r = 0.72, the points cluster fairly closely around an upward-sloping line.

b) Age and reaction time (r = -0.68):

There is a moderately strong negative linear relationship: as age increases, the reaction-time measurement tends to decrease. That direction makes sense if the variable records reaction speed, since older adults tend to respond more slowly; if it recorded the time taken to react, we would expect a positive correlation instead. The magnitude |r| = 0.68 indicates a moderate-to-strong association.

c) Strongest relationship:

Age and blood pressure have the strongest relationship because |r| = 0.72 is the largest. Strength is judged by absolute value: |0.72| > |-0.68| > |0.05|.

d) Causation?

No. Correlation does not imply causation. The strong correlation could reflect:

  • Confounding variables (diet, exercise, genetics, stress)
  • Lifestyle factors associated with aging that also affect blood pressure

Establishing causation would require a randomized experiment, which is not possible here because age cannot be assigned to participants.

Problem 2: Calculating Correlation

Six students reported their hours of sleep before an exam and their exam scores:

Student            A    B    C    D    E    F
Sleep (x, hours)   4    5    6    7    8    9
Score (y, points)  65   70   78   82   88   90

a) Calculate x̄ and ȳ.

b) Given: Σ(x - x̄)² = 17.5, Σ(y - ȳ)² = 514.5, Σ(x - x̄)(y - ȳ) = 91. Calculate r.

c) Interpret the correlation coefficient.

Solution:

a) Calculate means:

x̄ = (4 + 5 + 6 + 7 + 8 + 9) / 6 = 39 / 6 = 6.5 hours

ȳ = (65 + 70 + 78 + 82 + 88 + 90) / 6 = 473 / 6 = 78.83 points

b) Calculate r:

Using the formula: r = Σ[(x - x̄)(y - ȳ)] / √[Σ(x - x̄)² × Σ(y - ȳ)²]

r = 91 / √(17.5 × 514.5)

r = 91 / √9,003.75

r = 91 / 94.89

r = 0.959

c) Interpretation:

r = 0.959 indicates a very strong positive linear relationship between hours of sleep and exam scores. Students who sleep more tend to score higher. With r this close to 1, the six points lie almost on a straight line.
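
The arithmetic in part (b) can be checked with a short Python sketch. It applies the same formula twice: once with the three sums given in the problem, and once with sums recomputed from the table. The function and variable names are illustrative, not part of the lesson. Recomputing from the six data points gives slightly different sums (Σ(x - x̄)(y - ȳ) = 91.5 and Σ(y - ȳ)² ≈ 488.83), so the second value of r comes out a little higher than the worked answer, which uses the given sums.

```python
import math

def correlation_from_sums(sxy, sxx, syy):
    """Pearson r from the three sums: r = Sxy / sqrt(Sxx * Syy)."""
    return sxy / math.sqrt(sxx * syy)

# Sums exactly as given in part (b)
r_given = correlation_from_sums(sxy=91, sxx=17.5, syy=514.5)

# Same formula, with the sums recomputed from the data table
sleep = [4, 5, 6, 7, 8, 9]
score = [65, 70, 78, 82, 88, 90]
x_bar = sum(sleep) / len(sleep)
y_bar = sum(score) / len(score)
sxx = sum((x - x_bar) ** 2 for x in sleep)
syy = sum((y - y_bar) ** 2 for y in score)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sleep, score))
r_data = correlation_from_sums(sxy, sxx, syy)

print(round(r_given, 3))  # matches the worked answer, 0.959
print(round(r_data, 3))   # about 0.989 when computed from the table
```

Either way the conclusion in part (c) is unchanged: the relationship is very strong and positive.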

Problem 3: Scatter Plot Analysis

Describe the direction, form, and strength of the relationship you would expect to see in scatter plots for each of these pairs:

a) Number of hours exercising per week (x) vs. resting heart rate (y)

b) Years of education (x) vs. annual income (y)

c) Outdoor temperature (x) vs. hot chocolate sales (y)

d) Shoe size (x) vs. intelligence (y)

Solution:

a) Exercise hours vs. resting heart rate:

  • Direction: Negative (as exercise increases, resting heart rate decreases)
  • Form: Roughly linear
  • Strength: Moderate (other factors like genetics, age affect heart rate)

b) Years of education vs. income:

  • Direction: Positive (more education → higher income)
  • Form: Roughly linear
  • Strength: Moderate (education is one of many factors affecting income)

c) Temperature vs. hot chocolate sales:

  • Direction: Negative (higher temperature → fewer sales)
  • Form: Linear
  • Strength: Moderately strong (people drink hot beverages when it's cold)

d) Shoe size vs. intelligence:

  • Direction: No clear direction
  • Form: No form (random scatter)
  • Strength: No relationship (r ≈ 0)
  • These variables are unrelated!

Problem 4: Properties of Correlation

True or False? Explain your reasoning.

a) If r = 0.9 for the relationship between x and y, then r = 0.9 for the relationship between y and x.

b) If r = 0, there is no relationship between x and y.

c) The correlation between height in inches and weight in pounds is the same as the correlation between height in centimeters and weight in kilograms.

d) A correlation of r = -0.8 is stronger than a correlation of r = 0.5.

Solution:

a) True. Correlation is symmetric: r(x,y) = r(y,x). The correlation between x and y equals the correlation between y and x.

b) False. If r = 0, there is no LINEAR relationship, but there could still be a strong non-linear (curved) relationship. Always look at the scatter plot!

c) True. Correlation is unitless. Changing units doesn't change the correlation coefficient. Whether you measure in inches/pounds or cm/kg, r remains the same.

d) True. Strength is measured by |r| (absolute value). |-0.8| = 0.8 > |0.5| = 0.5, so r = -0.8 indicates a stronger relationship. The sign indicates direction, not strength.
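
Parts (a) to (c) can be checked numerically. The sketch below uses small made-up data sets (the heights, weights, and the helper `pearson_r` are invented for illustration) to show that r is unchanged when x and y swap roles or when units change, and that r can be 0 even when y is completely determined by x.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (inches) and weights (pounds), invented for illustration
height_in = [62, 65, 68, 70, 73]
weight_lb = [120, 140, 155, 165, 190]

r_xy = pearson_r(height_in, weight_lb)
r_yx = pearson_r(weight_lb, height_in)            # (a) swap x and y

height_cm = [h * 2.54 for h in height_in]         # (c) convert units
weight_kg = [w * 0.45359237 for w in weight_lb]
r_metric = pearson_r(height_cm, weight_kg)

# (b) y = x^2 is a perfect curved relationship, yet r = 0
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
r_curve = pearson_r(xs, ys)

print(r_xy, r_yx, r_metric, r_curve)
```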

Problem 5: Correlation and Causation

A researcher finds a strong positive correlation (r = 0.85) between the number of fire trucks at a fire and the amount of damage (in dollars) caused by the fire.

a) Does this mean that fire trucks cause damage? Explain.

b) What's a more plausible explanation for this correlation?

c) Give another real-world example of a correlation that doesn't imply causation.

Solution:

a) No! Fire trucks do not cause damage. This is a classic example of correlation without causation.

b) Confounding variable:

The size/severity of the fire is a confounding variable that affects both:

  • Number of fire trucks: Bigger fires require more trucks
  • Amount of damage: Bigger fires cause more damage

So both variables increase together because they're both caused by fire severity, not because one causes the other.

c) Other examples:

  • Ice cream sales and drowning deaths (confounded by hot weather/summer)
  • Shoe size and reading ability in children (confounded by age)
  • Number of churches and number of bars in cities (confounded by population size)
  • Nicolas Cage films and pool drownings (spurious correlation/coincidence)

Part 2: The Regression Equation (Problems 6-10)

Covers Lesson 2 content: slope, intercept, predictions, residuals, r²

Problem 6: Finding the Regression Equation

For the sleep and exam score data from Problem 2:

a) Calculate the slope (b₁).

b) Calculate the intercept (b₀).

c) Write the regression equation.

d) Interpret the slope in context.

e) Interpret the intercept in context.

Solution:

a) Slope:

b₁ = r × (sᵧ / sₓ), where sₓ = √(17.5 / 5) = 1.871 and sᵧ = √(514.5 / 5) = 10.144

b₁ = 0.959 × (10.144 / 1.871) = 0.959 × 5.422 ≈ 5.20 points/hour

Equivalently, b₁ = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² = 91 / 17.5 = 5.20.

b) Intercept:

b₀ = ȳ - b₁x̄ = 78.83 - (5.20)(6.5)

b₀ = 78.83 - 33.80 = 45.03 points

c) Regression equation:

ŷ = 45.03 + 5.20x

Where x = hours of sleep, ŷ = predicted exam score

d) Slope interpretation:

For each additional hour of sleep, the predicted exam score increases by approximately 5.2 points on average.

e) Intercept interpretation:

When hours of sleep = 0, the predicted exam score is 45.03 points. However, this is not a meaningful prediction: 0 hours of sleep is outside the range of our data (4 to 9 hours) and not realistic.
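
As a cross-check, the slope can be computed two ways from the sums given in Problem 2. This is a minimal sketch with illustrative variable names; it relies on the fact that r × (sᵧ / sₓ) simplifies algebraically to Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², so both routes should agree.

```python
import math

# Values taken from Problem 2
n = 6
sxx, syy, sxy = 17.5, 514.5, 91
x_bar, y_bar = 39 / 6, 473 / 6

# Route 1: b1 = r * (s_y / s_x)
r = sxy / math.sqrt(sxx * syy)
s_x = math.sqrt(sxx / (n - 1))   # sample standard deviation of x
s_y = math.sqrt(syy / (n - 1))   # sample standard deviation of y
b1_from_r = r * s_y / s_x

# Route 2: b1 = Sxy / Sxx (same quantity after cancelling terms)
b1 = sxy / sxx

# Intercept: the fitted line passes through (x_bar, y_bar)
b0 = y_bar - b1 * x_bar

print(round(b1_from_r, 3), round(b1, 3), round(b0, 2))
```

Carried at full precision, the slope is 5.2 and the intercept is about 45.03. Hand calculations that round r, sₓ, or sᵧ along the way can land slightly off these values.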

Problem 7: Making Predictions

The regression equation for predicting car price (in thousands of dollars) from age (in years) is:

ŷ = 28 - 2.1x

a) Predict the price of a 3-year-old car.

b) Predict the price of a 10-year-old car.

c) If the data ranged from 1 to 8 years old, which prediction is more reliable? Why?

d) What does the slope tell us about how car prices change with age?

Solution:

a) 3-year-old car:

ŷ = 28 - 2.1(3) = 28 - 6.3 = 21.7 thousand dollars, or $21,700

b) 10-year-old car:

ŷ = 28 - 2.1(10) = 28 - 21 = 7 thousand dollars, or $7,000

c) More reliable prediction:

The 3-year-old car prediction is more reliable because:

  • 3 years is within the data range (1-8 years) → interpolation
  • 10 years is outside the data range → extrapolation
  • Extrapolation assumes the linear relationship continues beyond the observed data, which may not be true
  • Very old cars may not depreciate linearly (might level off)

d) Slope interpretation:

The slope of -2.1 means that for each additional year of age, the car's value decreases by $2,100 on average. The negative slope indicates that older cars are worth less.
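
The distinction between interpolation and extrapolation in part (c) can be made explicit in code. The sketch below is one possible way to do it: the function name `predict_price` and its `data_range` parameter are hypothetical, and the range of 1 to 8 years comes from the problem statement.

```python
def predict_price(age_years, data_range=(1, 8)):
    """Return (predicted price in $1000s, is_extrapolation) for y-hat = 28 - 2.1x."""
    price = 28 - 2.1 * age_years
    lo, hi = data_range
    is_extrapolation = not (lo <= age_years <= hi)
    return price, is_extrapolation

for age in (3, 10):
    price, extrap = predict_price(age)
    label = "extrapolation" if extrap else "interpolation"
    print(f"age {age}: ${price * 1000:,.0f} ({label})")
```

Returning the flag alongside the prediction keeps the warning attached to the number, so a prediction outside the observed ages is never reported without its caveat.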

Problem 8: Residuals

Using the car price equation from Problem 7: ŷ = 28 - 2.1x

Three cars have the following data:

Car   Age (years)   Actual Price ($1000s)
A     2             25
B     5             16
C     7             12

a) Calculate the predicted price for each car.

b) Calculate the residual for each car.

c) Which car's actual price was most different from its predicted price?

Solution:

a) Predicted prices:

  • Car A: ŷ = 28 - 2.1(2) = 28 - 4.2 = 23.8 thousand
  • Car B: ŷ = 28 - 2.1(5) = 28 - 10.5 = 17.5 thousand
  • Car C: ŷ = 28 - 2.1(7) = 28 - 14.7 = 13.3 thousand

b) Residuals (y - ŷ):

  • Car A: 25 - 23.8 = +1.2 thousand (actual price $1,200 higher than predicted)
  • Car B: 16 - 17.5 = -1.5 thousand (actual price $1,500 lower than predicted)
  • Car C: 12 - 13.3 = -1.3 thousand (actual price $1,300 lower than predicted)

c) Most different:

Car B has the largest residual in absolute value (|−1.5| = 1.5). Its actual price was $1,500 below the predicted value, the biggest discrepancy. This car might be in poor condition or have high mileage.
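
The same residual calculation can be written as a short loop. This is a sketch using the three cars from the table; the dictionary layout is just one convenient choice.

```python
# name: (age in years, actual price in $1000s), from the table in this problem
cars = {"A": (2, 25), "B": (5, 16), "C": (7, 12)}

residuals = {}
for name, (age, actual) in cars.items():
    predicted = 28 - 2.1 * age            # y-hat from the regression equation
    residuals[name] = actual - predicted  # residual = y - y-hat

# Car whose actual price is farthest from its prediction, ignoring sign
largest = max(residuals, key=lambda name: abs(residuals[name]))

for name, res in residuals.items():
    print(f"Car {name}: residual = {res:+.1f} thousand")
print("Largest |residual|:", largest)
```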

Problem 9: Coefficient of Determination (r²)

For each scenario, calculate r² and interpret it:

a) Hours studied vs. test score: r = 0.60

b) Height vs. weight: r = -0.80

c) Temperature vs. ice cream sales: r = 0.90

Solution:

a) Hours studied vs. test score:

r² = (0.60)² = 0.36 = 36%

Interpretation: 36% of the variation in test scores is explained by hours studied. The remaining 64% is due to other factors (aptitude, prior knowledge, teaching quality, etc.).

b) Height vs. weight:

r² = (-0.80)² = 0.64 = 64%

Interpretation: 64% of the variation in weight is explained by height. Note that r² is positive even though r is negative! The remaining 36% is due to other factors (muscle mass, diet, body composition, etc.).

c) Temperature vs. ice cream sales:

r² = (0.90)² = 0.81 = 81%

Interpretation: 81% of the variation in ice cream sales is explained by temperature. Temperature is an excellent predictor! Only 19% is due to other factors (day of week, location, promotions, etc.).

Problem 10: Interpreting Regression Output

A regression analysis predicting college GPA from high school GPA yields:

  • Regression equation: ŷ = 0.5 + 0.85x (x = high school GPA, ŷ = predicted college GPA)
  • r² = 0.49
  • sₑ = 0.3

a) Interpret the slope.

b) Predict the college GPA for a student with a 3.5 high school GPA.

c) Interpret r².

d) What does sₑ = 0.3 tell us?

Solution:

a) Slope interpretation:

For each 1-point increase in high school GPA, college GPA increases by 0.85 points on average.

b) Prediction:

ŷ = 0.5 + 0.85(3.5) = 0.5 + 2.975 = 3.475

We predict a college GPA of approximately 3.48 for a student with a 3.5 high school GPA.

c) r² interpretation:

r² = 0.49 means 49% of the variation in college GPA is explained by high school GPA. The remaining 51% is due to other factors (study habits, course difficulty, adjustment to college, work hours, etc.).

d) Standard error interpretation:

sₑ = 0.3 means the typical prediction error is about 0.3 GPA points. On average, our predictions are off by about 0.3 points above or below the actual college GPA.

Part 3: Hypothesis Testing in Regression (Problems 11-15)

Covers Lesson 3 content: testing correlation and slope, checking conditions

Problem 11: Testing Correlation

A researcher collects data on 30 students to examine the relationship between hours of social media use per day (x) and hours of sleep per night (y). The correlation is r = -0.42.

Test at α = 0.05 whether there is a significant linear relationship.

a) State the hypotheses.

b) Calculate the test statistic.

c) Find the critical value (df = 28, two-tailed, α = 0.05: t* = ±2.048).

d) Make a decision and state your conclusion.

Solution:

a) Hypotheses:

  • H₀: ρ = 0 (No linear relationship between social media use and sleep)
  • Hₐ: ρ ≠ 0 (There is a linear relationship)

b) Test statistic:

t = r√[(n-2)/(1-r²)]

t = -0.42√[(30-2)/(1-(-0.42)²)]

t = -0.42√[28/(1-0.1764)]

t = -0.42√[28/0.8236]

t = -0.42√34.0

t = -0.42 × 5.831

t = -2.449

c) Critical value:

t* = ±2.048 (given)

d) Decision and conclusion:

Since |t| = |-2.449| = 2.449 > 2.048, we reject H₀.

Conclusion: There is sufficient evidence to conclude that there is a significant negative linear relationship between hours of social media use and hours of sleep (t = -2.449, p < 0.05). Students who use social media more tend to sleep fewer hours.
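
The test statistic and decision can be reproduced in a few lines. This sketch uses the critical value given in the problem rather than looking it up; the function name is illustrative.

```python
import math

def t_for_correlation(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

t = t_for_correlation(r=-0.42, n=30)
t_crit = 2.048                 # given in the problem for df = 28, two-tailed, alpha = 0.05
reject = abs(t) > t_crit       # two-tailed test compares |t| with the critical value

print(round(t, 3), reject)
```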

Problem 12: Testing Slope

A regression analysis examines whether square footage (x) predicts house price (y) in thousands of dollars. From a sample of 50 houses:

  • Slope: b₁ = 0.15
  • Standard error of the slope: SE(b₁) = 0.04

Test at α = 0.01 whether there is a significant relationship.

a) State the hypotheses.

b) Calculate the test statistic.

c) Using df = 48 and α = 0.01 (two-tailed), the critical value is approximately t* = ±2.68. Make a decision.

d) State your conclusion.

Solution:

a) Hypotheses:

  • H₀: β₁ = 0 (No linear relationship between square footage and price)
  • Hₐ: β₁ ≠ 0 (There is a linear relationship between square footage and price)

b) Test statistic:

t = (b₁ - 0) / SE(b₁)

t = (0.15 - 0) / 0.04

t = 3.75

c) Decision:

Since |t| = 3.75 > 2.68, we reject H₀.

d) Conclusion:

There is sufficient evidence at α = 0.01 to conclude that square footage is a significant predictor of house price (t = 3.75, p < 0.01). Larger houses tend to have higher prices.

Problem 13: LINE Conditions

For each scenario, identify which LINE condition (if any) is violated:

a) The residual plot shows a clear U-shaped curve.

b) The residual plot shows a "fan shape" - narrow on the left, wide on the right.

c) Data comes from a time series where each observation depends on the previous one.

d) The residual plot shows random scatter with constant spread around zero.

Solution:

a) U-shaped curve:

Linearity condition violated. The relationship is not linear - it's curved. A linear model is not appropriate. Consider a quadratic or other non-linear model.

b) Fan shape:

Equal variance (homoscedasticity) condition violated. The spread of residuals is not constant - it increases with x. This violates the assumption that errors have constant variance. May need to transform variables.

c) Time series with dependence:

Independence condition violated. If observations are not independent (each depends on previous), the standard errors and hypothesis tests may be invalid. Need time series methods instead of simple linear regression.

d) Random scatter with constant spread:

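No violation is visible. Random scatter around zero with constant spread is what we want to see: it supports the linearity and equal variance conditions. Independence and normality cannot be judged from this plot alone and still need to be checked separately (study design for independence, a histogram or normal probability plot of residuals for normality).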

Problem 14: Equivalence of Tests

For a dataset with n = 25:

A test of H₀: ρ = 0 gave t = 3.21.

a) If you tested H₀: β₁ = 0 with the same data, what t-value would you expect to get?

b) Explain why these two tests give the same result.

c) Using α = 0.05 and df = 23 (critical value ≈ 2.069), what conclusion would you reach?

Solution:

a) Expected t-value:

You would get t = 3.21 (the same value, or very close due to rounding).

b) Why are they equivalent?

Testing ρ = 0 and testing β₁ = 0 are mathematically equivalent:

  • If the population correlation is zero (ρ = 0), then the population slope must be zero (β₁ = 0)
  • If the population slope is zero (β₁ = 0), then the population correlation must be zero (ρ = 0)
  • Both test the same fundamental question: "Is there a linear relationship?"
  • The formulas yield the same t-statistic (possibly with slight rounding differences)

c) Conclusion:

Since t = 3.21 > 2.069, we reject H₀ for both tests. There is sufficient evidence of a significant linear relationship between x and y.
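
The equivalence can be seen numerically on any data set. The sketch below uses a small made-up sample (the `xs` and `ys` values are invented for illustration and are not the n = 25 data from the problem) and computes the t-statistic both ways: from r, and from b₁ / SE(b₁), where SE(b₁) = sₑ / √Σ(x - x̄)².

```python
import math

# Hypothetical data, invented for illustration (not from the problem)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9, 8.4, 8.8]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Test of rho = 0
r = sxy / math.sqrt(sxx * syy)
t_from_r = r * math.sqrt((n - 2) / (1 - r ** 2))

# Test of beta1 = 0
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s_e = math.sqrt(sse / (n - 2))     # standard error of the regression
se_b1 = s_e / math.sqrt(sxx)       # standard error of the slope
t_from_slope = b1 / se_b1

print(round(t_from_r, 4), round(t_from_slope, 4))
```

The two printed values agree up to floating-point error, which is why software reports a single t and p-value for both questions.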

Problem 15: Checking Conditions

Before conducting regression inference, you must check the LINE conditions. For each condition, describe how you would check it:

a) Linearity

b) Independence

c) Normality

d) Equal variance

Solution:

a) Linearity:

  • Check scatter plot: Points should follow a roughly straight-line pattern (not curved)
  • Check residual plot: Should show random scatter with no systematic curved pattern
  • If violated: Consider non-linear models or variable transformations

b) Independence:

  • Check data collection method: Random sampling? No time-based ordering?
  • Observations should not depend on each other
  • No time series, no spatial dependence, no repeated measures
  • This is often judged from the study design, not from plots

c) Normality:

  • Histogram of residuals: Should look roughly bell-shaped/normal
  • Normal probability plot: Should be roughly linear
  • Less critical for large samples (n ≥ 30) due to Central Limit Theorem
  • If violated: May still be OK for large n; otherwise consider transformations

d) Equal variance (Homoscedasticity):

  • Residual plot: Vertical spread of residuals should be roughly constant across all x values
  • Look for "fan shapes" or other patterns in spread
  • If violated: Consider transforming y (e.g., log transformation)

Part 4: Confidence and Prediction Intervals (Problems 16-20)

Covers Lesson 4 content: CIs for slope and mean response, prediction intervals

Problem 16: Confidence Interval for Slope

A study of 40 adults finds the regression equation for predicting weight (kg) from height (cm):

  • Slope: b₁ = 0.85 kg/cm
  • Standard error of the slope: SE(b₁) = 0.12
  • Critical value for 95% confidence (df = 38): t* = 2.024

a) Construct a 95% confidence interval for β₁.

b) Interpret the interval in context.

c) Does the interval support the conclusion that height is a significant predictor? Explain.

Solution:

a) 95% CI for β₁:

CI = b₁ ± t* × SE(b₁)

CI = 0.85 ± 2.024 × 0.12

CI = 0.85 ± 0.243

Lower: 0.85 - 0.243 = 0.607

Upper: 0.85 + 0.243 = 1.093

95% CI: (0.607, 1.093) kg/cm

b) Interpretation:

We are 95% confident that for each additional centimeter of height, weight increases by between 0.607 and 1.093 kilograms on average.

c) Significance:

Yes, the interval supports significance because it does not contain 0. Since 0 is not in the interval (0.607, 1.093), we can conclude that height is a significant predictor of weight at α = 0.05. If we tested H₀: β₁ = 0, we would reject it.
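
The interval and the significance check in part (c) can be reproduced with a small helper. This is a sketch; the function name `slope_ci` is illustrative, and the inputs are the values used in the solution above.

```python
def slope_ci(b1, se_b1, t_star):
    """Confidence interval for the slope: b1 ± t* × SE(b1)."""
    margin = t_star * se_b1
    return b1 - margin, b1 + margin

lower, upper = slope_ci(b1=0.85, se_b1=0.12, t_star=2.024)
contains_zero = lower <= 0 <= upper   # if False, H0: beta1 = 0 is rejected at alpha = 0.05

print(round(lower, 3), round(upper, 3), contains_zero)
```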

Problem 17: CI vs PI Comparison

For predicting house price from square footage at x* = 2000 sq ft:

  • 95% confidence interval for the mean price: ($275,000, $295,000)
  • 95% prediction interval: ($250,000, $320,000)

a) Interpret the confidence interval.

b) Interpret the prediction interval.

c) Why is the prediction interval wider?

d) Which interval would a home buyer use? Which would a real estate analyst use?

Solution:

a) Confidence interval interpretation:

We are 95% confident that the average price of all 2000-square-foot houses is between $275,000 and $295,000.

b) Prediction interval interpretation:

We are 95% confident that a single, specific 2000-square-foot house will be priced between $250,000 and $320,000.

c) Why is PI wider?

The prediction interval is wider because it accounts for two sources of uncertainty:

  • Uncertainty in estimating the mean (same as CI)
  • Individual variation around the mean (additional uncertainty)

Individual houses vary in price due to condition, location, features, etc. Predicting for one specific house is harder than estimating an average.

d) Who uses which?

  • Home buyer: Use prediction interval ($250,000-$320,000). They want to know what this specific house might sell for.
  • Real estate analyst: Use confidence interval ($275,000-$295,000). They want to know the average market price for houses of this size for policy, appraisal standards, or market analysis.
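
The two sources of uncertainty in part (c) show up directly in the standard error formulas. The sketch below uses the usual simple-regression formulas with an invented regression summary: every number in it (n, sₑ, x̄, Σ(x - x̄)², t*) is a hypothetical value chosen for illustration, not the data behind this problem.

```python
import math

# Hypothetical regression summary, invented for illustration (prices in $1000s)
n = 50               # number of houses
s_e = 17.0           # standard error of the regression
x_bar = 1900.0       # mean square footage in the sample
sxx = 5_000_000.0    # sum of squared deviations of x
t_star = 2.011       # approximate 95% critical value for df = 48
x_star = 2000.0      # square footage of interest

# Uncertainty in the estimated mean response at x_star
mean_term = 1 / n + (x_star - x_bar) ** 2 / sxx
se_mean = s_e * math.sqrt(mean_term)

# A prediction adds individual variation around the mean (the extra "1 +")
se_pred = s_e * math.sqrt(1 + mean_term)

ci_margin = t_star * se_mean
pi_margin = t_star * se_pred
print(round(ci_margin, 2), round(pi_margin, 2))
```

The extra 1 inside the square root is the individual variation term. It does not shrink as n grows, so the prediction interval stays wide even when the mean is estimated very precisely.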

Problem 18: Interpreting Intervals

A 90% confidence interval for the slope relating study hours to test scores is (3.2, 6.8) points per hour.

a) What does this interval tell us?

b) Would a hypothesis test of H₀: β₁ = 0 at α = 0.10 reject the null? How do you know?

c) Is β₁ = 5 a plausible value? Explain.

d) Is β₁ = 8 a plausible value? Explain.

Solution:

a) What it tells us:

We are 90% confident that for each additional hour of study, test scores increase by between 3.2 and 6.8 points on average. This gives us a range of plausible values for the true slope.

b) Hypothesis test decision:

Yes, we would reject H₀: β₁ = 0 at α = 0.10.

Reason: The confidence interval (3.2, 6.8) does NOT contain 0. Since 0 is not a plausible value for β₁, we have evidence that the slope is significantly different from 0. There is a significant relationship.

c) β₁ = 5 plausible?

Yes, β₁ = 5 is plausible because 5 is inside the interval (3.2, 6.8). We cannot rule out 5 as the true value of the slope.

d) β₁ = 8 plausible?

No, β₁ = 8 is not plausible because 8 is outside the interval (3.2, 6.8). If we tested H₀: β₁ = 8, we would reject it at α = 0.10. The data suggest the slope is smaller than 8.
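
Parts (b) to (d) all apply the same rule: a two-sided test at α = 0.10 rejects a hypothesized slope exactly when that value falls outside the 90% confidence interval. A minimal sketch of that rule, with an illustrative helper name:

```python
lower, upper = 3.2, 6.8            # 90% CI for the slope, from the problem

# The point estimate and margin of error can be recovered from the endpoints
b1 = (lower + upper) / 2
margin = (upper - lower) / 2

def is_plausible(beta_null, lower, upper):
    """A hypothesized slope is plausible if it lies inside the confidence interval."""
    return lower <= beta_null <= upper

for beta_null in (0, 5, 8):
    verdict = "fail to reject" if is_plausible(beta_null, lower, upper) else "reject"
    print(f"H0: beta1 = {beta_null} -> {verdict} at alpha = 0.10")
```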

Problem 19: Choosing the Right Interval

For each scenario, state whether you would use a confidence interval for mean response or a prediction interval:

a) A hospital wants to estimate the average blood pressure for all 60-year-old patients.

b) A doctor wants to predict the blood pressure of a specific 60-year-old patient.

c) A car dealer wants to know the typical resale value of 5-year-old sedans.

d) A car owner wants to know what their specific 5-year-old sedan will sell for.

Solution:

a) Average blood pressure for all 60-year-olds:

Confidence interval for mean response. The hospital wants to estimate a population average, not predict for an individual.

b) Specific patient's blood pressure:

Prediction interval. The doctor wants to predict the value for one specific individual, which has more uncertainty due to individual variation.

c) Typical resale value of 5-year-old sedans:

Confidence interval for mean response. The dealer wants to know the average/typical market price, not the price of one specific car.

d) This specific car's selling price:

Prediction interval. The owner wants to predict what this particular car will sell for, accounting for its specific condition, mileage, and features.

General rule:

  • Estimating an average/mean: Use confidence interval (narrower)
  • Predicting an individual: Use prediction interval (wider)

Problem 20: Comprehensive Problem

A study of 35 students examined the relationship between hours of sleep (x) and reaction time in milliseconds (y). Results:

  • Regression equation: ŷ = 450 - 25x
  • r² = 0.42
  • SE(b₁) = 8

a) Interpret the slope.

b) Interpret r².

c) Predict reaction time for a student who sleeps 7 hours.

d) Test H₀: β₁ = 0 at α = 0.05 (critical value ≈ ±2.03 for df = 33).

e) Construct a 95% CI for β₁ using t* = 2.03.

Solution:

a) Slope interpretation:

For each additional hour of sleep, predicted reaction time decreases by 25 milliseconds on average. More sleep is associated with faster reactions (lower reaction time).

b) r² interpretation:

r² = 0.42 means 42% of the variation in reaction time is explained by hours of sleep. The remaining 58% is due to other factors (caffeine, age, stress, natural ability, etc.).

c) Prediction:

ŷ = 450 - 25(7) = 450 - 175 = 275 milliseconds

d) Hypothesis test:

Hypotheses:

  • H₀: β₁ = 0 (No relationship)
  • Hₐ: β₁ ≠ 0 (Relationship exists)

Test statistic:

t = (b₁ - 0) / SE(b₁) = (-25 - 0) / 8 = -3.125

Decision: |t| = 3.125 > 2.03, so reject H₀

Conclusion: There is sufficient evidence that hours of sleep is a significant predictor of reaction time (t = -3.125, p < 0.05).

e) 95% CI for β₁:

CI = b₁ ± t* × SE(b₁)

CI = -25 ± 2.03 × 8

CI = -25 ± 16.24

Lower: -25 - 16.24 = -41.24

Upper: -25 + 16.24 = -8.76

95% CI: (-41.24, -8.76) ms/hour

Interpretation: We are 95% confident that for each additional hour of sleep, reaction time decreases by between 8.76 and 41.24 milliseconds on average. Note: Both bounds are negative, confirming the significant negative relationship.
