Lesson 18: Cautions about Correlation and Regression
October 31, 2024
Review:
- Tuesday activity data
- Pearson Correlation Coefficient
- R-Squared
- Schedule
- Week of Nov 4 – Residuals (and maybe some election data)
- Week of Nov 11 – Review and Exam 3 (Thu Nov 14)
- Week of Nov 18 – Return Exam 3, portfolio check and Make Up Exam (Thu Nov 21)
- Week of Nov 25 – Thanksgiving
- Week of Dec 2 – Review for Final Exam (Tue Dec 10 at 10:30)
Presentation:
- Cautions about Correlation and Regression
- Influential Observations
- Outliers
- Diabetes and blood sugar (pp. 129-130)
- Use Tuesday activity data
- Lurking Variables
- Explanatory variables not included
- Beyond scope
- Influential Observations
- The Question of Causation
- Correlation does not imply causation
- Spurious Correlations
- Retrospective study = looking back to find possible causes of an established outcome in a sample population
- Prospective study = following a sample population over time and studying behaviors possibly linked to the likelihood of an outcome
- Video
- Extrapolation
- Using a regression equation to predict at values far outside the observed range of the explanatory variable
- Predictions outside the observed range are unreliable
- The farther outside the range, the less reliable the prediction
- Sensitivity Analysis Demonstration
- Remove influential observations and recalculate regression equation and R-Squared
- Use glucose level data
- The Question of Causation
Presidential Election Year | Time Period | Ballots Cast (thousands) |
2004 | 1 | 68.4 |
2008 | 2 | 73.9 |
2012 | 3 | 77.7 |
2016 | 4 | 78.7 |
- Produce a scatter plot with Time Period on the x axis and Ballots Cast on the y axis
- Find the linear regression equation and the corresponding R-Squared value
- Remove the 2004 data
- Recalculate the linear regression equation and the corresponding R-Squared value
- Remove the 2016 data
- Recalculate the linear regression equation and the corresponding R-Squared value
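The recalculation steps above can be sketched in Python. This is only a minimal illustration, not necessarily the tool used in class (a calculator or spreadsheet gives the same numbers). The helper names `fit_line` and `fit_without` are hypothetical, and the sketch assumes each removal is made from the full four-point data set rather than cumulatively.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x. Returns (intercept, slope, r_squared)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)  # square of the Pearson correlation
    return intercept, slope, r_squared


# Rows from the table above: (year, time period, ballots cast in thousands)
data = [(2004, 1, 68.4), (2008, 2, 73.9), (2012, 3, 77.7), (2016, 4, 78.7)]


def fit_without(excluded_year=None):
    """Fit the line after dropping one election year (assumption: one at a time)."""
    rows = [row for row in data if row[0] != excluded_year]
    return fit_line([row[1] for row in rows], [row[2] for row in rows])


cases = [("all years", None), ("without 2004", 2004), ("without 2016", 2016)]
results = {label: fit_without(year) for label, year in cases}

for label, (a, b, r2) in results.items():
    print(f"{label}: y = {a:.2f} + {b:.2f}x, R-squared = {r2:.3f}")
```

Comparing the three lines shows how much a single point at either end of the time range can move the slope and R-Squared when there are only four observations.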
Activity:
Price of Coffee ($ per pound) | Deforestation (%) |
0.29 | 0.49 |
0.40 | 1.59 |
0.54 | 1.69 |
0.55 | 1.82 |
0.72 | 3.10 |
- Produce a scatter plot with Price of Coffee on the x axis and Deforestation on the y axis
- Find the linear regression equation and the corresponding R-Squared value
- Estimate Deforestation % assuming Price of Coffee is $0.90 per pound.
- Sensitivity Analysis
- Identify and remove the most influential observation (use scatter plot)
- Recalculate the linear regression equation and the corresponding R-Squared value
- Recalculate Deforestation % assuming Price of Coffee is $0.90 per pound.
- Go back to Lesson 15 and use the data from the Activity section
- Follow the same process, i.e., look for outliers, remove them, and recalculate.
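For the Price of Coffee data, the fit, the estimate at $0.90, and the sensitivity analysis could be checked with a sketch like the following. The activity asks for the most influential observation to be picked from the scatter plot; as a stand-in, this sketch assumes "most influential" means the point whose removal changes the slope the most, which is only one possible criterion. The helper names `fit_line` and `predict` are hypothetical. Note that $0.90 lies outside the observed price range (0.29 to 0.72), so the estimate is an extrapolation.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x. Returns (intercept, slope, r_squared)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


def predict(fit, x):
    """Evaluate the fitted line at x."""
    intercept, slope, _ = fit
    return intercept + slope * x


# Data from the activity table
price = [0.29, 0.40, 0.54, 0.55, 0.72]          # $ per pound
deforestation = [0.49, 1.59, 1.69, 1.82, 3.10]  # %

full = fit_line(price, deforestation)
print(f"all points: y = {full[0]:.3f} + {full[1]:.3f}x, "
      f"R-squared = {full[2]:.3f}, estimate at 0.90 = {predict(full, 0.90):.2f}")

# Leave-one-out refits: drop each observation in turn and refit
leave_one_out = []
for i in range(len(price)):
    xs = price[:i] + price[i + 1:]
    ys = deforestation[:i] + deforestation[i + 1:]
    fit = fit_line(xs, ys)
    leave_one_out.append((i, fit))
    print(f"without ({price[i]}, {deforestation[i]}): "
          f"y = {fit[0]:.3f} + {fit[1]:.3f}x, R-squared = {fit[2]:.3f}, "
          f"estimate at 0.90 = {predict(fit, 0.90):.2f}")

# Assumed criterion: largest absolute change in slope relative to the full fit
most_influential = max(leave_one_out, key=lambda item: abs(item[1][1] - full[1]))[0]
print("index of most influential point (by slope change):", most_influential)
```

The same functions can be reused on the Lesson 15 activity data by replacing the two lists.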