Data Analysis: How to Interpret Regression Summary Plots (in R)
I have always wanted a reference guide for interpreting the 4 regression plots R produces when you call `plot(lm(...))`, so here it is!
- Fitted, $\hat{y_{i}}$
- Residuals, $e_{i} = \text{observed}_{i} - \text{fitted}_{i} = y_{i} - \hat{y_{i}}$
- Standardized Residuals, $r_{i}$: residual divided by an estimate of its standard deviation, $r_{i} = \frac{e_{i}}{\sqrt{MSE \cdot (1 - h_{ii})}}$
- Leverage, $h_{ii}$: "leverage of the ith data point", the ith diagonal entry of the hat matrix. It measures how far the observation's predictor values sit from those of the other observations, and hence how much potential it has to pull the fitted coefficients toward itself. Not fully defining it here; it can be accessed for a fitted model in R via the function `hatvalues`. For more info see [1]
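To make the definitions above concrete, here is a minimal sketch that computes each quantity by hand and compares it with R's built-in accessors. The `mtcars` dataset and the formula are illustrative assumptions, not part of the original guide; `MSE` is taken as the residual sum of squares divided by the residual degrees of freedom.

```r
# Assumption: mtcars and this formula are illustrative only
fit <- lm(mpg ~ wt + hp, data = mtcars)

y_hat <- fitted(fit)                    # fitted values
e     <- mtcars$mpg - y_hat             # residuals: observed - fitted
h     <- hatvalues(fit)                 # leverages h_ii
mse   <- sum(e^2) / df.residual(fit)    # MSE = RSS / (n - p)
r     <- e / sqrt(mse * (1 - h))        # standardized residuals

# Compare the hand-computed values with R's own accessors
same_residuals    <- isTRUE(all.equal(e, residuals(fit)))
same_standardized <- isTRUE(all.equal(r, rstandard(fit)))
print(c(residuals = same_residuals, standardized = same_standardized))
```

Both comparisons should come back `TRUE`, which confirms that `rstandard` implements the formula for $r_{i}$ given above.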
Residuals vs Fitted
- X Axis: Fitted Values, $\hat{y_{i}}$
- Y Axis: Residuals, $e_{i}$
- What it Shows / How to Use:
  - Shows whether linearity holds (i.e. the mean of the residuals is ~ 0 across the whole x axis): visually, the red smoothing line stays near the dashed line at 0
  - Shows whether homoskedasticity holds (the spread of the residuals should be approximately the same across the whole x axis)
  - Shows if there are outliers (labeled as numbered points). R labels the three most extreme residuals by default, so a label alone does not make a point an outlier
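The checks above can be reproduced by hand, which helps show what the plot is made of. This is a rough sketch under stated assumptions: the built-in `cars` dataset stands in for real data, and the red line is approximated with `lowess`, which to my understanding is the smoother `plot.lm` uses by default.

```r
# Assumption: cars and this formula are illustrative only
fit <- lm(dist ~ speed, data = cars)

y_hat <- fitted(fit)
e     <- residuals(fit)

plot(y_hat, e, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted (manual)")
abline(h = 0, lty = 3, col = "gray")   # reference line at zero
smooth <- lowess(y_hat, e)             # smoothed mean of the residuals
lines(smooth, col = "red")

# Rough numeric companion to the visual check:
# mean residual in the lower vs. upper half of the fitted values
half <- ifelse(y_hat <= median(y_hat), "lower", "upper")
print(tapply(e, half, mean))
```

If linearity holds, both group means should sit close to zero relative to the overall spread of the residuals.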
Example 1: Visual Pass vs. Fail (source: [2])
Example 2: Seemingly quadratic pattern w/ some pretty extreme outliers (source: [3])
library(mlbench)
data(BostonHousing)
plot(lm(medv ~ crim + rm + tax + lstat, data = BostonHousing))
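By default `plot` on an `lm` object shows the four panels one after another. A small sketch of two common alternatives: laying all four out in a grid, or requesting a single panel with the `which` argument. The `mtcars` model is an illustrative assumption; the panel numbers follow the `plot.lm` help page.

```r
# Assumption: mtcars and this formula are illustrative only
fit <- lm(mpg ~ wt + hp, data = mtcars)

# All four default panels in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)            # restore the previous layout

# One panel at a time
plot(fit, which = 1)    # Residuals vs Fitted
plot(fit, which = 2)    # Normal Q-Q
plot(fit, which = 3)    # Scale-Location
plot(fit, which = 5)    # Residuals vs Leverage
```

Note that panel 4 is a separate Cook's distance plot, which is why the default set skips from 3 to 5.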
Normal Q-Q
- X Axis: Theoretical Quantiles: quantiles of the standard normal distribution (normality is assumed here; a Q-Q plot can be drawn against other distributions too)
- Y Axis: Standardized Residuals: the sample values sorted in ascending order, which serve as the sample quantiles
- What it Shows / How to Use:
  - Reference Line: this is what we compare the scatterplot against. The points follow the line if the Y Axis and X Axis quantiles come from the same distribution. R draws the line through the first and third quartiles, which is close to Y = X when the standardized residuals are roughly normal
Likely Normal (Pass) vs. Likely Non-Normal (Fail) (source: [2])
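The construction of the Q-Q plot can be sketched in a few lines: sort the standardized residuals and pair them with normal quantiles at evenly spaced probabilities. The `mtcars` model is an illustrative assumption; `ppoints` is the base R helper that `qqnorm` uses to choose the probabilities.

```r
# Assumption: mtcars and this formula are illustrative only
fit <- lm(mpg ~ wt + hp, data = mtcars)
r   <- rstandard(fit)
n   <- length(r)

sample_q <- sort(r)                # Y axis: sorted standardized residuals
theo_q   <- qnorm(ppoints(n))      # X axis: theoretical normal quantiles

plot(theo_q, sample_q,
     xlab = "Theoretical Quantiles", ylab = "Standardized Residuals",
     main = "Normal Q-Q (manual)")
qqline(r, lty = 3)                 # line through the 1st and 3rd quartiles

# The manual X values match what qqnorm computes
builtin     <- qqnorm(r, plot.it = FALSE)
same_points <- isTRUE(all.equal(unname(sort(builtin$x)), theo_q))
print(same_points)
```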
Scale-Location
- X Axis: Fitted Values, $\hat{y_{i}}$
- Y Axis: Square Root of Absolute Standardized Residuals, $\sqrt{|r_{i}|}$
- What it Shows / How to Use:
  - Verify the red line is ~ horizontal (shows homoskedasticity; use the Breusch-Pagan test below if needed)
  - Verify there is no clear pattern among the residuals: they should be randomly scattered around the red line with equal variability at all fitted values
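As a sketch of what goes into this plot, the Y values are the square roots of the absolute standardized residuals, with a smoother drawn through them. The `cars` model is an illustrative assumption, and `lowess` is used here as an approximation of the red line.

```r
# Assumption: cars and this formula are illustrative only
fit <- lm(dist ~ speed, data = cars)

y_hat <- fitted(fit)
s     <- sqrt(abs(rstandard(fit)))   # sqrt(|r_i|)

plot(y_hat, s, xlab = "Fitted values",
     ylab = "sqrt(|Standardized residuals|)",
     main = "Scale-Location (manual)")
lines(lowess(y_hat, s), col = "red") # should be roughly flat under homoskedasticity
```

Taking the absolute value folds negative residuals upward, so a trend in the red line reflects a change in spread rather than a change in sign.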
Example 1: Visual Pass vs. Fail (source: [2])
Example 2: Boston dataset from above (source: [4])
We can take the Boston Housing model from Example 2 and formally test homoskedasticity with a Breusch-Pagan test.
(Here the null hypothesis is homoskedasticity, and with p ~ 0 we reject it.)
library(lmtest)  # provides bptest()
model <- lm(medv ~ crim + rm + tax + lstat, data = BostonHousing)  # same model as plotted earlier
bptest(model)    # studentized Breusch-Pagan test by default
studentized Breusch-Pagan test
data: model
BP = 30.934, df = 4, p-value = 3.158e-06
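For intuition about what the test measures, here is a base R sketch of the studentized (Koenker) form of the statistic: regress the squared residuals on the same predictors and multiply the resulting $R^2$ by $n$. This is my understanding of what `bptest` computes with its default settings, so treat it as an illustration rather than a replacement. The `mtcars` model is an illustrative assumption.

```r
# Assumption: mtcars and this formula are illustrative only
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- cbind(mtcars, e2 = residuals(fit)^2)
aux      <- lm(e2 ~ wt + hp, data = aux_data)

n       <- nobs(fit)
k       <- length(coef(aux)) - 1             # predictors, excluding the intercept
bp_stat <- n * summary(aux)$r.squared        # studentized BP statistic
p_value <- pchisq(bp_stat, df = k, lower.tail = FALSE)

print(c(BP = bp_stat, df = k, p.value = p_value))
```

A small p-value means the squared residuals are predictable from the regressors, which is evidence against constant variance.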
Residuals vs Leverage
- X Axis: Leverage, $h_{ii}$
- Y Axis: Standardized Residuals, $r_{i}$
- What it Shows / How to Use:
  - Lets us identify influential observations in the regression model
  - If any point falls outside a Cook's distance contour (red dashed lines, typically labeled 0.5, 1.0, ...), then consider that an influential observation and investigate it further
Visual Pass vs. Fail (source: [2])
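The contours in this plot come from Cook's distance, which combines the two axes: $D_{i} = \frac{r_{i}^{2}}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$, where $p$ is the number of fitted coefficients. A minimal sketch that computes it by hand and lists points beyond the 0.5 contour; the `mtcars` model and the 0.5 cutoff are illustrative assumptions (the cutoff is a rule of thumb, not a formal test).

```r
# Assumption: mtcars and this formula are illustrative only
fit <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(fit)        # leverage (X axis)
r <- rstandard(fit)        # standardized residuals (Y axis)
p <- length(coef(fit))     # number of coefficients, including the intercept

cooks_manual <- (r^2 / p) * h / (1 - h)
matches <- isTRUE(all.equal(cooks_manual, cooks.distance(fit)))
print(matches)

# Rule-of-thumb cutoff mirroring the first contour on the plot
threshold   <- 0.5
influential <- names(cooks_manual)[cooks_manual > threshold]
print(influential)
```

Because $D_{i}$ grows with both a large residual and high leverage, a point needs to be extreme on at least one axis, and usually notable on both, to cross a contour.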
[1] Penn State STAT 462: Applied Regression Analysis, "9.2 - Using Leverages to Help Identify Extreme X Values", https://online.stat.psu.edu/stat462/node/171/, accessed 2022-09-18
[2] University of Virginia Library Website, "Understanding Diagnostic Plots for Linear Regression Analysis", https://data.library.virginia.edu/diagnostic-plots/, accessed 2022-09-18
[3] Moreno, Alexander; Boosted ML: Articles on Statistics and Machine Learning for Healthcare, "Linear Regression Plots: Fitted vs Residuals", https://boostedml.com/2019/03/linear-regression-plots-fitted-vs-residuals.html, accessed 2022-09-18
[4] Moreno, Alexander; Boosted ML: Articles on Statistics and Machine Learning for Healthcare, "The Scale Location Plot: Interpretation in R", https://boostedml.com/2019/03/linear-regression-plots-scale-location-plot.html, accessed 2022-09-18