Recall that there are three basic assumptions about the random deviations
(errors): the random deviations are independent, normally distributed, and have a
constant variance. In simple linear regression, we also assume that Y
and X are linearly related. We shall consider the use of residual
plots for examining the following types of departures from the assumed
model.
- The regression function is not linear.
- The error terms do not have a constant variance.
- The model fits all but one or a few outlying observations.
- The error terms are not normally distributed.
- The error terms are not independent.
The common graphical tools for assumption checking include:
- Residual Plot - scatter plot of the residuals against X or the
fitted value.
- Absolute Residual Plot - scatter plot of the absolute values of the
residuals against X or the fitted value.
- Normal Probability Plot of the Residuals.
- Time Series Plot of the Residuals - scatter plot of the residuals against
time or index.
The time series plot of the residuals is strongly recommended
whenever data are obtained in a time sequence. The purpose is to see whether
there is any correlation between the error terms over time (that is, whether
the error terms are not independent). When the error terms are independent, we
expect the residuals to fluctuate in a more or less random pattern
around the base line 0.
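The quantities behind the four plots listed above can be computed directly. The following is a minimal sketch in Python using only the standard library. The data are synthetic, and the function names (`fit_line`, `residuals`, `normal_plot_points`, `lag1_autocorrelation`) as well as the plotting positions (i - 0.5)/n are choices made for this illustration, not something prescribed by the text.

```python
import random
from statistics import NormalDist, mean


def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def residuals(x, y):
    """Observed y minus the fitted value from the least squares line."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def normal_plot_points(res):
    """(theoretical normal quantile, sorted residual) pairs.

    Uses plotting positions (i - 0.5) / n, one common convention.
    """
    n = len(res)
    std_normal = NormalDist()
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, sorted(res)))


def lag1_autocorrelation(res):
    """Correlation between successive residuals, in the order given."""
    rbar = mean(res)
    num = sum((a - rbar) * (b - rbar) for a, b in zip(res, res[1:]))
    den = sum((a - rbar) ** 2 for a in res)
    return num / den


# Synthetic data (not from the text): Y = 1 + X + e, independent normal errors.
rng = random.Random(1)
x = [0.5 * i for i in range(40)]
y = [1 + xi + rng.gauss(0, 1) for xi in x]

res = residuals(x, y)              # plot against x: residual plot
abs_res = [abs(r) for r in res]    # plot against x: absolute residual plot
qq = normal_plot_points(res)       # plot the pairs: normal probability plot
r1 = lag1_autocorrelation(res)     # numeric companion to the time series plot
```

Plotting `res` and `abs_res` against `x`, the pairs in `qq`, and `res` against its index gives the four displays. A lag-1 autocorrelation far from 0 would point in the same direction as a visible pattern in the time series plot.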
In the following example, since the observations are from independent
individuals, we will just use the first three plots to do assumption
checking.
EXAMPLE: A cardiology data set was
collected by the University of Virginia School of Medicine. Two variables
were examined, aortic valve area (AVA) and body surface area (BSA).
Physiologically, as children grow, the intracardiac areas also grow. A
linear model which relates AVA with BSA, a proxy for physiological growth,
has been widely accepted in medical science (see Gutgesell and Rembold,
1990, and its references). The top left, top right, bottom left, and bottom
right plots in Figure (AVA vs. BSA) are, respectively, the scatter plot of
AVA vs. BSA with the fitted regression line, the normal probability plot of
the residuals, the residual plot with the base line 0, and the absolute
residual plot. We can see that (1) AVA and BSA seem to be linearly related
and (2) the error variance increases with BSA. Point (2) is even more obvious
in the absolute residual plot. Since we now know that the error terms do not
have a common variance, the normal probability plot does not provide much
information here. We just note that, in general, points that do not follow a
straight line in this plot indicate that the errors are not normal.
In Figure (log[AVA] vs. BSA), we make the same plots but replace
AVA by log[AVA]. The scatter plot shows that log[AVA] is not linearly related
to BSA. The residual plot also shows a systematic pattern (curvature), which
indicates a departure from a linear model.
We now log-transform both AVA and BSA and obtain the plots in Figure (log[AVA]
vs. log[BSA]). We now see that the model fits all but one observation, in the
bottom left corner. The residuals fluctuate in a more or less random pattern
around the base line 0. Also, apart from one point, the points in
the normal probability plot roughly follow a straight line.
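To see why transforming both variables can help, here is a minimal sketch on synthetic data. These are NOT the AVA and BSA measurements: we assume, purely for illustration, a power-law relation with multiplicative error, y = 2 x^1.4 exp(e). The helper `spread_ratio` is a hypothetical summary introduced here; it compares the average absolute residual in the upper half of the x values with that in the lower half, so a value well above 1 suggests an error variance that grows with x.

```python
import math
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def spread_ratio(x, y):
    """Mean |residual| for the larger-x half over that for the smaller-x half."""
    b0, b1 = fit_line(x, y)
    pairs = sorted(zip(x, y))  # order observations by x
    abs_res = [abs(yi - (b0 + b1 * xi)) for xi, yi in pairs]
    half = len(abs_res) // 2
    return mean(abs_res[half:]) / mean(abs_res[:half])


# Synthetic data under the assumed model y = 2 * x^1.4 * exp(e), e ~ N(0, 0.2^2).
rng = random.Random(0)
x = [0.2 + 1.8 * i / 199 for i in range(200)]
y = [2.0 * xi ** 1.4 * math.exp(rng.gauss(0, 0.2)) for xi in x]

log_x = [math.log(xi) for xi in x]
log_y = [math.log(yi) for yi in y]

ratio_raw = spread_ratio(x, y)              # y regressed on x
ratio_loglog = spread_ratio(log_x, log_y)   # log y regressed on log x
intercept, slope = fit_line(log_x, log_y)   # slope estimates the exponent

print(f"spread ratio, raw scale:     {ratio_raw:.2f}")
print(f"spread ratio, log-log scale: {ratio_loglog:.2f}")
print(f"log-log slope:               {slope:.2f}")
```

Under this assumed model the errors are additive with constant variance on the log-log scale, so the spread ratio there should be near 1, while on the raw scale it should be clearly larger.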
Computer Lab
The goal of this lab is to learn how to use StataQuest to do data
analysis and how to use the diagnostic plots provided by StataQuest to check
the model assumptions. The regression analysis uses the menu commands
Statistics > Simple Regression. StataQuest provides many interesting
diagnostic plots. To understand them, we can use the following
example:
First open the data file reg.dta under the datagen
directory. There are several variables in the file:
- X: the independent variable.
- e: the deviation, which contains a random sample from a normal
distribution.
- Y: Y = 1 + X + e.
- […]: […]
- […]: […]
- […]: equal to Y, except that the observation corresponding to max(X) is
replaced by the original Y value plus 20.
- […]: equal to Y, except that the observation with the X value closest to
[…] is replaced by the original Y value plus 20.
You may look at the normal quantile plot of e and the scatter plot of
e versus X, and then compare them with the diagnostic plots
provided by StataQuest after you regress Y on X. You may also
want to get the diagnostic plots after regressing […] and […] on X,
respectively. What do you learn from the plots? Now compare the LS lines for
Y, […], […] vs. X.
What kind of observation has more potential to be influential?
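For readers without StataQuest, the comparison of LS lines can be sketched in Python with the standard library. The sketch follows the lab's recipe Y = 1 + X + e, but the X grid, error scale, and seed are choices made here, and the data are not those in reg.dta. The text does not survive at the point where it names the X value for the second modified variable, so we assume, only for illustration, that it is the X value closest to the sample mean of X.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def shifted(y, index, amount=20.0):
    """Copy of y with `amount` added to the observation at `index`."""
    out = list(y)
    out[index] += amount
    return out


# Synthetic data following Y = 1 + X + e (grid, scale, and seed are assumptions).
rng = random.Random(2)
x = [float(i) for i in range(21)]
e = [rng.gauss(0, 1) for _ in x]
y = [1 + xi + ei for xi, ei in zip(x, e)]

i_max = x.index(max(x))  # observation at max(X)
xbar = mean(x)
# Assumption of this sketch: the interior observation is the one closest to mean(X).
i_mid = min(range(len(x)), key=lambda i: abs(x[i] - xbar))

base = fit_line(x, y)
at_max = fit_line(x, shifted(y, i_max))
at_mid = fit_line(x, shifted(y, i_mid))

slope_change_at_max = at_max[1] - base[1]
slope_change_at_mid = at_mid[1] - base[1]

print(f"original line:      intercept {base[0]:.3f}, slope {base[1]:.3f}")
print(f"outlier at max(X):  intercept {at_max[0]:.3f}, slope {at_max[1]:.3f}")
print(f"outlier near mean:  intercept {at_mid[0]:.3f}, slope {at_mid[1]:.3f}")
```

Because the LS slope is linear in the Y values, adding 20 to the observation at index k changes the slope by 20 (x_k - mean(X)) / Sxx, where Sxx is the sum of squared deviations of X from its mean. The same shift therefore tilts the line when applied far from mean(X) and only raises it when applied near mean(X).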
Concept Lab
Ch 6: Bivariate Descriptive Statistics - Least Squares