
### Inferences for Correlation

Here's what you'll find in this section:

A student asked,

Of the men who are 68 inches tall, what percentage have forearms which are 18 inches long, to the nearest inch? We are given the following:

Avg height is 68 inches, SD=2.7 in
Avg forearm is 18 inches, SD = 1 in
r = 0.80

We are allowed to assume the distributions of the two random variables are normal.
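One way to approach the student's question, sketched under two assumptions that the question itself does not spell out: the pair (height, forearm length) is bivariate normal, and "18 inches, to the nearest inch" means a forearm between 17.5 and 18.5 inches. Under bivariate normality, forearm length given a height x is normal with mean `mean_f + r * (sd_f / sd_h) * (x - mean_h)` and standard deviation `sd_f * sqrt(1 - r**2)`. The variable names below are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values given in the question
mean_h, sd_h = 68.0, 2.7   # height: mean and SD, in inches
mean_f, sd_f = 18.0, 1.0   # forearm: mean and SD, in inches
r = 0.80                   # correlation between height and forearm length
height = 68.0              # the height we condition on

# Conditional distribution of forearm length given the height
# (valid under the bivariate normal assumption)
cond_mean = mean_f + r * (sd_f / sd_h) * (height - mean_h)
cond_sd = sd_f * sqrt(1 - r**2)
cond = NormalDist(cond_mean, cond_sd)

# Assumption: "18 inches to the nearest inch" means 17.5 <= forearm < 18.5
pct = 100 * (cond.cdf(18.5) - cond.cdf(17.5))
print(f"conditional mean = {cond_mean:.2f} in, conditional SD = {cond_sd:.2f} in")
print(f"percentage with 18-inch forearms: {pct:.1f}%")
```

Because the chosen height equals the average height, the conditional mean stays at 18 inches, and only the spread shrinks (from 1 inch to 0.6 inch). Under these assumptions the answer comes out a little under 60 percent.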

• # Inferences for Population Correlation Coefficient

Recall that the population correlation coefficient ρ can be estimated by the sample correlation coefficient r, where \[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \] Assuming the pair (X, Y) has a bivariate normal distribution and using this estimate, we can find a confidence interval for ρ as well as test for dependence between X and Y.
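The sample correlation coefficient can be computed directly from the three sums of squares and cross-products. The following is a minimal sketch; the function name `sample_correlation` and the five data pairs are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt

def sample_correlation(xs, ys):
    """Return r = S_xy / sqrt(S_xx * S_yy) for paired observations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squares and cross-products about the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical heights and forearm lengths (inches), for illustration only
heights = [64.0, 66.0, 68.0, 70.0, 72.0]
forearms = [16.5, 17.8, 17.9, 18.6, 19.4]
r = sample_correlation(heights, forearms)
print(f"r = {r:.3f}")
```

A perfectly linear increasing relationship gives r = 1 and a perfectly linear decreasing one gives r = -1, which is a quick sanity check on any implementation.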

(Summary table: Hypothesis, Statistic, Interval.)
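For reference, the usual textbook forms of the hypothesis, test statistic, and confidence interval for ρ under bivariate normality are sketched below. These are the standard results, not a transcription of this page's own table; the sample size n, the t distribution with n - 2 degrees of freedom, and the normal critical value z_{α/2} are notation assumed here. The interval uses Fisher's z transformation and is approximate.

```latex
\begin{aligned}
\text{Hypothesis:}\quad & H_0:\ \rho = 0 \quad\text{vs.}\quad H_a:\ \rho \neq 0 \\[4pt]
\text{Statistic:}\quad  & t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
                          \ \sim\ t_{n-2} \ \text{ under } H_0 \\[4pt]
\text{Interval:}\quad   & \tanh\!\left( \tanh^{-1}(r) \pm \frac{z_{\alpha/2}}{\sqrt{n-3}} \right)
                          \quad \text{(approximate } 100(1-\alpha)\% \text{ interval for } \rho\text{)}
\end{aligned}
```

Under bivariate normality, ρ = 0 is equivalent to independence of X and Y, so rejecting H_0 is evidence of dependence.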

• # Residual Plots and Regression Assumptions

Recall that there are three basic assumptions about the random deviations (errors): they are independent, normally distributed, and have constant variance. In simple linear regression, we also assume that Y and X are linearly related. We shall consider the use of residual plots for examining the following types of departures from the assumed model.

1. The regression function is not linear.
2. The error terms do not have a constant variance.
3. The model fits all but one or a few outlying observations.
4. The error terms are not normally distributed.
5. The error terms are not independent.

Common graphical tools for assumption checking include:

1. Residual plot: a scatter plot of the residuals against X or the fitted values.
2. Absolute residual plot: a scatter plot of the absolute values of the residuals against X or the fitted values.
3. Normal probability plot of the residuals.
4. Time series plot of the residuals: a scatter plot of the residuals against time or index.

The time series plot of the residuals is strongly recommended whenever data are obtained in a time sequence. The purpose is to see whether the error terms are correlated over time (that is, whether they fail to be independent). When the error terms are independent, we expect the residuals to fluctuate in a more or less random pattern around the baseline 0.
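The quantities behind the first three plots can be computed with a few lines of code. The sketch below fits a least-squares line and returns the fitted values, residuals, absolute residuals, and the points of a normal probability plot. The function names, the plotting positions (i - 0.5)/n, and the six data points are assumptions made for illustration; actual plotting is left to whatever graphics tool is at hand.

```python
from statistics import NormalDist, mean

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def diagnostics(xs, ys):
    """Return the data needed for the residual-based diagnostic plots."""
    intercept, slope = fit_line(xs, ys)
    fitted = [intercept + slope * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    n = len(residuals)
    # Theoretical normal quantiles at plotting positions (i - 0.5) / n
    scores = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return {
        "fitted": fitted,                                  # x-axis of residual plot
        "residuals": residuals,                            # y-axis of residual plot
        "abs_residuals": [abs(e) for e in residuals],      # absolute residual plot
        "normal_plot": list(zip(scores, sorted(residuals))),  # normal probability plot
    }

# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
d = diagnostics(xs, ys)
for f, e in zip(d["fitted"], d["residuals"]):
    print(f"fitted = {f:6.3f}   residual = {e:+.3f}")
```

Plotting `residuals` against `fitted` (or against `xs`) gives the residual plot; a roughly straight line through the `normal_plot` points is consistent with normal errors.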

In the following example, since the observations are from independent individuals, we will just use the first three plots to do assumption checking.

EXAMPLE: A cardiology data set was collected by the University of Virginia School of Medicine. Two variables were examined: aortic valve area (AVA) and body surface area (BSA). Physiologically, as children grow, the intracardiac areas also grow. A linear model relating AVA to BSA, a proxy for physiological growth, has been widely accepted in medical science (see Gutgesell and Rembold, 1990, and its references). The top left, top right, bottom left, and bottom right plots in Figure (AVA vs. BSA) are, respectively, the scatter plot of AVA vs. BSA with the fitted regression line, the normal probability plot of the residuals, the residual plot with the baseline 0, and the absolute residual plot. We can see that (1) AVA and BSA seem to be linearly related and (2) the error variance increases with BSA. Point (2) is even more obvious in the absolute residual plot. Since we now know that the error terms do not have a common variance, the normal probability plot does not provide much information here. We just note that the errors are not normal when the points do not follow a straight line.

In Figure (log[AVA] vs. BSA), we draw the same plots but replace AVA by log[AVA]. The scatter plot shows that log[AVA] is not linearly related to BSA. We also notice a systematic pattern (curvature) in the residual plot, which indicates a departure from a linear model.

We now log-transform both AVA and BSA and obtain the plots in Figure (log[AVA] vs. log[BSA]). The model now fits all but one observation, in the bottom left corner. The residuals fluctuate in a more or less random pattern around the baseline 0. Also, apart from one point, the points in the normal probability plot roughly follow a straight line.
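One possible reason a log-log transformation can fix both curvature and non-constant variance: if the response follows a power law with multiplicative error, y = a · x^b · exp(e), then log y = log a + b · log x + e is linear with additive, constant-variance error. The sketch below illustrates this with synthetic data only; it does not use the cardiology data, and the values of `a`, `b`, and `sigma` are hypothetical.

```python
import math
import random
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

rng = random.Random(0)            # fixed seed so the run is reproducible
a, b, sigma = 1.2, 1.5, 0.2       # hypothetical power-law parameters and error SD

# Synthetic data: power law with multiplicative (log-normal) error
xs = [rng.uniform(0.3, 2.0) for _ in range(200)]
ys = [a * x ** b * math.exp(rng.gauss(0.0, sigma)) for x in xs]

# Regress log(y) on log(x); the slope estimates b, the intercept estimates log(a)
log_x = [math.log(x) for x in xs]
log_y = [math.log(y) for y in ys]
intercept, slope = fit_line(log_x, log_y)
print(f"estimated b = {slope:.3f}, estimated a = {math.exp(intercept):.3f}")
```

On the original scale the spread of y grows with x, while on the log-log scale the error term has the same standard deviation everywhere, which matches the pattern described in the example.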

• # Computer Lab

The goal of this lab is to learn how to use StataQuest to do data analysis and how to use the diagnostic plots provided by StataQuest to check the model assumptions. We will need the following menu command for the regression analysis: Statistics > Simple Regression. StataQuest provides many interesting diagnostic plots. To understand them, we can use the following example:

First open the data file reg.dta under the datagen directory. There are several variables in the file:

• X: the independent variable.
• e: the deviation which contains a random sample from a normal distribution.
• Y: Y=1+X+e
• :
• :
• : = Y except the observation corresponding to the max(X) is replaced by the original Y value plus 20.
• : = Y except the observation with the X value closest to is replaced by the original Y value plus 20.

You may look at the normal quantile plot of e and the scatter plot of e versus X, and then compare them with the diagnostic plots provided by StataQuest after you regress Y on X. You may also want to get the diagnostic plots after regressing each of the modified versions of Y on X. What do you learn from the plots? Now compare the least-squares lines for Y and its modified versions vs. X. What kind of observation has more potential to be influential?
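The comparison in this lab can also be imitated outside StataQuest. The sketch below does not read reg.dta; it simulates data from Y = 1 + X + e and then adds 20 to a single observation in two different places. The sample size, the range of X, the seed, and in particular the choice of "the X value closest to the mean of X" for the second modification are assumptions made for illustration.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def with_shift(ys, index, shift=20.0):
    """Copy of ys with `shift` added to the observation at `index`."""
    out = list(ys)
    out[index] += shift
    return out

rng = random.Random(1)                              # fixed seed, hypothetical setup
n = 30
xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
es = [rng.gauss(0.0, 1.0) for _ in range(n)]        # normal deviations
ys = [1.0 + x + e for x, e in zip(xs, es)]          # Y = 1 + X + e

i_max = max(range(n), key=lambda i: xs[i])          # observation with the largest X
x_bar = mean(xs)
# Assumption: the second modified observation is the one with X closest to mean(X)
i_mid = min(range(n), key=lambda i: abs(xs[i] - x_bar))

base = fit_line(xs, ys)
high = fit_line(xs, with_shift(ys, i_max))
mid = fit_line(xs, with_shift(ys, i_mid))

for label, (b0, b1) in [("original", base), ("shift at max X", high), ("shift near mean X", mid)]:
    print(f"{label:18s} intercept = {b0:6.3f}  slope = {b1:6.3f}")
```

Adding d to one response changes the least-squares slope by exactly d · (x_i - x̄) / S_xx, so the same shift moves the slope much more when it happens far from the mean of X than when it happens near it.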

• # Concept Lab

• Ch 6: Bivariate Descriptive Statistics Least Squares


### Internet references

Stat Trek: Important Statistics Formulas


The webmaster and author of this Math Help site is Graeme McRae.