Previously, we looked at one measurement on each observation (or individual), say X is height. Now we're interested in more than one measurement per observation, say X is height and Y is weight. Suppose we take these measurements on n individuals. Then our data are the n pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
Scatterplots are a good visual means of understanding patterns in bivariate numerical data, much as histograms are for a single variable. Construction of a scatterplot is straightforward: each point on a scatterplot corresponds to one bivariate observation, plotted at its X value on the horizontal axis and its Y value on the vertical axis.
The scatterplot gives us a visual means of seeing relationships between the two variables. We call a relationship positive if an increase in one variable corresponds to an increase in the other. When one variable increases and the other decreases, we call the relationship negative.
What can a scatterplot tell us? In general terms, it gives us an idea of what kind of relationship (or pattern) our bivariate data has. We may have, for example, a positive relationship, a negative relationship, or no apparent relationship.
Scatterplots can also give us visual evidence of outliers or suspicious observations (details in Weeks 12 and 13).
NOTE: Scatterplots are used only for quantitative variables (those that can be compared numerically). Examples of quantitative variables are height, weight, rates, and counts. Examples of qualitative variables (those that cannot be compared numerically) are color, type of car, and sex.
As with the other graphical methods we've discussed (e.g., histograms), there are numerical statistics that give us a more precise description of what the plot shows, in this case bivariate relationships. The two major ones we'll discuss are correlation and linear regression.
To get a measure of how strongly X and Y values are related, we will use the correlation coefficient. Correlation is concerned with trends: if X increases, does Y tend to increase or decrease? How much? How strong is this tendency?
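As a rough illustration of what the correlation coefficient measures, here is a minimal Python sketch. It assumes the usual Pearson formula, r = Sxy / sqrt(Sxx * Syy), where Sxy, Sxx, and Syy are sums of products of deviations from the means. The function name `correlation` and the height and weight values are made up for the example and are not course data.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r of paired data (xs[i], ys[i])."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data: heights (inches) and weights (pounds).
heights = [60, 62, 65, 68, 70, 72]
weights = [115, 120, 140, 155, 165, 180]

r = correlation(heights, weights)
print(f"r = {r:.3f}")
```

A value of r near +1 indicates a strong positive trend, a value near -1 a strong negative trend, and a value near 0 little or no linear trend.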
Recall the equation of a line from algebra:
$$Y = b_0 + b_1 X$$
(You may have seen Y = mX + b; we are going to change notation slightly.) Above, $b_1$ is called the slope of the line and $b_0$ is the Y-intercept. The slope measures the amount Y increases when X increases by one unit. The Y-intercept is the value of Y when X = 0.
Our objective is to fit a straight line to points on a scatterplot that do not lie along a straight line (see the figure above). So we want to find an intercept $b_0$ and a slope $b_1$ such that the line $Y = b_0 + b_1 X$ fits the data as well as possible. First, we need to define what we mean by a "best" fit. We want a line that is in some sense closest to all of the data points simultaneously. In statistics, we define a residual, $e_i$, as the vertical distance between a point and the line,
$$e_i = y_i - (b_0 + b_1 x_i)$$
(see the vertical line in the figure). Since residuals can be positive or negative, we square them to remove the sign. By adding up all of the squared residuals, we get a measure of how far away from the data our line is. Thus, the "best" line is the one that has the minimum sum of squared residuals, i.e., the one that minimizes $\sum_{i=1}^{n} e_i^2$. This method of finding a line is called least squares.
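To make the criterion concrete, the sketch below computes the sum of squared residuals for two candidate lines over the same data. The function name `sum_squared_residuals`, the parameter names, and the data values are hypothetical and chosen only for illustration.

```python
def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line
    Y = intercept + slope * X."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

ssr_a = sum_squared_residuals(xs, ys, intercept=0.0, slope=2.0)  # Y = 0 + 2X
ssr_b = sum_squared_residuals(xs, ys, intercept=1.0, slope=1.5)  # Y = 1 + 1.5X

print(f"line A: {ssr_a:.2f}")
print(f"line B: {ssr_b:.2f}")
```

Of these two candidates, line A has the smaller sum of squared residuals, so it is the better fit by this criterion. Least squares picks the line that makes this sum as small as possible over all intercepts and slopes.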
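The formulas for the slope $b_1$ and intercept $b_0$ of the least squares line are
$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
Using algebra, we can express the slope as
$$b_1 = r \, \frac{s_y}{s_x}$$
where $r$ is the correlation coefficient and $s_x$ and $s_y$ are the standard deviations of the x's and y's.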
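The following Python sketch computes a least squares line using the standard formulas: slope = Sxy / Sxx and intercept = (mean of y) - slope * (mean of x). It then checks numerically that the slope also equals r times the ratio of the standard deviations, s_y / s_x. The function names and the data are made up for illustration.

```python
import math

def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def slope_from_correlation(xs, ys):
    """Slope computed as r * (s_y / s_x), for comparison with least_squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)   # correlation coefficient
    s_x = math.sqrt(sxx / (n - 1))   # sample standard deviation of x
    s_y = math.sqrt(syy / (n - 1))   # sample standard deviation of y
    return r * s_y / s_x

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = least_squares(xs, ys)
slope_alt = slope_from_correlation(xs, ys)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

For this data the two slope computations agree up to floating-point rounding, which is what the algebraic identity predicts.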
A statistic that is widely used to determine how well a regression fits is the coefficient of determination (the square of the multiple correlation coefficient), $R^2$. $R^2$ represents the fraction of variability in y that can be explained by the variability in x. In other words, $R^2$ tells us how much of the variability in the y's can be explained by the fact that they are related to x, i.e., how close the points are to the line. The equation for $R^2$ is
$$R^2 = 1 - \frac{\text{SSResid}}{\text{SSTotal}}$$
where SSResid $= \sum (y_i - \hat{y}_i)^2$ is the sum of squared residuals about the fitted line (with $\hat{y}_i$ the fitted value at $x_i$) and SSTotal $= \sum (y_i - \bar{y})^2$ is the total sum of squares of the data.
NOTE: In simple linear regression, $R^2$ is simply the square of the correlation coefficient.
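As a numerical check of this note, the sketch below fits a least squares line, computes R^2 as 1 - SSResid / SSTotal, and compares it with the square of the correlation coefficient. The function name `r_squared_and_r` and the data are made up for illustration.

```python
import math

def r_squared_and_r(xs, ys):
    """Return (R^2, r) for the least squares line through paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    ss_total = sum((y - mean_y) ** 2 for y in ys)

    # Least squares line.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Sum of squared residuals about the fitted line.
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

    r2 = 1 - ss_resid / ss_total
    r = sxy / math.sqrt(sxx * ss_total)
    return r2, r

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r2, r = r_squared_and_r(xs, ys)
print(f"R^2 = {r2:.4f}, r^2 = {r ** 2:.4f}")
```

The two printed values should match, and a value close to 1 means the points lie close to the fitted line.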
Applicable StataQuest Commands:
Summaries > Means and SDs
Summaries > Means and SDs by group > One-way of means
Summaries > Tables > One-way (frequency), to create relative frequency tables used in histograms
Graphs > One variable > Histogram > Continuous variable
Graphs > One variable > Box plot
Graphs > One variable > Stem-and-leaf
Graphs > One variable by group > Histograms by group > Continuous variable, to compare data in one column with the group indicator in another column
Graphs > One variable by group > Box plot by group, to compare data in one column with the group indicator in another column
Graphs > Comparison of variables > Boxplot comparison, to compare data in multiple columns
The webmaster and author of this Math Help site is Graeme McRae.