 |
Previously we looked at one measurement on each observation (or individual),
say X is height. Now we are interested in more than one measurement
per observation (individual), say X is height and Y is weight.
Suppose we take these measurements on n individuals.
Then our data are the n pairs
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
 | Scatterplots
 |
Scatterplots, like histograms, are a good visual means of
understanding patterns in bivariate numerical data. Construction of a
scatterplot is straightforward: each point on the scatterplot corresponds
to one bivariate observation, plotted with its X value on the horizontal
axis and its Y value on the vertical axis.
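As a rough illustration of how a scatterplot is built, here is a minimal Python sketch that draws a text-only scatterplot, placing one `*` per (x, y) pair. The function name `ascii_scatter`, the grid size, and the height/weight numbers are all made up for illustration; in practice you would use a statistics or plotting package.

```python
def ascii_scatter(xs, ys, width=20, height=10):
    """Return a text scatterplot with one '*' per (x, y) observation."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need equally long, non-empty x and y lists")
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Rescale each coordinate to a column / row index on the grid.
        col = 0 if x_max == x_min else round((x - x_min) / (x_max - x_min) * (width - 1))
        row = 0 if y_max == y_min else round((y - y_min) / (y_max - y_min) * (height - 1))
        # Row 0 of the grid is the top line, so flip the vertical index.
        grid[height - 1 - row][col] = "*"
    return "\n".join("".join(line) for line in grid)


# Hypothetical data: heights (inches) and weights (pounds) of six people.
heights = [60, 62, 65, 68, 70, 72]
weights = [115, 120, 140, 155, 165, 180]
print(ascii_scatter(heights, weights))
```

With these made-up numbers the stars run from the lower left to the upper right of the grid, which is what a positive relationship looks like.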
The scatterplot gives us a visual means of seeing relationships
between the two variables. We call a relationship positive if
an increase in one variable corresponds to an increase in the other.
When one variable increases and the other decreases, we call the
relationship negative.
What can a scatterplot tell us? In general terms, it gives us an idea
of what kind of relationships (or patterns) our bivariate data has. We
may have
 | Positive (negative) linear relationship
 | Positive (negative) curved relationship
 | Other relationships
 | No relationship
Scatterplots can also give us visual evidence of outliers or
suspicious observations (details in Weeks 12 and 13).
NOTE: Scatterplots are used
only for quantitative variables (those that can be compared numerically).
Examples of quantitative variables are height, weight, rates, and counts.
Examples of qualitative variables (those that cannot be compared
numerically) are color, type of car, and sex.
Just like other graphical methods we've discussed, e.g., histograms,
there are numerical statistics which give us a more precise description
of bivariate relationships. The two major ones we'll discuss are correlation
and linear regression.
 | Correlation
 |
To get a measure of how strongly X and Y values are
related, we will use the correlation coefficient. Correlation is
concerned with trends: if X increases, does Y
tend to increase or decrease? How much? How strong is this tendency?
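To make the idea concrete, here is a minimal Python sketch of one common way to compute a correlation coefficient. It assumes the usual Pearson sample correlation, r = Sxy / sqrt(Sxx * Syy), where Sxy, Sxx and Syy are sums of products of deviations from the means. The function name `correlation` and the data are hypothetical.

```python
import math


def correlation(xs, ys):
    """Pearson sample correlation coefficient of paired data (assumed formula)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squared deviations and of cross-products of deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: heights (inches) and weights (pounds) of six people.
heights = [60, 62, 65, 68, 70, 72]
weights = [115, 120, 140, 155, 165, 180]
print(correlation(heights, weights))
```

A value near +1 indicates a strong positive linear trend, a value near -1 a strong negative linear trend, and a value near 0 little or no linear trend.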
 | Least Squares Line
 |
Recall the equation of a line from algebra:
$Y = b_0 + b_1 X$.
(You may have seen $Y = mX + b$; we are going to
change notation slightly.) Above,
$b_1$ is called the slope of the line and
$b_0$ is the Y-intercept. The slope measures the amount Y
increases when X increases by one unit. The Y-intercept is
the value of Y when X = 0.
Our objective is to fit a straight line to points on a scatterplot
that do not lie exactly along a straight line (see the figure above). So we want
to find
$b_0$ and $b_1$
such that the line
$Y = b_0 + b_1 X$
fits the data as well as possible. First, we need to define what we mean
by a ``best'' fit. We want a line that is in some sense closest to all
of the data points simultaneously. In statistics, we define a residual,
$e_i$, as
the vertical distance between a point and the line,
$e_i = y_i - (b_0 + b_1 x_i)$
(see the vertical line in the figure). Since residuals can be positive
or negative, we will square them to remove the sign. By adding up all of
the squared residuals, we get a measure of how far away from the data
our line is. Thus, the ``best'' line will be the one which has the minimum
sum of squared residuals, i.e., min
$\sum_{i=1}^{n} e_i^2$. This method of finding a line is called least squares.
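For readers who want to see how the minimization works, here is a brief sketch using calculus. It assumes the line is written as $y = b_0 + b_1 x$ (the symbols $b_0$ and $b_1$ are an assumed notation) and that the residual for point $i$ is $y_i - b_0 - b_1 x_i$. Setting the two partial derivatives of the sum of squared residuals to zero gives a pair of linear equations, often called the normal equations:

```latex
\begin{align*}
S(b_0, b_1) &= \sum_{i=1}^{n} \bigl(y_i - b_0 - b_1 x_i\bigr)^2 \\
\frac{\partial S}{\partial b_0} = -2 \sum_{i=1}^{n} \bigl(y_i - b_0 - b_1 x_i\bigr) = 0
  &\quad\Longrightarrow\quad
  \sum_{i=1}^{n} y_i = n\, b_0 + b_1 \sum_{i=1}^{n} x_i \\
\frac{\partial S}{\partial b_1} = -2 \sum_{i=1}^{n} x_i \bigl(y_i - b_0 - b_1 x_i\bigr) = 0
  &\quad\Longrightarrow\quad
  \sum_{i=1}^{n} x_i y_i = b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2
\end{align*}
```

Solving this pair of equations for $b_0$ and $b_1$ yields the least squares intercept and slope.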
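The formulas for the slope and intercept of the least squares line
are
$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $b_0 = \bar{y} - b_1 \bar{x}$,
where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y.
Using algebra, we can express the slope
as
$b_1 = r \, \frac{s_y}{s_x}$,
where $r$ is the correlation coefficient and $s_x$ and $s_y$ are the sample standard deviations of X and Y.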
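The least squares calculation can be sketched in a few lines of Python. This is a minimal sketch, assuming the line is written as y = b0 + b1*x with slope b1 = Sxy / Sxx and intercept b0 = y_bar - b1 * x_bar; the function names and the data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1), the intercept and slope of the least squares line."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared vertical distances from the points to y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data that do not lie exactly on a line.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
b0, b1 = least_squares_line(xs, ys)
print("intercept:", b0, "slope:", b1)
print("sum of squared residuals:", sum_squared_residuals(xs, ys, b0, b1))
```

Trying any other intercept or slope in `sum_squared_residuals` with the same data gives a larger value, which is exactly what makes this the ``best'' line in the least squares sense.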

 | Coefficient of Determination
 |
A statistic that is widely used to determine how well a regression
fits is the coefficient of determination (the square of the multiple correlation
coefficient),
$R^2$.
$R^2$ represents the fraction of variability in y that can be explained
by the variability in x. In other words,
$R^2$ tells us how much of the variability in the y's can be explained
by the fact that they are related to x, i.e., how close the
points are to the line. The equation for
$R^2$ is
$R^2 = 1 - \frac{\text{SSResid}}{\text{SSTotal}}$,
where SSResid is the sum of the squared residuals (the squared vertical distances from the points to the line) and SSTotal $= \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares of the data.
NOTE: In the simple linear
regression case,
$R^2$ is simply the square of the correlation coefficient.
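Here is a minimal Python sketch of the coefficient of determination for a simple linear regression. It assumes the form R^2 = 1 - SSResid / SSTotal, where SSResid is the sum of squared residuals about the least squares line and SSTotal is the sum of squared deviations of the y values from their mean. The function name `r_squared` and the data are hypothetical.

```python
def r_squared(xs, ys):
    """Coefficient of determination for the least squares line of y on x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or ss_total == 0:
        raise ValueError("undefined when x or y is constant")
    # Fit the least squares line y = b0 + b1*x.
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Variability left over after fitting the line.
    ss_resid = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_resid / ss_total


# Hypothetical data that do not lie exactly on a line.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
print(r_squared(xs, ys))
```

For these made-up data the correlation coefficient is 0.8, and the function returns 0.64 (up to floating-point rounding), in line with the note that the coefficient of determination equals the squared correlation in simple linear regression.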
 | Computer Lab
 |
Applicable StataQuest Commands:
Summaries > Means and SDs
Summaries > Means and SDs by group > One-way of means
Summaries > Median/Percentiles
Summaries > Tables > One-way (frequency), to create relative frequency tables used in histograms
Graphs > One variable > Histogram > Continuous variable
Graphs > One variable > Box plot
Graphs > One variable > Stem-and-leaf
Graphs > One variable by group > Histograms by group > Continuous variable, to compare data in one column with the group indicator in another column
Graphs > One variable by group > Box plot by group, to compare data in one column with the group indicator in another column
Graphs > Comparison of variables > Boxplot comparison, to compare data in multiple columns
 | Concept Lab
 |
 | Ch 4: How Are Populations Distributed?
 | Ch 6: Bivariate Descriptive Statistics, Scatterplots I and II