We will now look at some other measures of location and spread.
 | The (100p)th percentile of a population, denoted η(p)
(η is the Greek letter eta), is the number such that (100p)% of the
population is ≤ η(p) and 100(1 − p)% of the population is ≥ η(p).
- The 95th percentile: 95% ≤ η(.95), 5% ≥ η(.95).
- The 90th percentile: 90% ≤ η(.90), 10% ≥ η(.90).
- The 75th percentile: 75% ≤ η(.75), 25% ≥ η(.75).
- The 50th percentile: 50% ≤ η(.50), 50% ≥ η(.50).
- The 25th percentile: 25% ≤ η(.25), 75% ≥ η(.25).
- The 10th percentile: 10% ≤ η(.10), 90% ≥ η(.10).
 | Quartiles: Values which divide the (ordered) data into fourths.
- Q1 (Lower Quartile): The 25th percentile.
- Q2 (Median): The 50th percentile.
- Q3 (Upper Quartile): The 75th percentile.
 | Calculating Sample Percentiles
- Order the n data values from lowest to highest.
- p=.50: Calculate the sample median.
- p=.25 or .75:
  - If n is even:
    Q1 = median of the lower half of the data.
    Q3 = median of the upper half of the data.
  - If n is odd:
    Q1 = median of the lower "half" of the data (including the median, x̃).
    Q3 = median of the upper "half" of the data (including x̃).
- p ≠ .25, .5, .75:
  - Compute np and round up; call this number m.
  - Use the mth point in order.
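The procedure above can be sketched in Python as follows. This is a minimal illustration of the rules in these notes, not a standard library routine: the names `median` and `sample_percentile` are made up for this example, and other textbooks and software packages use slightly different quartile conventions, so their results may differ.

```python
import math


def median(sorted_vals):
    """Median of a list that is already sorted."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2 == 1:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2


def sample_percentile(data, p):
    """Sample (100p)th percentile, 0 < p < 1, using the rules in these notes."""
    x = sorted(data)  # order the n data values from lowest to highest
    n = len(x)
    if p == 0.5:
        return median(x)
    if p in (0.25, 0.75):
        # For even n this is exactly half the data; for odd n each
        # "half" includes the median itself.
        half = (n + 1) // 2
        lower, upper = x[:half], x[n - half:]
        return median(lower) if p == 0.25 else median(upper)
    # Any other p: compute np and round up to m (the inner round() only
    # guards against floating-point noise such as 100 * 0.07).
    m = math.ceil(round(n * p, 9))
    return x[m - 1]  # the mth point in order (1-based)


data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15]
for p in (0.10, 0.25, 0.50, 0.75, 0.90):
    print(p, sample_percentile(data, p))
```

For these ten ordered values the median is 7.5, Q1 = 4, Q3 = 11, and the 90th percentile is 12 (np = 9, so the 9th point in order).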
 | Range: The maximum data value minus the minimum data value,
max − min.
 | Interquartile Range (IQR): The value Q3 − Q1.
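As a small sketch of these two measures of spread, the Python below computes the range and the interquartile range. The helper names (`quartiles`, `sample_range`, `iqr`) are hypothetical, and the quartiles are assumed to follow the "median of each half" rule from these notes.

```python
from statistics import median


def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves of the data."""
    x = sorted(data)
    n = len(x)
    half = (n + 1) // 2  # for odd n, both halves include the median
    return median(x[:half]), median(x[n - half:])


def sample_range(data):
    """Maximum data value minus minimum data value."""
    return max(data) - min(data)


def iqr(data):
    """Interquartile range, Q3 - Q1."""
    q1, q3 = quartiles(data)
    return q3 - q1


data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15]
print(sample_range(data), iqr(data))
```

For this sample the range is 15 − 2 = 13 and the IQR is 11 − 4 = 7.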
 | Boxplots
 |
The five-number summary is an abbreviated way to describe a
sample. It is a list of the following numbers:
- Minimum
- First (Lower) Quartile
- Median
- Third (Upper) Quartile
- Maximum
The five-number summary leads to a graphical representation of a
distribution called the boxplot. Boxplots are ideal for comparing
two nearly-continuous variables. To draw a boxplot (see the example in the
figure below), follow these simple steps:
- The ends of the box (hinges) are at the quartiles, so that the length
of the box is the IQR.
- The median is marked by a line within the box.
- The two vertical lines (called whiskers) outside the box
extend to the smallest and largest observations within 1.5 × IQR
of the quartiles.
- Observations that fall more than 3 × IQR beyond the quartiles
are called extreme outliers and are marked, for example, with
an open circle. Observations between 1.5 × IQR and 3 × IQR
beyond the quartiles
are called mild outliers and are distinguished by a different
mark, e.g., a closed circle.
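The numbers behind a boxplot can be sketched in Python as below. This assumes the common textbook convention that the whiskers reach to the most extreme observations within 1.5 × IQR of the quartiles, with mild outliers between 1.5 × IQR and 3 × IQR and extreme outliers beyond 3 × IQR. The function names and the sample data are invented for illustration.

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    x = sorted(data)
    n = len(x)
    half = (n + 1) // 2  # for odd n, both halves include the median
    return x[0], median(x[:half]), median(x), median(x[n - half:]), x[-1]


def classify_outliers(data, inner=1.5, outer=3.0):
    """Split observations into mild and extreme outliers; find whisker ends."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    mild, extreme = [], []
    for v in sorted(data):
        # Distance beyond the nearer quartile (0 if v lies inside the box).
        dist = max(q1 - v, v - q3, 0)
        if dist > outer * iqr:
            extreme.append(v)
        elif dist > inner * iqr:
            mild.append(v)
    inside = [v for v in data if v not in mild and v not in extreme]
    whiskers = (min(inside), max(inside))
    return mild, extreme, whiskers


data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 24, 40]  # made-up sample
print(five_number_summary(data))
print(classify_outliers(data))
```

Here Q1 = 4.5 and Q3 = 11.5, so IQR = 7. The value 24 lies 12.5 above Q3 (between 10.5 and 21) and is a mild outlier; 40 lies 28.5 above Q3 and is an extreme outlier; the whiskers run from 2 to 12.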
EXAMPLE: To illustrate boxplots,
the figure below puts boxplots side by side of the same four data sets that
had histograms in the figure in Week 1.
 | Normal Quantile Plots
 |
In many places during this course we will assume that a sample comes from
a population having the normal (bell-shaped) distribution. A plot based on
percentiles that seeks to verify this assumption is called the normal
quantile plot. This is a scatterplot of the percentiles of the data
versus the percentiles of a population in fact having the normal
distribution. If the data do come from a normal population, the resulting
points should fall closely along a straight line.
To illustrate this, the figure below shows the normal quantile plot of a
random sample of 50 IQs (we said earlier that IQs do in fact follow a
normal distribution). Notice how the points closely follow the line.
To better understand the information that normal quantile plots
provide and the relationship among distributions, histograms, boxplots
and normal quantile plots, we can look at the figure on the previous page.
The four plots in the first row show the distributions from which the data
are sampled. The second, third and fourth rows are, respectively,
the corresponding histograms, boxplots and normal quantile plots. The four
distributions are normal, long tailed, short tailed and skewed,
respectively.
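The coordinates of a normal quantile plot can be sketched with Python's standard library. This is only an illustration under stated assumptions: the plotting positions (i − 0.5)/n are one common choice among several, the function name `normal_quantile_points` is hypothetical, and the sample values are made up rather than taken from the IQ data in the figure.

```python
from statistics import NormalDist


def normal_quantile_points(data):
    """Pair each ordered data value with a standard normal percentile."""
    x = sorted(data)
    n = len(x)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # The ith smallest value is matched with the normal percentile at
    # probability (i - 0.5) / n, an assumed plotting position.
    return [(std_normal.inv_cdf((i - 0.5) / n), x[i - 1])
            for i in range(1, n + 1)]


sample = [96, 104, 88, 111, 100, 93, 107, 118, 82, 101]  # made-up values
for z, value in normal_quantile_points(sample):
    print(round(z, 2), value)
```

Plotting the second coordinate against the first gives the scatterplot described above; points lying close to a straight line are consistent with a normal population.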
 | Chebychev and Empirical Rules
 |
Knowing the mean and standard deviation of a sample or a population gives
us a good idea of where most of the data values are because of the following
two rules:
 | Chebychev's Rule: The proportion of observations within k standard
deviations of the mean, where k ≥ 1, is at least 1 − 1/k², i.e., at
least 75%, roughly 89%, and roughly 94% of the data are within 2, 3, and 4
standard deviations of the mean, respectively.
 | Empirical Rule: If data follow a bell-shaped curve, then approximately
68%, 95%, and 99.7% of the data are within 1, 2, and 3 standard
deviations of the mean, respectively.
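The two rules can be compared numerically with a short Python sketch. The Chebychev bound is computed as 1 − 1/k²; for the empirical rule, the code assumes that "bell-shaped" means exactly normal and uses the standard normal distribution from the standard library. Both function names are hypothetical.

```python
from statistics import NormalDist


def chebychev_lower_bound(k):
    """Minimum proportion within k standard deviations (any distribution)."""
    if k < 1:
        raise ValueError("the bound is only meaningful for k >= 1")
    return 1 - 1 / k**2


def normal_proportion_within(k):
    """Proportion within k standard deviations for an exactly normal curve."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3, 4):
    print(k, round(chebychev_lower_bound(k), 4),
          round(normal_proportion_within(k), 4))
```

For k = 2, 3, 4 the Chebychev bounds are 0.75, 0.8889 and 0.9375, while the normal proportions for k = 1, 2, 3 are about 0.6827, 0.9545 and 0.9973, which the empirical rule rounds to 68%, 95% and 99.7%.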
EXAMPLE: A pharmaceutical company
manufactures vitamin pills which contain an average of 507 grams of vitamin
C with a standard deviation of 3 grams. Using Chebychev's rule, we know that
at least 1 − 1/2² = 3/4,
or 75%, of the vitamin pills are within k=2 standard deviations of
the mean. That is, at least 75% of the vitamin pills will have
between 501 and 513 grams of vitamin C, i.e., a content in the interval
[507-2(3),507+2(3)]=[501,513].
EXAMPLE: If the distribution of
vitamin C amounts in the previous example is bell shaped, then we can get
even more precise results by using the empirical rule. Under these
conditions, approximately 68% of the vitamin pills have a vitamin C content
in the interval [507-3,507+3]=[504,510], 95% are in the interval
[507-2(3),507+2(3)]=[501,513], and 99.7% are in the interval
[507-3(3),507+3(3)]=[498,516].
NOTE: Chebychev's rule gives only a
minimum proportion of observations which lie within k
standard deviations of the mean.
 |
 | Z-Scores
 |
Z-scores are a means of answering the question "how many
standard deviations away from the mean is this observation?" If our
observation X is from a population with mean μ
and standard deviation σ, then
Z = (X − μ)/σ.
On the other hand, if the observation X is from a sample with mean x̄
and standard deviation s, then
Z = (X − x̄)/s.
A positive (negative) Z-score indicates that the observation is
greater than (less than) the mean.
EXAMPLE: In a certain city the mean
price of a quart of milk is 63 cents and the standard deviation is 8 cents.
The average price of a package of bacon is $1.80 and the standard deviation
is 15 cents. If we pay $0.89 for a quart of milk and $2.19 for a package of
bacon at a 24-hour convenience store, which is relatively more expensive? To
answer this, we compute Z-scores for each (prices in cents):
Milk: Z = (89 − 63)/8 = 3.25
Bacon: Z = (219 − 180)/15 = 2.60
Our Z-scores show us that we are overpaying quite a bit more for
the milk than we are for the bacon.
Because of the Empirical rule (or Chebychev's rule), the Z-score
of a given observation also provides insight into how "typical" this
observation is of the population. For example, by the empirical rule, if
data follow a bell-shaped curve, then approximately 95% of the data should
have a Z-score between −2 and 2.
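The milk and bacon comparison can be reproduced with a few lines of Python. The function names `z_score` and `is_typical` are made up for this sketch, and the cutoff of 2 standard deviations simply mirrors the 95% statement of the empirical rule, so it presumes roughly bell-shaped price distributions.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def is_typical(z, cutoff=2.0):
    """True if the Z-score is within `cutoff` standard deviations of the mean."""
    return abs(z) <= cutoff


# Prices in cents, taken from the example in the text.
milk_z = z_score(89, 63, 8)
bacon_z = z_score(219, 180, 15)

print("milk:", milk_z, "typical?", is_typical(milk_z))
print("bacon:", bacon_z, "typical?", is_typical(bacon_z))
```

The milk price is 3.25 standard deviations above its mean and the bacon price is 2.6 above its mean, so both fall outside the range from −2 to 2, with the milk being the more unusual of the two.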