Univariate Descriptive Statistics


Percentiles

We will now look at some other measures of location and spread.

 
- The (100p)th percentile of a population, called η(p) (η is the Greek letter eta), is the number such that (100p)% of the population is ≤ η(p) and 100(1 - p)% of the population is ≥ η(p).
  - The 95th percentile: 95% ≤ η(.95), 5% ≥ η(.95).
  - The 90th percentile: 90% ≤ η(.90), 10% ≥ η(.90).
  - The 75th percentile: 75% ≤ η(.75), 25% ≥ η(.75).
  - The 50th percentile: 50% ≤ η(.50), 50% ≥ η(.50).
  - The 25th percentile: 25% ≤ η(.25), 75% ≥ η(.25).
  - The 10th percentile: 10% ≤ η(.10), 90% ≥ η(.10).
- Quartiles: Values which divide the (ordered) data into fourths.
  - Q1 (Lower Quartile): The 25th percentile.
  - Q2 (Median): The 50th percentile.
  - Q3 (Upper Quartile): The 75th percentile.
- Calculating Sample Percentiles
  1. Order the n data values from lowest to highest.
  2. p = .50: Calculate the sample median.
  3. p = .25 or .75:
    - If n is even:
      - Q1 = median of the lower half of the data.
      - Q3 = median of the upper half of the data.
    - If n is odd:
      - Q1 = median of the lower "half" of the data (including $\tilde{X}$, the median).
      - Q3 = median of the upper "half" of the data (including $\tilde{X}$).
  4. p ≠ .25, .5, .75:
    - Compute np and round up; call this number m.
    - Use the mth point in order.
- Range: The maximum data value minus the minimum data value: $\max - \min$.
- Interquartile Range (IQR): The value $Q_3 - Q_1$.
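The four-step rule for sample percentiles can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `sample_percentile` and the data list `scores` are made up for the example, and the code assumes 0 < p < 1. Other textbooks and software packages use slightly different quartile conventions, so results may differ from theirs.

```python
import math
from statistics import median

def sample_percentile(data, p):
    """Sample (100p)th percentile by the four-step rule described above.

    Assumes 0 < p < 1 and at least one data value (hypothetical helper).
    """
    xs = sorted(data)                      # step 1: order the n values
    n = len(xs)
    half = n // 2
    if p == 0.5:                           # step 2: the sample median
        return median(xs)
    if p == 0.25:                          # step 3: median of the lower half
        return median(xs[:half + n % 2])   # odd n: slice includes the median
    if p == 0.75:                          # step 3: median of the upper half
        return median(xs[half:])           # odd n: slice includes the median
    m = math.ceil(n * p)                   # step 4: compute np and round up
    return xs[m - 1]                       # the m-th ordered value (1-based)

# Made-up data, for illustration only.
scores = [7, 1, 3, 9, 5, 11, 13, 15]
q1 = sample_percentile(scores, 0.25)
q3 = sample_percentile(scores, 0.75)
print("median:", sample_percentile(scores, 0.5))
print("Q1, Q3:", q1, q3)
print("90th percentile:", sample_percentile(scores, 0.9))
print("range:", max(scores) - min(scores))
print("IQR:", q3 - q1)
```

Because np is computed in floating point, a product that should be a whole number can occasionally land just above or below it; for careful work one might compute np with exact fractions instead.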

 


Boxplots

The five-number summary is an abbreviated way to describe a sample. It is a list of the following numbers:

  1. Minimum
  2. First (Lower) Quartile, $Q_1$
  3. Median, $\tilde{X}$
  4. Third (Upper) Quartile, $Q_3$
  5. Maximum

The five-number summary leads to a graphical representation of a distribution called the boxplot. Boxplots are ideal for comparing two nearly-continuous variables. To draw a boxplot (see the example in the figure below), follow these simple steps:

  1. The ends of the box (hinges) are at the quartiles, so that the length of the box is the $\mathrm{IQR}$.
  2. The median is marked by a line within the box.
  3. The two vertical lines (called whiskers) outside the box extend to the smallest and largest observations within $1.5 \times \mathrm{IQR}$ of the quartiles.
  4. Observations that fall more than $3 \times \mathrm{IQR}$ beyond the quartiles are called extreme outliers and are marked, for example, with an open circle. Observations between $1.5 \times \mathrm{IQR}$ and $3 \times \mathrm{IQR}$ beyond the quartiles are called mild outliers and are distinguished by a different mark, e.g., a closed circle.
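The numbers needed to draw a boxplot by these steps can be computed with a short Python sketch. It assumes the conventional multipliers of 1.5 and 3 times the IQR for the whisker limits and the extreme-outlier limits, and it takes the quartiles as medians of the lower and upper halves of the ordered data. The function name `boxplot_summary` and the data list `sample` are hypothetical, chosen only for this example.

```python
from statistics import median

def boxplot_summary(data):
    """Five-number summary, whisker ends, and outliers for a boxplot.

    Assumes fences at 1.5 x IQR (whiskers) and 3 x IQR (extreme outliers).
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    q1 = median(xs[:half + n % 2])     # lower quartile (odd n includes median)
    q3 = median(xs[half:])             # upper quartile (odd n includes median)
    iqr = q3 - q1                      # length of the box
    inner_lo, inner_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_lo, outer_hi = q1 - 3 * iqr, q3 + 3 * iqr
    inside = [x for x in xs if inner_lo <= x <= inner_hi]
    mild = [x for x in xs
            if outer_lo <= x < inner_lo or inner_hi < x <= outer_hi]
    extreme = [x for x in xs if x < outer_lo or x > outer_hi]
    return {
        "five_number": (xs[0], q1, median(xs), q3, xs[-1]),
        "whiskers": (inside[0], inside[-1]),   # most extreme non-outliers
        "mild_outliers": mild,
        "extreme_outliers": extreme,
    }

# Made-up data, for illustration only.
sample = [2, 4, 5, 6, 7, 8, 9, 10, 20, 30]
for name, value in boxplot_summary(sample).items():
    print(name, value)
```

With this made-up sample the box runs from 5 to 10, so the whiskers stop at 2 and 10, the value 20 is flagged as a mild outlier, and 30 as an extreme outlier.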

EXAMPLE: To illustrate boxplots, the figure below shows side-by-side boxplots of the same four data sets whose histograms appeared in the figure in Week 1.

 


Normal Quantile Plots

In many places during this course we will assume that a sample comes from a population having the normal (bell-shaped) distribution. A plot based on percentiles that is used to check this assumption is called the normal quantile plot. This is a scatterplot of the percentiles of the data versus the percentiles of a population that does have the normal distribution. If the data do come from a normal population, the resulting points should fall closely along a straight line.

To illustrate this, the figure below shows the normal quantile plot of a random sample of 50 IQs (we said earlier that IQs do in fact follow a normal distribution). Notice how the points closely follow the line.
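The coordinates of a normal quantile plot can be computed with a small Python sketch like the one below. It rests on assumptions that are not part of these notes: the plotting positions (i - 0.5)/n are one common convention among several, the correlation between the two sets of quantiles is used here only as a rough numeric stand-in for "the points fall along a straight line", and the names `normal_quantile_points` and `straightness` are hypothetical.

```python
from statistics import NormalDist, mean

def normal_quantile_points(data):
    """Pairs (standard normal quantile, ordered data value).

    Uses plotting positions (i - 0.5)/n, an assumed convention.
    """
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()          # mean 0, standard deviation 1
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(xs, start=1)]

def straightness(points):
    """Correlation of the plotted points; values near 1 suggest a line."""
    zs = [z for z, _ in points]
    xs = [x for _, x in points]
    mz, mx = mean(zs), mean(xs)
    num = sum((z - mz) * (x - mx) for z, x in points)
    den = (sum((z - mz) ** 2 for z in zs)
           * sum((x - mx) ** 2 for x in xs)) ** 0.5
    return num / den

# Made-up right-skewed data, for illustration only.
skewed = [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]
points = normal_quantile_points(skewed)
for z, x in points:
    print(f"{z:7.3f}  {x}")
print("straightness:", round(straightness(points), 3))
```

Plotting the second coordinate against the first gives the normal quantile plot; for skewed data such as this made-up sample, the points bend away from a straight line.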

 

 

 

 

 

 

 

To better understand what normal quantile plots tell us, and how distributions, histograms, boxplots, and normal quantile plots relate to one another, look at the figure on the previous page. The four plots in the first row show the distributions from which the data are sampled. The second, third, and fourth rows show the corresponding histograms, boxplots, and normal quantile plots, respectively. The four distributions are normal, long-tailed, short-tailed, and skewed, respectively.

 


Chebychev and Empirical Rules

Knowing the mean and standard deviation of a sample or a population gives us a good idea of where most of the data values are because of the following two rules:

 
- Chebychev's Rule: The proportion of observations within k standard deviations of the mean, where $k > 1$, is at least $1 - 1/k^2$, i.e., at least 75%, 89%, and 94% of the data are within 2, 3, and 4 standard deviations of the mean, respectively.
- Empirical Rule: If data follow a bell-shaped curve, then approximately 68%, 95%, and 99.7% of the data are within 1, 2, and 3 standard deviations of the mean, respectively.

EXAMPLE: A pharmaceutical company manufactures vitamin pills which contain an average of 507 grams of vitamin C with a standard deviation of 3 grams. Using Chebychev's rule, we know that at least

$$1 - \frac{1}{2^2} = 1 - \frac{1}{4} = 0.75$$

or 75% of the vitamin pills are within k=2 standard deviations of the mean. That is, at least 75% of the vitamin pills will have between 501 and 513 grams of vitamin C, i.e.,

$$[507 - 2(3),\; 507 + 2(3)] = [501, 513].$$

 

EXAMPLE: If the distribution of vitamin C amounts in the previous example is bell shaped, then we can get even more precise results by using the empirical rule. Under these conditions, approximately 68% of the vitamin pills have a vitamin C content in the interval [507-3,507+3]=[504,510], 95% are in the interval [507-2(3),507+2(3)]=[501,513], and 99.7% are in the interval [507-3(3),507+3(3)]=[498,516].

NOTE: Chebychev's rule gives only a minimum proportion of observations which lie within k standard deviations of the mean.
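To see how conservative the Chebychev bound is, the following Python sketch compares it with the proportion under an exactly normal curve, using the vitamin C figures from the example (mean 507, standard deviation 3). The function names `chebychev_bound`, `normal_proportion`, and `interval` are hypothetical, and the normal curve is used here as an assumed stand-in for "bell-shaped" data.

```python
from statistics import NormalDist

def chebychev_bound(k):
    """Minimum proportion within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebychev's rule is informative only for k > 1")
    return 1 - 1 / k ** 2

def normal_proportion(k):
    """Proportion within k standard deviations for an exactly normal curve."""
    z = NormalDist()                   # standard normal distribution
    return z.cdf(k) - z.cdf(-k)

def interval(mu, sd, k):
    """The interval [mu - k*sd, mu + k*sd]."""
    return (mu - k * sd, mu + k * sd)

mu, sd = 507, 3                        # values from the vitamin C example
for k in (2, 3, 4):
    lo, hi = interval(mu, sd, k)
    print(f"k={k}: [{lo}, {hi}]  "
          f"Chebychev at least {chebychev_bound(k):.1%}, "
          f"normal curve about {normal_proportion(k):.1%}")
```

For k = 2 the bound guarantees only 75%, while a normal curve puts about 95% of the values in the same interval, which is why the empirical rule gives sharper statements when the bell-shaped assumption is reasonable.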

 


Z-Scores

Z-scores are a means of answering the question "how many standard deviations away from the mean is this observation?" If our observation X is from a population with mean $\mu$ and standard deviation $\sigma$, then

$$Z = \frac{X - \mu}{\sigma}.$$

On the other hand, if the observation X is from a sample with mean $\bar{X}$ and standard deviation s, then

$$Z = \frac{X - \bar{X}}{s}.$$

A positive (negative) Z-score indicates that the observation is greater than (less than) the mean.

EXAMPLE: In a certain city the mean price of a quart of milk is 63 cents and the standard deviation is 8 cents. The average price of a package of bacon is $1.80 and the standard deviation is 15 cents. If we pay $0.89 for a quart of milk and $2.19 for a package of bacon at a 24-hour convenience store, which is relatively more expensive? To answer this, we compute Z-scores for each:

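$$\begin{aligned}
Z_{\text{milk}} &= \frac{0.89 - 0.63}{0.08} = 3.25 \\
Z_{\text{bacon}} &= \frac{2.19 - 1.80}{0.15} = 2.60
\end{aligned}$$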

Our Z-scores show us that we are overpaying quite a bit more for the milk than we are for the bacon.
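The same comparison can be checked with a few lines of Python. The helper name `z_score` is hypothetical; the prices are the ones given in the example, converted to cents so that each item uses a single unit.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Prices in cents, taken from the example above.
milk = z_score(89, mean=63, sd=8)
bacon = z_score(219, mean=180, sd=15)

print(f"milk:  Z = {milk:.2f}")
print(f"bacon: Z = {bacon:.2f}")
print("relatively more expensive:", "milk" if milk > bacon else "bacon")
```

Working in standard-deviation units is what makes the two prices comparable, even though the items have different means and spreads.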

Because of the empirical rule (or Chebychev's rule), the Z-score of an observation also indicates how "typical" that observation is of the population. For example, by the empirical rule, if the data follow a bell-shaped curve, then approximately 95% of the data should have a Z-score between -2 and 2.

 

 


 

 

The webmaster and author of the Math Help site is Graeme McRae.