Categorical Data Analysis
   

   


Comparing More Than Two Proportions

In Week 7 we saw how to compare two population proportions, $p_1$ and $p_2$. In this section we consider proportions for more than two 0-1 populations. If we have K such populations and random samples of sizes $n_1, \ldots, n_K$ from them, then typically we want to test whether the true population proportions equal some hypothesized values $p_1, \ldots, p_K$. The most common example is testing whether all the proportions are the same.

If the null hypothesis is true, then we would expect to get $n_i p_i$ 1's (that is, 'successes') in the ith sample. If we let $O_i$ denote the actual number of 1's in the ith sample, then we can measure how far the observed data are from what we expect under the null hypothesis by the test statistic

displaymath5383

A large value of this statistic is evidence against the null hypothesis; a small value is not.

We reject the null hypothesis if tex2html_wrap_inline5385 .

 

One Categorical Variable

A slightly different situation from the previous section arises when we have a random sample of n objects, each of which falls into exactly one of K 'categories' (for example, roll a die 60 times; each roll shows one of 6 values), and for the ith category we observe $O_i$ occurrences among the n objects. Let $p_i$ represent the hypothesized proportion of objects in the whole population that fall into the ith category. We then expect $E_i = n p_i$ occurrences of category i, and we can measure the distance between observed and expected results by the same $\chi^2$ statistic as above. The only difference in the procedure is that the degrees of freedom are now K-1 rather than K, because the K counts must sum to n.


 

Goodness of Fit Testing

An example of the one-categorical-variable situation is when we have a random sample $X_1, \ldots, X_n$ of size n from a continuous population and we want to test whether the population has a particular distribution (such as normal). In this case we could divide the range of the data into K intervals and count how many of the X's fall into each interval. For example, if the hypothesis is that the X's come from a uniform distribution on the interval [0,1], we could divide [0,1] into the 10 intervals [0,.1), [.1,.2), and so on, and then count how many X's are in each interval. Again we call the observed counts $O_1, \ldots, O_K$.

From the hypothesized distribution, we can calculate how many X's should be in each interval (for the uniform example, 10% of the X's should fall in each interval). Again we have $E_i = n p_i$, where $p_i$ is the probability that an X falls in the ith interval.
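The binning procedure for the uniform example can be sketched in a few lines of Python. This is only an illustration: the function name `uniform_gof_statistic`, the sample size of 200 and the random seed are assumptions made for the sketch, not part of the original notes.

```python
import random

def uniform_gof_statistic(xs, k=10):
    """Goodness-of-fit statistic for Uniform[0, 1] using k equal-width intervals."""
    n = len(xs)
    observed = [0] * k
    for x in xs:
        # min() keeps x == 1.0 in the last interval instead of overflowing
        observed[min(int(x * k), k - 1)] += 1
    expected = n / k  # under the null each interval has probability 1/k
    stat = sum((o - expected) ** 2 / expected for o in observed)
    return observed, stat

random.seed(0)  # arbitrary seed, only so the run is repeatable
sample = [random.random() for _ in range(200)]  # hypothetical data
observed, stat = uniform_gof_statistic(sample)
df = 10 - 1  # K - 1, since no parameters were estimated
print(observed, round(stat, 3), df)
```

The statistic would then be compared with a chi-square distribution with 9 degrees of freedom.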

In some cases, we need to estimate the parameters of the hypothesized distribution. In testing for normality, for example, in order to find the $p_i$'s we need to know the mean and variance of the population. If we use $\bar{X}$ and $s^2$ as estimates of the true mean and variance, then we must further reduce the degrees of freedom of the $\chi^2$ statistic by 2 (one for each estimated parameter).

EXAMPLE: To see if there is a seasonal effect for homicide, 1361 crimes were classified by season: 334 of them happened in spring, 372 in summer, 327 in fall and 328 in winter. Do we have enough evidence to show that the crime frequencies differ across seasons? Let $p_i$, $i = 1, 2, 3, 4$, be the proportions of crimes for the four seasons, respectively.

  1. $H_0: p_1 = p_2 = p_3 = p_4 = 0.25$; $H_a$: at least one of the equalities does not hold.
  2. $E_i = n p_i$ = 1361 * 0.25 = 340.25, $i = 1, 2, 3, 4$.
  3. Chi-squared test statistic = 4.034 with d.f. = 4-1 = 3.
  4. The rejection region is $\chi^2 > \chi^2_{\alpha, 3}$ (for example, 7.815 at $\alpha = 0.05$).
  5. Fail to reject $H_0$ and conclude that there is not enough evidence to show that there is a seasonal effect on the crime rate.
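The arithmetic in steps 2 to 5 can be reproduced with a short Python sketch. The 5% significance level and the tabulated critical value 7.815 are assumptions made here for illustration; the variable names are likewise only illustrative.

```python
# Observed homicide counts by season, taken from the example above
observed = {"spring": 334, "summer": 372, "fall": 327, "winter": 328}

n = sum(observed.values())   # total number of crimes
expected = n * 0.25          # the same expected count for every season under H0
stat = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1       # K - 1 categories

# Upper 5% point of the chi-square distribution with 3 d.f. (standard table value);
# the choice of the 5% level is an assumption, not stated in the example.
critical_value = 7.815
reject = stat > critical_value

print(f"statistic = {stat:.3f}, df = {df}, reject H0: {reject}")
```

The statistic comes out at about 4.03, well below the critical value, in line with the conclusion in step 5.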
 

Inference for Two-Way Tables

We now turn to the case where we have objects that can be categorized in two ways (such as gender and political party).


 

Descriptive Tables

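A two-way table with r rows and c columns contains $r \times c$ sample counts. Let $n_{ij}$ denote the number of observations in the ith row and the jth column, $i = 1, \ldots, r$, $j = 1, \ldots, c$. The general form of the data table giving the sample counts is as follows:

            | Column
        Row |       1        2      ...        c |  Total
------------+------------------------------------+-------
          1 |    n_11     n_12      ...     n_1c |   n_1.
          2 |    n_21     n_22      ...     n_2c |   n_2.
        ... |     ...      ...      ...      ... |    ...
          r |    n_r1     n_r2      ...     n_rc |   n_r.
------------+------------------------------------+-------
      Total |    n_.1     n_.2      ...     n_.c |      n

 

In this table $n_{\cdot j}$ (the bottom row) are the column totals and $n_{i \cdot}$ (the last column) are the row totals. Thus,

$$n_{i \cdot} = \sum_{j=1}^{c} n_{ij}, \qquad n_{\cdot j} = \sum_{i=1}^{r} n_{ij}.$$

If the total sample size is n, then

$$n = \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} = \sum_{i=1}^{r} n_{i \cdot} = \sum_{j=1}^{c} n_{\cdot j}.$$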

For most $r \times c$ tables, a better understanding of the information contained in the table is obtained by examining the column proportions, where the entry in row i of the jth column is $n_{ij} / n_{\cdot j}$. The resulting entries in each column form the conditional distribution of the row variable given that value of the column variable. Note that the entries in each column should sum to 1 (apart from round-off error). Tables giving entries as proportions of row totals are also useful. Which description to use depends upon the particular set of data being analyzed and what questions are of interest.
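As a small illustration of column proportions, the following Python sketch divides each cell by its column total. The helper name `column_proportions` and the 2 x 3 table of counts are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def column_proportions(table):
    """Divide every cell by its column total (conditional distribution of rows given column)."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[cell / total for cell, total in zip(row, col_totals)] for row in table]

# Hypothetical 2 x 3 table of counts, for illustration only
counts = [[10, 20, 30],
          [30, 20, 10]]

props = column_proportions(counts)
for row in props:
    print([round(p, 2) for p in row])
```

Each column of the result sums to 1, which is a quick check that the proportions were taken over columns rather than rows.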


 

Models and Hypotheses

The test procedure for $r \times c$ count data is sufficiently general that it is valid under different assumptions about how the data were obtained. The one assumption that must be strictly maintained, however, is that each experimental unit is counted only once in the data table.

The following table is a summary of the population proportions where a single SRS is taken from a single population and each observation is classified into one cell of an $r \times c$ table.

            | Column
        Row |       1        2      ...        c |  Total
------------+------------------------------------+-------
          1 |    p_11     p_12      ...     p_1c |   p_1.
          2 |    p_21     p_22      ...     p_2c |   p_2.
        ... |     ...      ...      ...      ... |    ...
          r |    p_r1     p_r2      ...     p_rc |   p_r.
------------+------------------------------------+-------
      Total |    p_.1     p_.2      ...     p_.c |      1

 

The marginal proportions in this table are the sums of the proportions in the rows and columns. Here the $p_{i \cdot}$ are the row sums and the $p_{\cdot j}$ are the column sums. The marginal proportions are easily interpreted as probabilities. Each $p_{i \cdot}$ is the probability that a randomly selected member of the population falls in the ith row category. Similarly, each $p_{\cdot j}$ is the probability that a randomly selected member of the population falls in the jth column category.


 

First Model for $r \times c$ Tables

An SRS of size n is drawn from a population. Each individual in the sample is classified according to two categorical variables. The probabilities for the row classification are $p_{i \cdot}$ and the probabilities for the column classification are $p_{\cdot j}$.

The null hypothesis is that the row and column classifications are independent; that is, there is no relationship between the row and column classifications. Letting $p_{ij}$ denote the probability of an observation being classified in row i and in column j, the null hypothesis is

$$H_0: p_{ij} = p_{i \cdot} \, p_{\cdot j} \quad \text{for all } i, j.$$

 

The alternative hypothesis is that the row and column classifications are dependent; that is, the row and column classifications are related in some way. We write this alternative as

$$H_a: p_{ij} \neq p_{i \cdot} \, p_{\cdot j} \quad \text{for some } i, j.$$

 

The second model is a natural extension of the comparison of two proportions we studied in Section 9.2. That is, the c populations are independently sampled and the number of possible outcomes in each population is r, where $r \geq 2$.


 

Second Model for $r \times c$ Tables

For each of c populations, independent SRSs of sizes $n_1, \ldots, n_c$ are drawn. Each individual in a sample is classified according to a categorical outcome variable with r possible values. For the jth population the probability that an individual will fall into category i is $p_{ij}$.

The null hypothesis is that the distributions of the outcome variable are the same in all c populations. With $p_{ij}$ denoting the proportion of population j in category i, the null hypothesis is

$$H_0: p_{i1} = p_{i2} = \cdots = p_{ic} \quad \text{for each } i = 1, \ldots, r.$$

The alternative hypothesis is $H_a$: at least one of the equalities in $H_0$ does not hold.

The sample sizes from each of the populations are the column totals in the sample count table. Call these sample sizes $n_{\cdot j}$. In the first model, the $n_{\cdot j}$ are random variables: the total sample size n is set by the researcher, and the column sums are known only after the data are collected. For the second model, the column sums are the sample sizes selected at the design phase of the research. The null hypothesis in both models says that there is no relationship between the column variable and the row variable. Although the hypothesis is expressed differently, the test of the hypothesis in each case is the same.


 

Expected Counts

The statistic that tests $H_0$ in $r \times c$ tables compares the sample counts with expected counts that are calculated under the assumption that $H_0$ is true. The expected count in the ijth cell of the table is denoted by $E_{ij}$. For an $r \times c$ table, the expected counts are calculated from the marginal totals in the sample count table using the formula

$$E_{ij} = \frac{n_{i \cdot} \, n_{\cdot j}}{n} = \frac{\text{row total} \times \text{column total}}{n}.$$
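The expected-count rule (row total times column total, divided by the overall sample size) can be written as a short Python sketch. The function name `expected_counts` and the table of counts are hypothetical and serve only as an illustration.

```python
def expected_counts(table):
    """Expected cell counts under H0: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Hypothetical 2 x 3 table of counts, for illustration only
counts = [[10, 20, 30],
          [30, 20, 10]]

expected = expected_counts(counts)
for row in expected:
    print(row)
```

Here every row total is 60 and every column total is 40, so each expected count is 60 * 40 / 120 = 20. Note that the expected table keeps the same row and column totals as the observed table.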


 

Significance Tests

To test $H_0$, that there is no relationship between the row and column classifications, a statistic called the chi-square statistic is used. This statistic compares the sample counts with their expected values. Specifically, for each cell we take the difference between the sample count and its expected count, square it, and divide by the expected count; we then sum over all cells. We denote this statistic by $X^2$. It is calculated from the following formula:

$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

where observed represents the sample counts, expected represents the expected counts, and the sum is over all $r \times c$ entries in the sample or expected count tables.

To test $H_0$, we need a distribution to compare $X^2$ to, under the assumption that $H_0$ is true. This leads us to the chi-squared distribution. The $\chi^2$ distribution is described by a single parameter, its degrees of freedom. Furthermore, the $\chi^2$ distribution is skewed to the right.

The data for an $r \times c$ table can be obtained by random sampling as described by either of the two models previously discussed.

The null hypothesis to be tested is that the row and column classifications are independent (first model) or that the row classification proportions for the c populations are all equal (second model). The alternative hypothesis is that the null hypothesis is not true.

The test statistic is the $X^2$ statistic

$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - E_{ij})^2}{E_{ij}}.$$

If $H_0$ is true, the statistic $X^2$ has approximately a $\chi^2$ distribution with (r - 1)(c - 1) degrees of freedom.

The p-value for the test is $P(\chi^2 \geq X^2)$, where $\chi^2$ is a random variable having the $\chi^2((r-1)(c-1))$ distribution. The approximation is based on having a large sample. The sample is judged large enough if the average of the expected counts is 5 or more, and the smallest expected count is 1 or more.
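The degrees of freedom and the large-sample rule of thumb just stated are easy to check mechanically. The sketch below is only an illustration: the function name `chi_square_setup` and both tables of expected counts are hypothetical.

```python
def chi_square_setup(expected):
    """Return (degrees of freedom, whether the large-sample rule of thumb is met)."""
    r, c = len(expected), len(expected[0])
    cells = [e for row in expected for e in row]
    df = (r - 1) * (c - 1)
    # Rule of thumb: average expected count >= 5 and smallest expected count >= 1
    large_enough = sum(cells) / len(cells) >= 5 and min(cells) >= 1
    return df, large_enough

# Hypothetical expected-count tables, for illustration only
ok_table = [[20.0, 20.0, 20.0],
            [20.0, 20.0, 20.0]]
sparse_table = [[0.5, 9.5],
                [9.5, 0.5]]

print(chi_square_setup(ok_table))      # 2 x 3 table with comfortable counts
print(chi_square_setup(sparse_table))  # average is 5 but one cell is below 1
```

The second table shows why both conditions matter: its average expected count is exactly 5, yet the smallest expected count is below 1, so the chi-square approximation would not be trusted.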

 

Summary

 
  * Two different models for generating $r \times c$ tables lead to the same analysis of two-way count data. In the first model, an SRS of size n is drawn from a population, and individuals are classified according to two categorical variables having r and c possible values. In the second model, independent SRSs of sizes $n_{\cdot j}$ are drawn from each of c populations, and each individual is classified according to a categorical variable with r possible values.
  * The null hypothesis is that there is no relationship between the column variable and row variable. In the first model, this means that the two variables are independent. In the second, it means that the distributions of the row categorical variable are the same for all c populations.
  * Expected counts are computed using the formula

$$E_{ij} = \frac{n_{i \cdot} \, n_{\cdot j}}{n}$$

    where $n_{i \cdot}$ is the ith row total and $n_{\cdot j}$ is the jth column total.

  * The null hypothesis is tested by the chi-square statistic

$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - E_{ij})^2}{E_{ij}}$$

 

  * Under the null hypothesis, $X^2$ has approximately the $\chi^2((r-1)(c-1))$ distribution. The p-value for the test is $P(\chi^2 \geq X^2)$, where $\chi^2$ is a random variable having the $\chi^2((r-1)(c-1))$ distribution.

EXAMPLE: Each of 250 job applicants at a large firm was classified in two ways: (1) whether or not they got a job offer; and (2) their ethnic group.

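[Table of observed counts: job offer status (2 rows) by ethnic group (3 columns); the counts are shown in the StataQuest output below.]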

 

Do the data indicate that receiving a job offer is independent of the ethnicity of the applicant? We obtain the following output from StataQuest:

 

           | col                                    
        row|       1       2       3 |  Total
-----------+-------------------------+-------
         1 |      24      13      18 |     55
         2 |     124      39      32 |    195
-----------+-------------------------+-------
      Total|     148      52      50 |    250
                                                        
          Pearson chi2(2) =   8.8688   Pr = 0.012

 
  * $H_0$: Receiving a job offer is independent of ethnicity.
  * $H_a$: There is a relationship between receiving a job offer and ethnicity.
  * Test statistic: $X^2$ = 8.869
  * p-value = 0.012
  * $H_0$ is rejected at the 0.05 level.
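The StataQuest figures can be cross-checked by hand with a short Python sketch that uses the counts from the output above. It relies on the fact that, for 2 degrees of freedom, the upper tail probability of the chi-square distribution has the closed form exp(-x/2); the variable names are illustrative only.

```python
import math

# Observed counts from the StataQuest table above (2 rows x 3 columns)
counts = [[24, 13, 18],
          [124, 39, 32]]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
n = sum(row_totals)

# Expected counts under independence: row total * column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

x2 = sum((o - e) ** 2 / e
         for obs_row, exp_row in zip(counts, expected)
         for o, e in zip(obs_row, exp_row))
df = (len(counts) - 1) * (len(counts[0]) - 1)

# With exactly 2 degrees of freedom, P(chi-square >= x) = exp(-x / 2)
p_value = math.exp(-x2 / 2)

print(f"X^2 = {x2:.4f}, df = {df}, p-value = {p_value:.3f}")
```

The result should agree with the Pearson chi2(2) value and Pr reported by StataQuest. The closed-form tail probability applies only when the degrees of freedom equal 2; other tables would need a chi-square table or a statistics library.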

 


Computer Lab

Applicable StataQuest Commands:

Summaries -> Tables -> Two-way (cross-tabulation)

 

Concept Lab

 
  * Ch 19: Chi-square Goodness of Fit Test

 


The webmaster and author of the Math Help site is Graeme McRae.