
Categorical Data Analysis
In Week 7 we saw how to compare two population proportions, $p_1$ and $p_2$. In this section we consider proportions for more than two 0-1 populations. If we have $K$ such populations and random samples of sizes $n_1, \ldots, n_K$ from them, then typically we want to test whether the true population proportions $p_1, \ldots, p_K$ equal some hypothesized values $p_1^0, \ldots, p_K^0$. The most common example is testing whether all the proportions are the same.
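If the null hypothesis is true, then we would expect to get $E_i = n_i p_i^0$ 1's (that is, 'successes') in the ith sample, where $n_i$ is the ith sample size and $p_i^0$ is the hypothesized proportion. If we let $O_i$ denote the actual number of 1's in the ith sample, then we can measure how far the observed data are from what we expect under the null hypothesis by the test statistic $$X^2 = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i}.$$
A large value of this statistic is evidence against the null hypothesis; a small value is not.
We reject the null hypothesis at significance level $\alpha$ if $X^2 > \chi^2_{\alpha, K}$, the upper-$\alpha$ critical value of the chi-square distribution with $K$ degrees of freedom.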
A slightly different situation from the previous one arises when we have a random sample of $n$ objects, each of which falls into exactly one of $K$ 'categories' (for example, roll a die 60 times; each roll shows one of 6 values), and for the ith category we observe $O_i$ occurrences among the $n$ objects. Now we let $p_i^0$ represent the hypothesized proportion of objects in the whole population that fall into the ith category. We then expect $E_i = n p_i^0$ occurrences of category $i$, and we can measure the distance between observed and expected results by the same statistic as above. The only difference in the procedure is that the degrees of freedom is now $K - 1$ rather than $K$.
An example of this single-sample situation is when we have a random sample $X_1, \ldots, X_n$ from a continuous population and we want to test whether the population has a particular distribution (such as normal). In this case we can divide the range of the data into $K$ intervals and count how many of the X's fall into each interval. For example, if the hypothesis is that the X's come from a uniform distribution on the interval [0,1], we could divide [0,1] into the 10 intervals [0,.1), [.1,.2), and so on, and then count how many X's are in each interval. Again we call the observed counts $O_1, \ldots, O_K$.
From the hypothesized distribution, we can calculate how many X's should be in each interval (for the uniform example, 10% of the X's should fall in each interval). Again we have $E_i = n p_i$, where $p_i$ is the probability under the hypothesized distribution that an X falls in the ith interval.
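The binning step for the uniform example can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name `binned_counts` is hypothetical, and the 200 data values are simulated, not taken from any real data set.

```python
import random

def binned_counts(xs, k):
    """Count how many values in [0, 1] fall into each of k equal-width intervals."""
    counts = [0] * k
    for x in xs:
        counts[min(int(x * k), k - 1)] += 1  # clamp x == 1.0 into the last bin
    return counts

random.seed(0)                                # simulated (hypothetical) sample
xs = [random.random() for _ in range(200)]

k = 10
observed = binned_counts(xs, k)               # the observed counts O_i
expected = [len(xs) / k] * k                  # E_i = n * p_i with p_i = 1/10

# Chi-square statistic: sum of (O_i - E_i)^2 / E_i over the k intervals
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = k - 1                                    # no parameters were estimated
print(observed, round(chi_sq, 3), df)
```

Because the uniform distribution on [0,1] is fully specified, no parameters are estimated and the statistic is referred to a chi-square distribution with $K - 1 = 9$ degrees of freedom.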
In some cases, we need to estimate the parameters of the hypothesized distribution. In testing for normality, for example, we need to know the mean and variance of the population in order to find the $p_i$'s. If we use the sample mean $\bar{X}$ and the sample variance $s^2$ as estimates of the true mean and variance, then we must further reduce the degrees of freedom of the statistic by 2 (one for each estimated parameter), leaving $K - 3$.
EXAMPLE: To see if there is a seasonal effect for homicide, 1361 crimes were classified into the four seasons: 334 happened in spring, 372 in summer, 327 in fall and 328 in winter. Do we have enough evidence to show that the crime frequencies differ between seasons? Let $p_1, p_2, p_3, p_4$ be the proportions of crimes for the four seasons, respectively. The null hypothesis of no seasonal effect is $H_0: p_1 = p_2 = p_3 = p_4 = 1/4$.
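One way to carry out the calculation for this example is sketched below in Python. The 5% significance level is an assumption made here for illustration (the example does not state a level), and 7.815 is the standard table value for the upper 5% point of the chi-square distribution with 3 degrees of freedom.

```python
# Observed homicide counts by season, from the example above.
observed = {"spring": 334, "summer": 372, "fall": 327, "winter": 328}

n = sum(observed.values())      # total number of crimes
k = len(observed)               # number of categories
expected = n / k                # under H0 each season has proportion 1/4

# Chi-square statistic: sum over seasons of (observed - expected)^2 / expected
chi_sq = sum((o - expected) ** 2 / expected for o in observed.values())
df = k - 1                      # K - 1 degrees of freedom

# Assumed 5% level: upper 5% point of chi-square with 3 df (table value).
CRITICAL_05_DF3 = 7.815
reject = chi_sq > CRITICAL_05_DF3

print(f"X^2 = {chi_sq:.3f} on {df} df; reject H0 at 5%: {reject}")
# prints: X^2 = 4.035 on 3 df; reject H0 at 5%: False
```

At the assumed 5% level the statistic falls below the critical value, so these data do not give enough evidence of a seasonal effect.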
We now turn to the case where we have objects that can be categorized in two ways (such as gender and political party).
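A two-way table with $r$ rows and $c$ columns contains $r \times c$ sample counts. Let $n_{ij}$ denote the number of observations in the ith row and the jth column, $i = 1, \ldots, r$, $j = 1, \ldots, c$. In its general form, the data table has the counts $n_{ij}$ in its body, the row totals in the right-hand margin, and the column totals in the bottom margin.
In this table $n_{\cdot j}$ are the column totals and $n_{i\cdot}$ are the row totals. Thus, $$n_{\cdot j} = \sum_{i=1}^{r} n_{ij}, \qquad n_{i\cdot} = \sum_{j=1}^{c} n_{ij}.$$
If the total sample size is $n$, then $$n = \sum_{i=1}^{r} n_{i\cdot} = \sum_{j=1}^{c} n_{\cdot j} = \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij}.$$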
For most tables, a better understanding of the information in the table is obtained by examining the column proportions: each count in the jth column is divided by that column's total, giving $n_{ij} / n_{\cdot j}$, where $n_{ij}$ is the count in row $i$ and column $j$ and $n_{\cdot j}$ is the jth column total. The resulting entries in each column form the conditional distribution of the row variable given that value of the column variable. Note that the entries in each column should sum to 1 (apart from round-off error). Tables giving entries as proportions of row totals are also useful. Which description to use depends on the particular set of data being analyzed and on the questions of interest.
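As a small illustration of column proportions, the Python sketch below uses the 2 x 3 counts from the job-offer example later in this section. The variable names are chosen here for illustration only.

```python
# Sample counts: 2 rows, 3 columns (from the job-offer example in this section).
counts = [
    [24, 13, 18],
    [124, 39, 32],
]

# Column totals: sum each column over the rows.
col_totals = [sum(col) for col in zip(*counts)]

# Column proportions: each count divided by its column total.
col_props = [
    [n_ij / col_totals[j] for j, n_ij in enumerate(row)]
    for row in counts
]

for row in col_props:
    print("  ".join(f"{p:.3f}" for p in row))
# prints:
# 0.162  0.250  0.360
# 0.838  0.750  0.640
```

Each printed column is the conditional distribution of the row variable for that column, so each column sums to 1.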
The test procedure for count data is general enough to be valid under different assumptions about how the data were collected. The one assumption that must always hold, however, is that each experimental unit is counted only once in the data table.
In the first model, a single SRS is taken from a single population and each observation is classified into one cell of an $r \times c$ table. The population proportions can be summarized in a table of the same shape, with the proportion $p_{ij}$ in row $i$ and column $j$ and the marginal proportions in the margins.
The marginal proportions in this table are the sums of the proportions in the rows and columns. Here the $p_{i\cdot} = \sum_{j} p_{ij}$ are the row sums and the $p_{\cdot j} = \sum_{i} p_{ij}$ are the column sums. The marginal proportions are easily interpreted as probabilities. Each $p_{i\cdot}$ is the probability that a randomly selected member of the population falls in the ith row category. Similarly, each $p_{\cdot j}$ is the probability that a randomly selected member of the population falls in the jth column category.
An SRS of size $n$ is drawn from a population. Each individual in the sample is classified according to two categorical variables. The probabilities for the row classification are $p_{1\cdot}, \ldots, p_{r\cdot}$ and the probabilities for the column classification are $p_{\cdot 1}, \ldots, p_{\cdot c}$.
The null hypothesis is that the row and column classifications are independent; that is, there is no relationship between the row and column classifications. Letting $p_{ij}$ denote the probability of an observation being classified in row $i$ and in column $j$, the null hypothesis is $$H_0: p_{ij} = p_{i\cdot}\, p_{\cdot j} \quad \text{for all } i, j.$$
The alternative hypothesis is that the row and column classifications are dependent; that is, the row and column classifications are related in some way. We write this alternative as $$H_a: p_{ij} \neq p_{i\cdot}\, p_{\cdot j} \quad \text{for at least one pair } (i, j).$$
The second model is a natural extension of the comparison of two proportions we studied in Section 9.2. That is, the $c$ populations are independently sampled and the number of possible outcomes in each population is $r$, where $r \geq 2$.
For each of $c$ populations, independent SRSs of sizes $n_{\cdot 1}, \ldots, n_{\cdot c}$ are drawn. Each individual in a sample is classified according to a categorical outcome variable with $r$ possible values. For the jth population the probability that an individual will fall into category $i$ is $p_{ij}$.
The null hypothesis is that the distributions of the outcome variable are the same in all $c$ populations. Letting $p_{ij}$ denote the proportion of population $j$ in category $i$, the null hypothesis is $$H_0: p_{i1} = p_{i2} = \cdots = p_{ic} \quad \text{for } i = 1, \ldots, r.$$
The alternative hypothesis is $H_a$: at least one of the equalities in $H_0$ does not hold.
The sample sizes from each of the populations are the column totals in the sample count table. Call these sample sizes $n_{\cdot 1}, \ldots, n_{\cdot c}$. In the first model, the $n_{\cdot j}$ are random variables: the total sample size $n$ is set by the researcher, and the column sums are known only after the data are collected. For the second model, the column sums are the sample sizes selected at the design phase of the research. The null hypothesis in both models says that there is no relationship between the column variable and the row variable. Although the hypothesis is expressed differently, the test of the hypothesis in each case is the same.
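The statistic that tests $H_0$ in $r \times c$ tables compares the sample counts with expected counts that are calculated under the assumption that $H_0$ is true. The expected count in the ijth cell of the table is denoted by $E_{ij}$. For an $r \times c$ table, the expected counts are calculated from the marginal totals in the sample count table using the formula $$E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n} = \frac{\text{row total} \times \text{column total}}{n},$$ where $n_{i\cdot}$ is the ith row total, $n_{\cdot j}$ is the jth column total, and $n$ is the total sample size.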
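As a minimal sketch, the usual expected-count calculation, (row total x column total) / n, can be written in Python as below. The function name `expected_counts` is a hypothetical helper, and the counts are those of the job-offer example later in this section.

```python
def expected_counts(table):
    """Expected counts for a two-way table: (row total * column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Sample counts from the job-offer example (2 rows, 3 columns).
observed = [
    [24, 13, 18],
    [124, 39, 32],
]
expected = expected_counts(observed)

for row in expected:
    print("  ".join(f"{e:.2f}" for e in row))
# prints:
# 32.56  11.44  11.00
# 115.44  40.56  39.00
```

The expected counts have the same row and column totals as the observed counts, which is a quick check on the arithmetic.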
To test $H_0$, that there is no relationship between the row and column classifications, we use a statistic $X^2$ called the chi-square statistic. This statistic compares the sample counts with their expected values. Specifically, for each cell we take the difference between the sample count and its expected count, square it, and divide by the expected count; we then sum over all entries. That is, $$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}},$$
where "observed" represents the sample counts, "expected" represents the expected counts, and the sum is over all entries in the sample or expected count tables.
To test $H_0$, we need a distribution to compare the statistic to, under the assumption that $H_0$ is true. This leads us to the chi-square ($\chi^2$) distribution. The distribution is described by a single parameter, its degrees of freedom. Furthermore, the distribution is skewed to the right.
The data for an $r \times c$ table can be obtained by random sampling as described by either of the two models previously discussed.
The null hypothesis to be tested is that the row and column classifications are independent (first model) or that the row classification proportions for the c populations are all equal (second model). The alternative hypothesis is that the null hypothesis is not true.
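The test statistic is the chi-square statistic $$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n},$$ where $n_{ij}$ is the sample count in row $i$ and column $j$, $n_{i\cdot}$ is the ith row total and $n_{\cdot j}$ is the jth column total.
If $H_0$ is true, the statistic $X^2$ has approximately a $\chi^2$ distribution with $(r - 1)(c - 1)$ degrees of freedom.
The p-value for the test is $P(\chi^2 \geq X^2)$, where $\chi^2$ is a random variable having the $\chi^2$ distribution with $(r - 1)(c - 1)$ degrees of freedom. The approximation is based on having a large sample. The sample is judged large enough if the average of the expected counts is 5 or more, and the smallest expected count is 1 or more.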
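The degrees of freedom and the large-sample rule of thumb stated above (average expected count at least 5, smallest expected count at least 1) can be checked with a few lines of Python. This is a sketch only: the function names are hypothetical, the first table holds the expected counts for the job-offer example in this section, and the second table is made up to show a failing case.

```python
def degrees_of_freedom(r, c):
    """Degrees of freedom for the chi-square test in an r x c table."""
    return (r - 1) * (c - 1)

def large_sample_ok(expected):
    """Rule of thumb: average expected count >= 5 and smallest >= 1."""
    cells = [e for row in expected for e in row]
    return sum(cells) / len(cells) >= 5 and min(cells) >= 1

# Expected counts for the 2 x 3 job-offer table in this section.
job_expected = [
    [32.56, 11.44, 11.00],
    [115.44, 40.56, 39.00],
]

# Made-up expected counts with one cell below 1 (hypothetical).
sparse_expected = [
    [0.8, 3.2],
    [3.2, 12.8],
]

print(degrees_of_freedom(2, 3))           # 2
print(large_sample_ok(job_expected))      # True
print(large_sample_ok(sparse_expected))   # False
```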
EXAMPLE: Each of 250 job applicants at a large firm was classified in two ways: (1) whether or not they got a job offer; and (2) their ethnic group.
Do the data indicate that receiving a job offer is independent of the ethnicity of the applicant? We obtain the following output from StataQuest:
           |          col
       row |      1      2      3 |  Total
-----------+----------------------+-------
         1 |     24     13     18 |     55
         2 |    124     39     32 |    195
-----------+----------------------+-------
     Total |    148     52     50 |    250

Pearson chi2(2) = 8.8688   Pr = 0.012
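The StataQuest figures can be reproduced by hand. The Python sketch below is not the StataQuest procedure itself, only an independent check using the counts in the output above. It relies on the fact that for exactly 2 degrees of freedom the upper tail of the chi-square distribution has the closed form $P(\chi^2 \geq x) = e^{-x/2}$, so it would need a general chi-square tail function for tables of other sizes.

```python
import math

# Sample counts from the StataQuest output (2 rows, 3 columns).
observed = [
    [24, 13, 18],
    [124, 39, 32],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (observed - expected)^2 / expected over all cells,
# with expected = (row total * column total) / n.
chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        chi_sq += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (r - 1)(c - 1) = 2

# Closed-form upper tail, valid only because df == 2 here.
p_value = math.exp(-chi_sq / 2)

print(f"Pearson chi2({df}) = {chi_sq:.4f}   Pr = {p_value:.3f}")
# prints: Pearson chi2(2) = 8.8688   Pr = 0.012
```

Since the p-value of 0.012 is below the conventional 0.05 level, the data give evidence against independence of job offer and ethnic group.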
Applicable StataQuest Commands:
Summaries > Tables > Two-way (crosstabulation)
The webmaster and author of this Math Help site is Graeme McRae.