12 Chi-Square Test of Independence
The last statistical test that we studied (ANOVA) involved the relationship between a categorical explanatory variable (\(X\)) and a quantitative response variable (\(Y\)). Next, we will consider inferences about the relationships between two categorical variables, corresponding to case \(C \rightarrow C\).
In our earlier work on graphing, we summarized the relationship between two categorical variables for a given data set, without trying to generalize beyond the sample data.
Now we will perform statistical inference for two categorical variables, using the sample data to draw conclusions about whether or not we have evidence that the variables are related in the larger population from which the sample was drawn. In other words, we would like to assess whether the relationship between \(X\) and \(Y\) that we observed in the data is due to a real relationship between \(X\) and \(Y\) in the population, or if it is something that could have happened just by chance due to sampling variability.
The statistical test that will answer this question is called the chi-square test of independence. Chi is a Greek letter that looks like this: \(\chi\), so the test is sometimes referred to as the \(\chi^2\) test of independence.
Let’s start with an example.
In the early 1970s, a young man challenged an Oklahoma state law that prohibited the sale of 3.2% beer to males under age 21 but allowed its sale to females in the same age group. The case (Craig v. Boren, 429 U.S. 190 [1976]) was ultimately heard by the U.S. Supreme Court.
The main justification provided by Oklahoma for the law was traffic safety. One of the three main pieces of data presented to the Court was the result of a “random roadside survey” that recorded each driver's gender and whether or not the driver had been drinking alcohol in the previous two hours. A total of 619 drivers under 20 years of age were included in the survey.
The following two-way table summarizes the observed counts in the roadside survey:
| | No | Yes | Sum |
|---|---|---|---|
| Female | 122.00 | 16.00 | 138.00 |
| Male | 404.00 | 77.00 | 481.00 |
| Sum | 526.00 | 93.00 | 619.00 |
The following code shows how to read the data into a matrix, then convert the matrix to a table, then to a data frame named DF.
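library(vcdExtra) # provides expand.dft()
# matrix() fills column by column: the "Yes" column (77, 16), then the "No" column (404, 122)
MAT <- matrix(data = c(77, 16, 404, 122), nrow = 2)
dimnames(MAT) <- list(Gender = c("Male","Female"), DroveDrunk = c("Yes", "No"))
TMAT <- as.table(MAT) # convert the matrix to a table
DFTMAT <- as.data.frame(TMAT) # convert to a data frame with one row per cell and a Freq column
DF <- vcdExtra::expand.dft(DFTMAT) # expand to one row per driver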
xtabs(~Gender + DroveDrunk, data = DF)
        DroveDrunk
Gender    No Yes
  Female 122  16
  Male   404  77
addmargins(xtabs(~Gender + DroveDrunk, data = DF))
        DroveDrunk
Gender    No Yes Sum
  Female 122  16 138
  Male   404  77 481
  Sum    526  93 619
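As a cross-check on the steps above, the same one-row-per-driver data frame can be built with base R alone. This is only a sketch of an alternative route, not the approach used in this chapter; the names `counts`, `DF_base`, and `T_base` are hypothetical and chosen so that `DF` is not overwritten.

```r
# Observed counts from the roadside survey, one row per cell of the two-way table.
counts <- data.frame(
  Gender     = c("Male", "Female", "Male", "Female"),
  DroveDrunk = c("Yes",  "Yes",    "No",   "No"),
  Freq       = c(77, 16, 404, 122)
)

# Repeat each row index Freq times to obtain one row per driver,
# keeping only the two categorical variables.
DF_base <- counts[rep(seq_len(nrow(counts)), times = counts$Freq),
                  c("Gender", "DroveDrunk")]
rownames(DF_base) <- NULL

# Rebuild the two-way table from the case-level data and add the margins.
T_base <- xtabs(~ Gender + DroveDrunk, data = DF_base)
addmargins(T_base)
```

If the counts were entered correctly, `nrow(DF_base)` should equal the survey total of 619, and the table should match the one produced from `DF`.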
Our task is to assess whether these results provide evidence of a significant (“real”) relationship between gender and drunk driving.
The following figure summarizes this example:
As the figure stresses, since we want to know whether drunk driving is related to gender, our explanatory variable (\(X\)) is gender, and the response variable (\(Y\)) is drunk driving. Both variables are two-valued categorical variables, and therefore our two-way table of observed counts is 2-by-2. The chi-square procedure that we are going to introduce here is not limited to 2-by-2 situations; it can be applied to any \(r\)-by-\(c\) situation, where \(r\) is the number of rows (corresponding to the number of values of one of the variables) and \(c\) is the number of columns (corresponding to the number of values of the other variable).
Before we introduce the chi-square test, let’s conduct an exploratory data analysis (that is, look at the data to get an initial feel for it). By doing that, we will also get a better conceptual understanding of the role of the test.
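One possible first look at the data, offered here as a minimal sketch rather than as the chapter's own analysis, is to compare the conditional percentages of the response within each gender. The object names `obs` and `row_props` are hypothetical; the counts are re-entered directly from the two-way table so the snippet runs on its own.

```r
# Observed counts from the two-way table, entered row by row.
obs <- matrix(c(122, 16,
                404, 77),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("Female", "Male"),
                              DroveDrunk = c("No", "Yes")))

# Row proportions: within each gender, the share of drivers in each
# DroveDrunk category (margin = 1 conditions on the rows).
row_props <- prop.table(obs, margin = 1)

# Express as percentages rounded to one decimal place.
round(100 * row_props, 1)
```

In this sample, about 11.6% of the female drivers (16 of 138) and about 16.0% of the male drivers (77 of 481) had been drinking in the previous two hours. Whether a difference of this size could plausibly arise from sampling variability alone is the question the chi-square test is designed to address.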