If SEX and DEATHPEN are independent, the expected count \(E_{ij}\) for any cell is the row total times the column proportion, which is equivalent to the column total times the row proportion: \[E_{ij} = R_i(C_j/n) = C_j(R_i/n) = R_iC_j/n.\]
If SEX and DEATHPEN are independent, then the observed and the expected values for all cells should be similar. The degree of similarity is measured with the statistic \[\sum_{cells}\dfrac{(OBS - EXP)^2}{EXP}\]
The hypotheses to be tested are written as follows:
\(H_0:\) Death penalty opinion is independent of gender (there is no association between death penalty opinion and gender).
\(H_A:\) Death penalty opinion is dependent on gender (there is an association between death penalty opinion and gender).
R
interludeConsider how the outer()
function works, which we will use to compute expected values.
outer(1:3, 1:3)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 4 6
[3,] 3 6 9
outer(1:3, 3:1)
[,1] [,2] [,3]
[1,] 3 2 1
[2,] 6 4 2
[3,] 9 6 3
outer(1:3, 3:1, "+")
[,1] [,2] [,3]
[1,] 4 3 2
[2,] 5 4 3
[3,] 6 5 4
outer(1:3, 3:1, "^")
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 8 4 2
[3,] 27 9 3
The expected counts for T1
are
EC <- function(TAB){
outer(apply(TAB,1 , sum), apply(TAB, 2, sum))/sum(TAB)
}
EC(T1)
Strongly Agree Agree Neither Agree nor Disagree Disagree
Male 233.2394 113.5962 37.14554 44.92019
Female 306.7606 149.4038 48.85446 59.07981
Strongly Disagree
Male 31.09859
Female 40.90141
To remove all of the NA
values from DPS
, use na.omit()
as follows.
str(DPS)
'data.frame': 45974 obs. of 3 variables:
$ CASEID : int 1 3 4 5 6 7 8 9 10 11 ...
$ DEATHPEN: Factor w/ 5 levels "Strongly Agree",..: NA NA NA NA NA NA NA NA NA NA ...
$ SEX : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 1 1 2 2 2 ...
DPSC <- na.omit(DPS)
str(DPSC)
'data.frame': 1065 obs. of 3 variables:
$ CASEID : int 26266 26267 26268 26270 26271 26272 26274 26275 26278 26279 ...
$ DEATHPEN: Factor w/ 5 levels "Strongly Agree",..: 1 1 2 1 2 3 2 1 1 2 ...
$ SEX : Factor w/ 2 levels "Male","Female": 2 2 1 2 1 1 2 2 1 1 ...
- attr(*, "na.action")=Class 'omit' Named int [1:44909] 1 2 3 4 5 6 7 8 9 10 ...
.. ..- attr(*, "names")= chr [1:44909] "1" "2" "3" "4" ...
Note that DPSC
has 1065 observations where DPS
has 45974 observations, but 44909 of the observations in DPS
are NA
. To perform a permutation test, we will need to use the cleaned up version of the data frame DPSC
.
To perform a permutation test for independence between two categorical variables, the data will need to be stored in two columns. For example, if a contingency tables has 1065 entries, that data will need to be stored in an object with 1065 rows and a column for each categorical variable. If the null hypothesis that SEX
and DEATHPEN
are independent is correct, then we could permute either the SEX
or DEATHPEN
values, and any other permutation would be equally likely. For each permutation resample, we will obtain a contingency table and compute the statistic
\[\sum_{cells}\frac{(OBS - EXP)^2}{EXP}\]
for that permutation. Note that for every resample, the row and column totals in the contingency table are the same; only the counts in the table change.
set.seed(123)
addmargins(table(DPSC$SEX, sample(DPSC$DEATHPEN)))
Strongly Agree Agree Neither Agree nor Disagree Disagree
Male 256 107 27 48
Female 284 156 59 56
Sum 540 263 86 104
Strongly Disagree Sum
Male 22 460
Female 50 605
Sum 72 1065
addmargins(table(DPSC$SEX, sample(DPSC$DEATHPEN)))
Strongly Agree Agree Neither Agree nor Disagree Disagree
Male 230 115 38 43
Female 310 148 48 61
Sum 540 263 86 104
Strongly Disagree Sum
Male 34 460
Female 38 605
Sum 72 1065
addmargins(xtabs(~sample(SEX) + DEATHPEN, data = DPSC))
DEATHPEN
sample(SEX) Strongly Agree Agree Neither Agree nor Disagree Disagree
Male 232 117 35 47
Female 308 146 51 57
Sum 540 263 86 104
DEATHPEN
sample(SEX) Strongly Disagree Sum
Male 29 460
Female 43 605
Sum 72 1065
addmargins(xtabs(~sample(SEX) + DEATHPEN, data = DPSC))
DEATHPEN
sample(SEX) Strongly Agree Agree Neither Agree nor Disagree Disagree
Male 218 126 38 46
Female 322 137 48 58
Sum 540 263 86 104
DEATHPEN
sample(SEX) Strongly Disagree Sum
Male 32 460
Female 40 605
Sum 72 1065
To perform the permutation test, randomly permute one of the categorical variables. Each time the categorical variable is permuted, compute the statistic \(\sum_{cells}\frac{(OBS - EXP)^2}{EXP}\) on the resulting contingency table and store the result. Repeat this process a large number of times. The \(p\)-value will be the fraction of times the simulated statistics equal or exceed the value of the observed statistic.
N <- 10^4 - 1 # Change this for slower computers
result <- numeric(N)
set.seed(3)
for(i in 1:N){
# T2 <- table(sample(DPSC$SEX), DPSC$DEATHPEN)
T2 <- xtabs(~sample(SEX) + DEATHPEN, data =DPSC)
result[i] <- chisq.test(T2)$statistic
}
obs <- chisq.test(xtabs(~SEX + DEATHPEN, data = DPSC))$statistic
pvalue <- (sum(result >= obs) + 1)/(N + 1)
pvalue
[1] 0.0096
pvalueCH <- chisq.test(xtabs(~SEX + DEATHPEN, data = DPSC))$p.value
pvalueCH
[1] 0.009966395
# Or chisq.test(DPSC$SEX, DPSC$DEATHPEN)$p.value
hist(result, breaks = "Scott", col = "pink", freq=FALSE, main = "")
curve(dchisq(x, 4), 0, 20, add=TRUE, col = "red", lwd = 4)
library(ggplot2)
DF <- data.frame(x = result)
p <- ggplot(data = DF, aes(x= x)) + geom_density(fill = "pink") + theme_bw()
p + stat_function(fun = dchisq, args = list(df = 4), color = "red", lwd = 1)
The simulated permutation \(p\)-value is 0.0096. The \(p\)-value that is returned from the chisq.test()
is 0.0099664. In this case, the two \(p\)-values are fairly similar. This will not always be the case.
Some times, you will have only access to data that has been summarized (contingency tables). To get your data into the needed format, you may want to use the following function written by Marc Schwartz which will take a contingency table and convert it to a flat file. The function expand.dft()
is also in the package vcdExtra
.
expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".") {
# Take each row in the source data frame table and replicate it
# using the Freq value
DF <- sapply(1:nrow(x),
function(i) x[rep(i, each = x$Freq[i]), ],
simplify = FALSE)
# Take the above list and rbind it to create a single DF
# Also subset the result to eliminate the Freq column
DF <- subset(do.call("rbind", DF), select = -Freq)
# Now apply type.convert to the character coerced factor columns
# to facilitate data type selection for each column
for (i in 1:ncol(DF)) {
DF[[i]] <- type.convert(as.character(DF[[i]]),
na.strings = na.strings,
as.is = as.is, dec = dec)
}
DF
}
The x=
argument to expand.dft()
is a table that has been converted to a data frame. In the following example, a matrix of values is created that resembles a contingency table. To convert the matrix to a table, we use the function as.table()
.To convert the table to a data frame, we use the function as.data.frame()
. Finally, we apply the expand.dft()
function to the contingency table that was converted to a data frame and stored in the object HADF
.
HA <- c(110, 277, 50, 163, 302, 63)
HAM <- matrix(data = HA, nrow = 2, byrow = TRUE)
dimnames(HAM) <- list(Gender = c("Male", "Female"), Giddy = c("Very Happy", "Pretty Happy", "Not to Happy"))
HAM
Giddy
Gender Very Happy Pretty Happy Not to Happy
Male 110 277 50
Female 163 302 63
HAT <- as.table(HAM)
HAT
Giddy
Gender Very Happy Pretty Happy Not to Happy
Male 110 277 50
Female 163 302 63
addmargins(HAT)
Giddy
Gender Very Happy Pretty Happy Not to Happy Sum
Male 110 277 50 437
Female 163 302 63 528
Sum 273 579 113 965
HADF <- as.data.frame(HAT)
HATflatfile <- vcdExtra::expand.dft(HADF)
head(HATflatfile)
Gender Giddy
1 Male Very Happy
2 Male Very Happy
3 Male Very Happy
4 Male Very Happy
5 Male Very Happy
6 Male Very Happy
str(HATflatfile)
'data.frame': 965 obs. of 2 variables:
$ Gender: Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ Giddy : Factor w/ 3 levels "Not to Happy",..: 3 3 3 3 3 3 3 3 3 3 ...
From the http://sda.berkeley.edu/cgi-bin/hsda?harcsda+gss10 page, click Download > Customized Subset. Click in the radio button to the left of CSV file (Comma Separated Values with header record). Type DEATHPEN DEGREE
in the Enter names of individual variables (original or created) to include box. Click in the radio button All for CASE IDENTIFICATION AND YEAR. Scroll to the bottom of the page, and click the Continue button. Review the page that appears to make sure you have selected the desired variables; then, click the Create the Files
button. You should have two hot links: the Data file
and the Codebook
. By clicking on the Data file
link, the information should open in your browser. To read the data into R
, copy the URL and store it as follows.
# site <- "http://sda.berkeley.edu/TMPDIR/AAO0mzEh.csv"
# download.file(url = site, destfile = "../data/dpy.csv")
dpy <- read.csv(file = "../data/dpy.csv")
str(dpy)
'data.frame': 55087 obs. of 5 variables:
$ CASEID : int 1 2 3 4 5 6 7 8 9 10 ...
$ DEATHPEN: int 0 0 0 0 0 0 0 0 0 0 ...
$ DEGREE : int 3 0 1 3 1 1 1 3 1 1 ...
$ YEAR : int 1972 1972 1972 1972 1972 1972 1972 1972 1972 1972 ...
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
xtabs(~DEATHPEN + DEGREE, data = dpy)
DEGREE
DEATHPEN 0 1 2 3 4 8 9
0 12106 27564 2854 7450 3595 30 129
1 120 381 30 79 29 0 3
2 78 168 16 52 12 0 1
3 20 63 4 14 3 0 0
4 23 74 9 22 9 0 1
5 12 36 4 20 12 0 0
8 17 23 2 11 5 0 1
9 3 2 0 0 0 0 0
What is the only YEAR these two questions were asked?
There are several ways to answer this question, here is the first one that comes to mind.
YQ <- subset(x = dpy, subset = DEATHPEN != 0 & DEGREE != 0)
head(YQ)
CASEID DEATHPEN DEGREE YEAR ID
26266 26266 1 1 1991 1
26267 26267 1 4 1991 2
26268 26268 2 4 1991 3
26269 26269 8 3 1991 4
26270 26270 1 1 1991 5
26273 26273 1 3 1991 8
xtabs(~YEAR, data = YQ)
YEAR
1991
1086
To satisfy the curious, the answer is 1991.
women <- c(35, 146)
men <- c(8, 97)
stuff <- rbind(women, men)
dimnames(stuff) <- list(Gender =c("Women", "Men"), Diet=c("Yes", "No"))
stuff
Diet
Gender Yes No
Women 35 146
Men 8 97
stuffT <- as.table(stuff)
stuffDF <- as.data.frame(stuffT)
head(stuffDF)
Gender Diet Freq
1 Women Yes 35
2 Men Yes 8
3 Women No 146
4 Men No 97
DFL <- vcdExtra::expand.dft(stuffDF)
head(DFL)
Gender Diet
1 Women Yes
2 Women Yes
3 Women Yes
4 Women Yes
5 Women Yes
6 Women Yes
set.seed(2)
N <- 10^4 - 1 # Change this for slower computers
result <- numeric(N)
for (i in 1:N) {
T2 <- xtabs(~sample(Gender) + Diet, data = DFL)
result[i] <- chisq.test(T2)$statistic
}
obs <- chisq.test(xtabs(~ Gender + Diet, data = DFL))$statistic
pvalue <- (sum(result >= obs) + 1)/(N + 1)
pvalue
[1] 0.0092
The major difference between testing for homogeneity and testing for independence is that you will have samples from two or more populations when testing for homogeneity. Recall that we only had one sample when we tested for independence. Consider testing whether the proportion of boys is the same as the proportions of girls that favor three different flavors of candy. That is, \(H_0: \pi_{B1} = \pi_{G1}, \pi_{B2} = \pi_{G2}, \pi_{B3} =\pi_{G3}\) versus \(H_A:\) at least one of the equalities does not hold.
candy <- c(42, 20, 38, 33, 27, 50)
candyM <- matrix(data = candy, nrow = 2, byrow = TRUE)
dimnames(candyM) <- list(Gender = c("Boys", "Girls"), Flavor = c("Flavor 1", "Flavor 2", "Flavor 3"))
candyM
Flavor
Gender Flavor 1 Flavor 2 Flavor 3
Boys 42 20 38
Girls 33 27 50
candyT <- as.table(candyM)
candyT
Flavor
Gender Flavor 1 Flavor 2 Flavor 3
Boys 42 20 38
Girls 33 27 50
addmargins(candyT)
Flavor
Gender Flavor 1 Flavor 2 Flavor 3 Sum
Boys 42 20 38 100
Girls 33 27 50 110
Sum 75 47 88 210
candyDF <- as.data.frame(candyT)
candyflatfile <- vcdExtra::expand.dft(candyDF)
head(candyflatfile)
Gender Flavor
1 Boys Flavor 1
2 Boys Flavor 1
3 Boys Flavor 1
4 Boys Flavor 1
5 Boys Flavor 1
6 Boys Flavor 1
str(candyflatfile)
'data.frame': 210 obs. of 2 variables:
$ Gender: Factor w/ 2 levels "Boys","Girls": 1 1 1 1 1 1 1 1 1 1 ...
$ Flavor: Factor w/ 3 levels "Flavor 1","Flavor 2",..: 1 1 1 1 1 1 1 1 1 1 ...
E <- chisq.test(candyT)$expected
E
Flavor
Gender Flavor 1 Flavor 2 Flavor 3
Boys 35.71429 22.38095 41.90476
Girls 39.28571 24.61905 46.09524
obsstat <- chisq.test(candyT)$statistic
obsstat
X-squared
3.290187
# Now we will run a permutation test.
N <- 10^4 - 1 # Change this for slower computers
result <- numeric(N)
set.seed(1)
for (i in 1:N) {
T2 <- xtabs(~sample(Gender) + Flavor, data = candyflatfile)
result[i] <- chisq.test(T2)$statistic
}
pvalue <- (sum(result >= obsstat) + 1)/(N + 1)
pvalue
[1] 0.1941
In this case, there is no evidence to suggest the proportions of boys favoring flavors one, two, and three is any different than the proportions of girls favoring flavors one, two and three.
A quality engineer has taken 50 samples of size 13 each from a production process. The numbers of defectives for these samples are given in the code. Test the null hypothesis at an \(\alpha =0.05\) level that the number of defective follows (a) the Poisson distribution, (b) the binomial distribution.
numberDefectives <- c(0, 1, 2, 3, 4, 5, "6 or more")
numberSamples <- c(10, 24, 10, 4, 1, 1, 0)
names(numberSamples) <- numberDefectives
numberSamples
0 1 2 3 4 5 6 or more
10 24 10 4 1 1 0
Since no parameters are specified, they must be estimated from the data in order to carry out the test in both (a) and (b).
See Wikipedia for a summary of various probability distributions. The Poisson pdf is \(P(X =x|\lambda) = \frac{\lambda e^{-\lambda}}{x!}\) with \(E[X] =\lambda\). An estimate of the mean number of defectives in the 50 samples is 1.3.
muest <- sum(0:5*numberSamples[-7]/sum(numberSamples[-7]))
muest
[1] 1.3
weighted.mean(x = 0:5, w = numberSamples[-7]/sum(numberSamples[-7]))
[1] 1.3
Using \(\hat{\lambda} = 1.3\), we compute the probabilities for each category and subsequent expected values.
ps <- dpois(0:4, muest)
p5m <- 1 - ppois(4, muest)
psf <- c(ps, p5m)
psf
[1] 0.27253179 0.35429133 0.23028937 0.09979206 0.03243242 0.01066303
exh <- psf*50
exh
[1] 13.6265897 17.7145665 11.5144683 4.9896029 1.6216209 0.5331517
((numberSamples[-7] - exh)^2/exh)
0 1 2 3 4 5
0.9651830 2.2301801 0.1991941 0.1962709 0.2382879 0.4087905
stat <- sum((numberSamples[-7] - exh)^2/exh)
stat
[1] 4.237906
epvalue <- pchisq(stat, 6-1-1, lower = FALSE)
epvalue
[1] 0.3747651
Since the \(p\)-value is 0.3747651, we can not reject the null hypothesis that the distribution follows a Poisson distribution. That is, there is no evidence to suggest the distribution is something other than the Poisson distribution.
One really needs the expected cell counts to be at least five. Consider collapsing the number of defectives greater than or equal to 3 to a single category.
ps <- dpois(0:2, muest)
p3m <- 1 - ppois(2, muest)
psf <- c(ps, p3m)
psf
[1] 0.2725318 0.3542913 0.2302894 0.1428875
exh <- psf*50
exh
[1] 13.626590 17.714567 11.514468 7.144376
((numberSamples[-c(7, 6, 5)] + c(0, 0, 0, 2) - exh)^2/exh)
0 1 2 3
0.9651830 2.2301801 0.1991941 0.1833044
stat <- sum((numberSamples[-c(7, 6, 5)] + c(0, 0, 0, 2) - exh)^2/exh)
stat
[1] 3.577862
epvalue2 <- pchisq(stat, 4 - 1 - 1, lower = FALSE)
epvalue2
[1] 0.1671388
# Could use the following but the degrees of freedom will be incorrect for this test!
chisq.test(c(10, 24, 10, 6), p = psf)
Chi-squared test for given probabilities
data: c(10, 24, 10, 6)
X-squared = 3.5779, df = 3, p-value = 0.3108
Note the substantial drop in \(p\)-value (0.1671388) although the final conclusion is still the same.
For part (b), the null hypothesis is that the number of defectives in each sample of 13 follows the binomial distribution with \(n = 13\) and \(\pi\) equal the probability of a defective in any sample. An estimate of \(\pi\), \(\hat{\pi}\) is the total number of defectives (65) divided by the total number of observations (650). That is \(\hat{\pi} = 0.1\).
pihat <- 65/650
ps <- dbinom(0:2, 13, pihat)
p3m <- 1 - pbinom(2, 13, pihat)
psf <- c(ps, p3m)
psf
[1] 0.2541866 0.3671584 0.2447723 0.1338828
exh <- psf*50
exh
[1] 12.709329 18.357920 12.238613 6.694138
((numberSamples[-c(7, 6, 5)] + c(0, 0, 0, 2) - exh)^2/exh)
0 1 2 3
0.57756506 1.73402370 0.40947362 0.07197749
stat <- sum((numberSamples[-c(7, 6, 5)] + c(0, 0, 0, 2) - exh)^2/exh)
stat
[1] 2.79304
epvalue3 <- pchisq(stat, 4 - 1 - 1, lower = FALSE)
epvalue3
[1] 0.2474566
This example illustrates a common result with chi-square goodness-of-fit tests, i.e., that each to two (or more) different null hypotheses may be accepted for the same data set. Obviously, the true distribution cannot be both binomial and Poisson at the same time which is why the conclusion is that there is not sufficient evidence to suggest the alternative. This does not make the null hypothesis true!