1 Can A New Drug Reduce the Spread of Schistosomiasis?

Schistosomiasis (skis-tuh-soh-may’-uh-sis) is a disease in humans caused by parasitic flatworms called schistosomes (skis’-tuh-sohms). Schistosomiasis affects about 200 million people worldwide and is a serious problem in sub-Saharan Africa, South America, China, and Southeast Asia. The disease can cause death, but more commonly results in chronic and debilitating symptoms, arising primarily from the body’s immune reaction to parasite eggs lodged in the liver, spleen,and intestines.

Currently there is one drug, praziquantel (pray’-zee-kwan-tel), in common use for treatment of schistosomiasis; it is inexpensive and effective. However, many organizations are concerned about relying on a single drug to treat a serious disease that affects so many people worldwide. Drug resistance may have prompted an outbreak in the 1990s in Senegal, where cure rates were low. In 2007, several researchers published work on a promising drug called K11777, which, in theory, might treat schistosomiasis.

gender <- c(rep("Female", 10), rep("Male", 10)) 
group <- rep(rep(c("Treatment", "Control"), each = 5), 2)
worms <- c(1, 2, 2, 10, 7, 16, 10, 10, 7, 17, 3, 5, 9, 10, 6, 31, 26, 28, 13, 47)
schis <- data.frame(gender, group, worms)
head(schis, n = 3)
  gender     group worms
1 Female Treatment     1
2 Female Treatment     2
3 Female Treatment     2
library(ggplot2)
p <- ggplot(data = schis, aes(group, worms)) + 
  geom_point(position = "jitter", aes(color = group)) + 
  facet_grid(cols = vars(gender)) + 
  theme_bw()
p

  1. Use the previous graph to visually compare the number of worms for the treatment and control groups for both the male and female mice. Do each of the four groups appear to have a similar center and similar spread? Are there any outliers (extreme observations that don’t fit with the rest of the data)?

  2. Calculate appropriate summary statistics (e.g., the median, mean, and standard deviation) for each of the four groups. For the female mice, calculate the difference between the treatment and control means. Do the same for the male mice.

library(dplyr)
ans <- schis %>%
  group_by(gender, group) %>%
  summarize(Median = median(worms), Mean = mean(worms), SD = sd(worms))
ans
# A tibble: 4 × 5
# Groups:   gender [2]
  gender group     Median  Mean    SD
  <fct>  <fct>      <dbl> <dbl> <dbl>
1 Female Control       10  12    4.30
2 Female Treatment      2   4.4  3.91
3 Male   Control       28  29   12.2 
4 Male   Treatment      6   6.6  2.88

The descriptive analysis in Questions 1 and 2 points to a positive treatment effect: K11777 appears to have reduced the number of parasite worms in this sample. But descriptive statistics are usually only the first step in ascertaining whether an effect is real; we often conduct a significance test or create a confidence interval to determine if chance alone could explain the effect.

We will introduce the basic concepts of randomization tests in a setting where units (mice in this example) are randomly allocated to a treatment or control group. Using a significance test, we will decide if an observed treatment effect (the difference between the mean responses in the treatment and control) is “real” or if “random chance alone” could plausibly explain the observed effect. The null hypothesis states that “random chance alone” is the reason for the observed effect. In this initial discussion, the alternative hypothesis will be one-sided because we want to show that the true treatment mean (\(\mu_{treatment}\)) is less than the true control mean (\(\mu_{control}\)).

1.1 Statistical Inference Through a Randomization Test

Whether they take the form of significance tests or confidence intervals, inferential procedures rest on the fundamental question for inference: “What would happen if we did this many times?” Let’s unpack this question in the context of the female mice schistosomiasis. We observed a difference in means of 7.6 = 12 - 4.4 worms between the control and treatment groups. While we expect that this large difference reflects the effectiveness of the drug, it is possible that chance alone could explain this difference. This “chance alone” position is usually called the null hypothesis and includes the following assumptions:

  • The number of parasitic worms found in the liver naturally varies from mouse to mouse.
  • Whether or not the drug is effective, there clearly is variability in the responses of mice to the infestation of schistosomes.
  • Each group exhibits this variability, and even if the drug is not effective, some mice do better than others.
  • The only explanation for the observed difference of 7.6 worms in the means is that the random allocation randomly placed mice with larger numbers of worms in the control group and mice with smaller numbers of worms in the treatment group.

In this study, the null hypothesis is that the treatment has no effect on the average worm count, and it is denoted as

\(H_0: \mu_{control} = \mu_{treatment}\) Another way to write this hypothesis is

\(H_0:\) The treatment has no effect on average worm count. Or

\(H_0: \mu_{control} - \mu_{treatment} = 0\)

Alternative hypotheses can be “one-sided, greater-than” (as in this investigation), “one-sided, less-than” (the treatment causes a decrease in worm count), or “two-sided” (the treatment is different, in one direction or the other, from the mean). We chose to test a one-sided hypothesis because there is a clear research interest in one direction. In other words, we will take action (start using the drug) only if we can show that K11777 reduces worm count.

\(H_A: \mu_{control} - \mu_{treatment} > 0\)

The fundamental question for inference: Every statistical inference procedure is based on the question “How does what we observed in our data compare to what would happen if the null hypothesis were actually true and we repeated the process many times?”

For a randomization test comparing responses for the two groups, this question becomes “How does the observed difference between groups compare to what would happen if the treatments actually had no effect on the individual responses and we repeated the random allocation of individuals to groups many times?”

ND <- schis %>% 
  filter(gender == "Female")
ND
   gender     group worms
1  Female Treatment     1
2  Female Treatment     2
3  Female Treatment     2
4  Female Treatment    10
5  Female Treatment     7
6  Female   Control    16
7  Female   Control    10
8  Female   Control    10
9  Female   Control     7
10 Female   Control    17
ggplot(data = ND, aes(x = group, y = worms, color = group)) + 
  geom_point(position = "jitter") + 
  theme_bw()

ggplot(data = ND, aes(x = group, y = worms, color = group)) + 
  geom_boxplot() +
  geom_jitter() +
  theme_bw()

ND %>%
  group_by(group) %>%
  summarize(Mean = mean(worms), SD = sd(worms), 
            Median = median(worms), iqr = IQR(worms)) -> ans2
ans2
# A tibble: 2 × 5
  group      Mean    SD Median   iqr
  <fct>     <dbl> <dbl>  <dbl> <dbl>
1 Control    12    4.30     10     6
2 Treatment   4.4  3.91      2     5
-diff(ans2$Mean) -> obs_diff
obs_diff
[1] 7.6

1.2 Generating a Simulated Permutation Distribution

Worms <- ND$worms
sims <- 10^4-1
sim_diff <- numeric(sims)
for(i in 1:sims){
  index <- sample(10, 5, replace = FALSE)
  sim_diff[i] <- mean(Worms[index]) - mean(Worms[-index])
}
pvalue <- (sum(sim_diff >= obs_diff) + 1)/(sims + 1)
pvalue  # pvalue < alpha value -> reject HO
[1] 0.0262

1.2.1 Visualizing the Simulated Permutation Distribution

1.2.1.1 Using base R:

hist(sim_diff)

Improving the histogram with color, labels, and a title.

hist(sim_diff, col = "pink", main = "Simulated Permutation Distribution",
     xlab = expression(bar(x)[Control] - bar(x)[Treatment]))

Coloring in the p-value.

h <- hist(sim_diff, breaks = 20, plot = FALSE)
cuts <- (cut(h$breaks, c(-Inf,obs_diff, Inf)))
plot(h, col = cuts, main = "Simulated Permutation Distribution",
     xlab = expression(bar(x)[Control] - bar(x)[Treatment]))

1.2.1.2 Using ggplot

Basic histogram

 # Creating data frame DF
DF <- data.frame(sim_diff = sim_diff)
ggplot(data = DF, aes(x = sim_diff)) +
  geom_histogram() 

Adding some labels and coloring the p-value.

ggplot(data = DF, aes(x = sim_diff)) +
  geom_histogram(aes(fill = sim_diff >= 7.6), alpha = 0.6) + 
  guides(fill = "none") +
  scale_fill_manual(values = c("purple", "red")) +
  labs(x = expression(bar(x)[Control] - bar(x)[Treatment]), y = "", title = "Simulation-Based Null Distribution") +
  theme_bw() 

Creating a density plot:

x.dens <- density(sim_diff)
df.dens <- data.frame(x = x.dens$x, y = x.dens$y)
ggplot(data = DF, aes(x = sim_diff)) +
  geom_density(fill = "purple", alpha = 0.2) + 
  theme_bw() + 
  geom_area(data = subset(df.dens, x >= obs_diff & x <= max(sim_diff)), aes(x = x, y = y), fill = "red", alpha = .7) + 
  labs(x = expression(bar(x)[Control] - bar(x)[Treatment]), y = "", title = "Simulation-Based Null Distribution")

1.3 Using infer

In this section, we’ll now show you how to use the infer package to test a hypothesis. Steps include:

  1. specify() the variables of interest in your data frame.
  2. hypothesize() specify the null hypothesis.
  3. generate() number of replicates and type resampling to use.
  4. calculate() the summary statistic(s) of interest.
  5. visualize() the resulting distribution.

The infer process is graphically depicted in Figure 1.1.

Hypothesis testing with the `infer` package.

Figure 1.1: Hypothesis testing with the infer package.

library(infer)

ND %>% 
  specify(worms ~ group) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 10^4-1, type = "permute") %>% 
  calculate(stat = "diff in means", order = c("Control", "Treatment")) -> some
some
Response: worms (numeric)
Explanatory: group (factor)
Null Hypothesis: independence
# A tibble: 9,999 × 2
   replicate   stat
       <int>  <dbl>
 1         1  1.6  
 2         2  4.8  
 3         3  0    
 4         4 -0.400
 5         5 -2    
 6         6 -0.8  
 7         7 -1.6  
 8         8  0.400
 9         9  2    
10        10  7.6  
# … with 9,989 more rows
get_p_value(some, obs_stat = obs_diff, direction = "greater")
# A tibble: 1 × 1
  p_value
    <dbl>
1  0.0286
visualize(some) + 
  shade_p_value(obs_stat = obs_diff, direction = "greater") + 
  theme_bw() + 
  labs(x = expression(bar(x)[Control] - bar(x)[Treatment]))

1.4 Theoretical Permutation Distribution

# Given a vector of length n + m
# Take a resample of size m without replacement.
n <- 5
m <- 5
ncb <- choose(n + m, m)
ncb
[1] 252
CB <- t(combn(n + m, m))
head(CB)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    1    2    3    4    6
[3,]    1    2    3    4    7
[4,]    1    2    3    4    8
[5,]    1    2    3    4    9
[6,]    1    2    3    4   10
nn <- dim(CB)[1]
nn
[1] 252
#
ND$worms
 [1]  1  2  2 10  7 16 10 10  7 17
diffmeans <- numeric(nn)
for(i in 1:nn){
  diffmeans[i] <- mean(ND$worms[CB[i, ]]) - mean(ND$worms[-CB[i,]])
}
sort(diffmeans)
  [1] -8.8 -7.6 -7.6 -7.6 -7.6 -7.6 -7.6 -6.4 -6.4 -6.4 -5.6 -5.6 -5.6 -5.6 -5.6
 [16] -5.6 -5.2 -5.2 -5.2 -5.2 -5.2 -4.8 -4.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4
 [31] -4.4 -4.4 -4.4 -4.4 -4.4 -4.0 -4.0 -4.0 -4.0 -4.0 -4.0 -4.0 -4.0 -4.0 -3.6
 [46] -3.6 -3.6 -3.2 -3.2 -3.2 -3.2 -2.8 -2.8 -2.8 -2.8 -2.4 -2.4 -2.4 -2.4 -2.0
 [61] -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0
 [76] -2.0 -2.0 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6
 [91] -1.6 -1.6 -1.6 -1.6 -1.6 -1.2 -1.2 -1.2 -1.2 -1.2 -1.2 -1.2 -1.2 -1.2 -0.8
[106] -0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4
[121] -0.4 -0.4 -0.4  0.0  0.0  0.0  0.0  0.0  0.0  0.4  0.4  0.4  0.4  0.4  0.4
[136]  0.4  0.4  0.4  0.8  0.8  0.8  0.8  0.8  0.8  0.8  0.8  0.8  0.8  1.2  1.2
[151]  1.2  1.2  1.2  1.2  1.2  1.2  1.2  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6
[166]  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  2.0  2.0  2.0  2.0  2.0
[181]  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.4  2.4
[196]  2.4  2.4  2.8  2.8  2.8  2.8  3.2  3.2  3.2  3.2  3.6  3.6  3.6  4.0  4.0
[211]  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.4  4.4  4.4  4.4  4.4  4.4  4.4  4.4
[226]  4.4  4.4  4.4  4.4  4.8  4.8  5.2  5.2  5.2  5.2  5.2  5.6  5.6  5.6  5.6
[241]  5.6  5.6  6.4  6.4  6.4  7.6  7.6  7.6  7.6  7.6  7.6  8.8
sum(diffmeans >= obs_diff)
[1] 7
theo_pvalue <- mean(diffmeans >= obs_diff)
theo_pvalue
[1] 0.02777778
MASS::fractions(theo_pvalue)
[1] 1/36
#
# Write a function!
rdtf <- function(x, y){
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  nx <- length(x)
  ny <- length(y)
  cv <- c(x, y)
  nn <- choose(nx + ny, nx)
  CB <- t(combn(nx + ny, nx))
  DM <- numeric(nn)
  for(i in 1:nn){
    DM[i] <- mean(cv[CB[i, ]]) - mean(cv[-CB[i, ]])
  }
  sort(DM)
}
#
rdtf(c(16, 10, 10, 7, 17), c(1, 2, 2, 10, 7))
  [1] -8.8 -7.6 -7.6 -7.6 -7.6 -7.6 -7.6 -6.4 -6.4 -6.4 -5.6 -5.6 -5.6 -5.6 -5.6
 [16] -5.6 -5.2 -5.2 -5.2 -5.2 -5.2 -4.8 -4.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4
 [31] -4.4 -4.4 -4.4 -4.4 -4.4 -4.0 -4.0 -4.0 -4.0 -4.0 -4.0 -4.0 -4.0 -4.0 -3.6
 [46] -3.6 -3.6 -3.2 -3.2 -3.2 -3.2 -2.8 -2.8 -2.8 -2.8 -2.4 -2.4 -2.4 -2.4 -2.0
 [61] -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0
 [76] -2.0 -2.0 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6
 [91] -1.6 -1.6 -1.6 -1.6 -1.6 -1.2 -1.2 -1.2 -1.2 -1.2 -1.2 -1.2 -1.2 -1.2 -0.8
[106] -0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4
[121] -0.4 -0.4 -0.4  0.0  0.0  0.0  0.0  0.0  0.0  0.4  0.4  0.4  0.4  0.4  0.4
[136]  0.4  0.4  0.4  0.8  0.8  0.8  0.8  0.8  0.8  0.8  0.8  0.8  0.8  1.2  1.2
[151]  1.2  1.2  1.2  1.2  1.2  1.2  1.2  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6
[166]  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  2.0  2.0  2.0  2.0  2.0
[181]  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.4  2.4
[196]  2.4  2.4  2.8  2.8  2.8  2.8  3.2  3.2  3.2  3.2  3.6  3.6  3.6  4.0  4.0
[211]  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.4  4.4  4.4  4.4  4.4  4.4  4.4  4.4
[226]  4.4  4.4  4.4  4.4  4.8  4.8  5.2  5.2  5.2  5.2  5.2  5.6  5.6  5.6  5.6
[241]  5.6  5.6  6.4  6.4  6.4  7.6  7.6  7.6  7.6  7.6  7.6  8.8
x <- as.numeric(names(table(diffmeans)))
px <- table(diffmeans)/nn
DF2 <- data.frame(x = x, px = px)

ggplot(data = DF2, aes(x = x, y = px)) +
  geom_linerange(aes(x = x, ymin = 0, ymax = px), size = 0.5, color = "purple") + 
  geom_point(color = "skyblue3", size = 2) + 
  theme_bw() + 
  labs(title = "Permutation Distribution", 
       x = expression(bar(x)[Control] - bar(x)[Treatment]), 
       y = "P(X = x)")

1.5 More on infer

Test whether the average age of respondents in the gss is the same for those who have completed a college degree versus those who have not completed a college degree.

\(H_0: \mu_{college} - \mu_{college^c} = 0\)

\(H_A: \mu_{college} - \mu_{college^c} \neq 0\)

data(gss)
gss %>% 
  group_by(college) %>% 
  summarize(Mean = mean(age), n = n()) -> ans3
ans3
# A tibble: 2 × 3
  college    Mean     n
  <fct>     <dbl> <int>
1 no degree  39.9   326
2 degree     40.9   174
obs_diff <- diff(ans3$Mean) # degree - no degree

gss %>% 
  specify(age ~ college) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 10^4-1, type = "permute") %>% 
  calculate(stat = "diff in means", order = c("degree", "no degree")) -> some
some
Response: age (numeric)
Explanatory: college (factor)
Null Hypothesis: independence
# A tibble: 9,999 × 2
   replicate   stat
       <int>  <dbl>
 1         1  0.262
 2         2  0.808
 3         3  2.42 
 4         4  0.923
 5         5 -0.725
 6         6  1.32 
 7         7 -1.73 
 8         8  0.271
 9         9 -1.95 
10        10  0.623
# … with 9,989 more rows
get_p_value(some, obs_stat = obs_diff, direction = "two-sided")
# A tibble: 1 × 1
  p_value
    <dbl>
1   0.453
visualize(some) + 
  shade_p_value(obs_stat = obs_diff, direction = "two-sided") + 
  theme_bw() + 
  labs(x = expression(bar(x)[Degree] - bar(x)[NoDegree]))

1.6 Testing with a for loop

N <- 10^4 - 1
stat <- numeric(N)
for(i in 1:N){
 stat[i] <- diff(tapply(gss$age, sample(gss$college), mean)) 
}
hist(stat)
abline(v = c(obs_diff, - obs_diff), col = "red")

pvalue <- (sum(stat >= obs_diff)*2 + 1)/(N + 1)
pvalue
[1] 0.4685

1.7 Note

Since there are 326 respondents with no degree and 174 respondents with a degree there are a total of \(7.8299747\times 10^{138}\) possible combinations. If the computer could enumerate 1,000,000 combinations a second, it would take \(2.4828687\times 10^{125}\) years to list all of the possible combinations. That is not happening since the age of the universe is supposedly \(13.82 \times 10^9\) years old.

1.8 Your Turn

  • Test whether males and females work the same number of hours per week using the self reported data from the General Social Survey for 2018 and 1976.
gss %>% 
  filter(year == 2018) %>% 
  group_by(sex) %>% 
  summarize(Mean = mean(hours), n = n()) -> ans4
ans4
# A tibble: 2 × 3
  sex     Mean     n
  <fct>  <dbl> <int>
1 male    43.9    10
2 female  37.9     7
obs_diff <- diff(ans4$Mean) # female - male
obs_diff
[1] -6.042857
gss %>% 
  filter(year == 2018) %>% 
  specify(hours ~ sex) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 10^4-1, type = "permute") %>% 
  calculate(stat = "diff in means", order = c("female", "male")) -> some
some
Response: hours (numeric)
Explanatory: sex (factor)
Null Hypothesis: independence
# A tibble: 9,999 × 2
   replicate  stat
       <int> <dbl>
 1         1 -2.4 
 2         2  2.70
 3         3 -3.61
 4         4 -6.29
 5         5 16.1 
 6         6  2.70
 7         7 10.2 
 8         8  8.29
 9         9 -1.19
10        10  5.37
# … with 9,989 more rows
get_p_value(some, obs_stat = obs_diff, direction = "two-sided")
# A tibble: 1 × 1
  p_value
    <dbl>
1   0.425
visualize(some) + 
  shade_p_value(obs_stat = obs_diff, direction = "two-sided") + 
  theme_bw()

gss %>% 
  filter(year == 2018) -> gss2018
N <- 10^4 - 1
stat <- numeric(N)
for(i in 1:N){
 stat[i] <- diff(tapply(gss2018$hours, sample(gss2018$sex), mean)) 
}
hist(stat)
abline(v = c(obs_diff, - obs_diff), col = "red")

pvalue <- (sum(stat <= obs_diff)*2 + 1)/(N + 1)
pvalue
[1] 0.4347

Coloring the p-value.

h <- hist(stat, breaks = 20, plot = FALSE)
cuts <- cut(h$breaks, c(-Inf, obs_diff, -obs_diff, Inf))
mojo <- factor(cuts, labels = c(1, 3, 1))
plot(h, col = mojo, main = "Simulated Permutation Distribution",
     xlab = expression(bar(x)[Female] - bar(x)[Male]))

gss %>% 
  filter(year == 1976) %>% 
  group_by(sex) %>% 
  summarize(Mean = mean(hours), n = n()) -> ans5
ans5
# A tibble: 2 × 3
  sex     Mean     n
  <fct>  <dbl> <int>
1 male    53.9     8
2 female  40.1     7
obs_diff <- diff(ans5$Mean) # female - male
obs_diff
[1] -13.73214
gss %>% 
  filter(year == 1976) %>% 
  specify(hours ~ sex) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 10^4-1, type = "permute") %>% 
  calculate(stat = "diff in means", order = c("female", "male")) -> some
some
Response: hours (numeric)
Explanatory: sex (factor)
Null Hypothesis: independence
# A tibble: 9,999 × 2
   replicate   stat
       <int>  <dbl>
 1         1 -5.96 
 2         2 15.2  
 3         3  0.732
 4         4 -6.5  
 5         5 -2.21 
 6         6 -0.607
 7         7 -1.14 
 8         8 12.0  
 9         9 -8.38 
10        10 -4.62 
# … with 9,989 more rows
get_p_value(some, obs_stat = obs_diff, direction = "two-sided")
# A tibble: 1 × 1
  p_value
    <dbl>
1  0.0324
visualize(some) + 
  shade_p_value(obs_stat = obs_diff, direction = "two-sided") + 
  theme_bw() + 
  labs(x = expression(bar(x)[Female] - bar(x)[Male]))

gss %>% 
  filter(year == 1976) -> gss1976
N <- 10^4 - 1
stat <- numeric(N)
for(i in 1:N){
 stat[i] <- diff(tapply(gss1976$hours, sample(gss1976$sex), mean)) 
}
hist(stat)
abline(v = c(obs_diff, - obs_diff), col = "red")

pvalue <- (sum(stat <= obs_diff)*2 + 1)/(N + 1)
pvalue
[1] 0.0329

1.9 More Tests

Consider the data set Verizon from the resampledata package. Test the following:

\(H_0: \mu_{clec} - \mu_{ilec} = 0\)

versus

\(H_0: \mu_{clec} - \mu_{ilec} > 0\)

library(resampledata)
head(Verizon)
   Time Group
1 17.50  ILEC
2  2.40  ILEC
3  0.00  ILEC
4  0.65  ILEC
5 22.23  ILEC
6  1.20  ILEC
Verizon %>% 
  group_by(Group) %>% 
  summarize(Mean = mean(Time), n = n()) -> ans5
ans5
# A tibble: 2 × 3
  Group  Mean     n
  <fct> <dbl> <int>
1 CLEC  16.5     23
2 ILEC   8.41  1664
-diff(ans5$Mean) -> obs_diff # mean_clec - mean_ilec
obs_diff
[1] 8.09752
reps <- 10^4 - 1
### Using infer
Verizon %>% 
  specify(Time ~ Group) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = reps, type = "permute") %>% 
  calculate(stat = "diff in means", order = c("CLEC", "ILEC")) -> permdist
visualize(permdist) + 
  theme_bw() + 
  shade_p_value(obs_stat = obs_diff, direction = "greater")

get_p_value(permdist, obs_stat = obs_diff, direction = "greater")
# A tibble: 1 × 1
  p_value
    <dbl>
1  0.0169
#### Or
x.dens <- density(permdist$stat)
df.dens <- data.frame(x = x.dens$x, y = x.dens$y)
head(df.dens)
          x            y
1 -8.739466 1.104814e-06
2 -8.687396 1.587874e-06
3 -8.635326 2.243858e-06
4 -8.583257 3.119106e-06
5 -8.531187 4.267078e-06
6 -8.479117 5.783048e-06
ggplot(data = permdist, aes(x = stat, y = ..density..)) + 
  geom_density(fill = "pink", alpha = 0.3) + 
  theme_bw() + 
  geom_area(data = subset(df.dens, x >= obs_diff & x <= max(permdist$stat)), aes(x = x, y = y), fill = "blue", alpha = .7) + 
  labs(x = expression(bar(x)[CLEC] - bar(x)[ILEC]), y = "", title = "Simulation-Based Null Distribution")

reps <- 10^4 - 1
diff_mean <- numeric(reps)
for(i in 1:reps){
  diff_mean[i] <- -diff(tapply(Verizon$Time, sample(Verizon$Group), mean))
}
hist(diff_mean, main = "Simulation-Based Null Distribution")
abline(v = obs_diff, col = "blue")

pvalue <- (sum(diff_mean >= obs_diff) + 1)/(reps + 1)
pvalue
[1] 0.0191