As you saw in the DataCamp video, we are interested in whether the treatment and control groups were equally likely to buy a DVD after reading the experimental statments.
library(tidyverse)
library(infer)
head(opportunity)
decision group
1 buyDVD control
2 buyDVD control
3 buyDVD control
4 buyDVD control
5 buyDVD control
6 buyDVD control
opportunity %>%
count(decision, group)
decision group n
1 buyDVD control 56
2 buyDVD treatment 41
3 nobuyDVD control 19
4 nobuyDVD treatment 34
# or
table(opportunity$decision, opportunity$group)
control treatment
buyDVD 56 41
nobuyDVD 19 34
# or
xtabs(~decision + group, data = opportunity)
group
decision control treatment
buyDVD 56 41
nobuyDVD 19 34
opportunity %>%
group_by(group) %>%
summarize(buy_prop = mean(decision == "buyDVD")) -> propDVD
propDVD
# A tibble: 2 × 2
group buy_prop
<fct> <dbl>
1 control 0.747
2 treatment 0.547
# or
prop.table(xtabs(~decision + group, data = opportunity), 2)
group
decision control treatment
buyDVD 0.7466667 0.5466667
nobuyDVD 0.2533333 0.4533333
prop.table(xtabs(~decision + group, data = opportunity), 2)[1, ]
control treatment
0.7466667 0.5466667
Again, interest is in whether the treatment and control groups were equally likely to buy a DVD after reading the experimental statements. Here, you’ll create a barplot to visualize the difference in proportions between the treatment and control groups.
Using the opportunity
dataset, plot group
fill
ed by decision
.
Call geom_bar()
to add a bar plot layer, setting the position to "fill"
to compare the relative frequencies. Note that fill should be devined by the decision
for each group.
ggplot(opportunity, aes(x = group, fill = decision)) +
geom_bar(position = "fill") +
theme_bw()
As in the previous chapter, you will permute the data to generate a distribution of differences as if the null hypothesis were true.
In the study, the number of individuals in each of the control and treatment groups is fixed. Additionally, when you assume that the null hypothesis is true—that is, the experiment had no effect on the outcome of buying a DVD—it is reasonable to infer that the number of individuals who would buy a DVD is also fixed. That is, 97 people were going to buy a DVD regardless of which treatment group they were in.
Using the new data and the methods from the previous chapter, create a randomization distribution of the difference in proportions calculated on permuted data.
Using the opportunity
dataset, calculate the observed difference in purchase rate.
Group by group
.
Summarize to calculate the proportion deciding to buy a DVD. That is, get the mean()
of cases of decision equalling "buyDVD"
.
Summarize again to calculate the diff()
erence of prop_buy
between groups.
# Calculate the observed difference in purchase rate
diff_obs <- opportunity %>%
# Group by group
group_by(group) %>%
# Calculate proportion deciding to buy a DVD
summarize(prop_buy = mean(decision == "buyDVD")) %>%
# Calculate difference between groups
summarize(stat = diff(prop_buy)) %>%
pull()
diff_obs
[1] -0.2
# or
diff_obs1 <- diff(prop.table(xtabs(~decision + group, data = opportunity), 2)[1, ])
diff_obs1
treatment
-0.2
Create a data frame of permuted differences in purchase rates.
Specify the model decision
vs. group
, with the success value "buyDVD"
.
Hypothesise "independence"
.
Generate 10,000
reps of type "permute"
.
Calculate the summary statistic "diff in props"
.
set.seed(14)
# Create data frame of permuted differences in purchase rates
opp_perm <- opportunity %>%
# Specify decision vs. group, where success is buying a DVD
specify(decision ~ group, success = "buyDVD") %>%
# Set the null hypothesis to independence
hypothesize(null = "independence") %>%
# Generate 10,000 reps of type permute
generate(reps = 10000, type = "permute") %>%
# Calculate the summary stat difference in proportions
calculate(stat = "diff in props", order = c("treatment", "control"))
# Review the result
opp_perm
Response: decision (factor)
Explanatory: group (factor)
Null Hypothesis: independence
# A tibble: 10,000 × 2
replicate stat
<int> <dbl>
1 1 -0.0400
2 2 -0.0400
3 3 -0.0133
4 4 0.147
5 5 -0.0133
6 6 0.0667
7 7 0.147
8 8 -0.0667
9 9 0.0133
10 10 0.0133
# … with 9,990 more rows
Draw a histogram of permuted differences.
Using the permutation dataset, opp_perm
, plot stat
.
Add a histogram layer with geom_histogram()
. The binwidth
is set to 0.005
.
Add a vertical line with geom_vline()
. The xintercept is at diff_obs
.
# Using the permuation data, plot stat
ggplot(opp_perm, aes(x = stat)) +
# Add a histogram layer with binwidth 0.005
geom_histogram(binwidth = 0.005) +
# Add a vline layer with intercept diff_obs
geom_vline(aes(xintercept = diff_obs), color = "red") +
theme_bw() +
labs(x = expression(hat(p)[treatment] - hat(p)[control]), y = "", title = "Simulation-Based Null Distribution")
Now that you’ve created the randomization distribution, you’ll use it to assess whether the observed difference in proportions is consistent with the null difference. You will measure this consistency (or lack thereof) with a p-value, or the proportion of permuted differences less than or equal to the observed difference.
The permuted dataset and the original observed statistic are available in your workspace as opp_perm
and diff_obs
respectively.
The p-value (which represents how often a null value is more extreme) is calculated by counting the number of null values which are less than the original difference.
visualize
the sampling distribution of the permuted statistics. Use shade_p_value()
to indicate the place where obs_stat = diff_obs
, and coloring in values below with the command direction = "less"
.# Visualize the statistic
opp_perm %>%
visualize() +
shade_p_value(obs_stat = diff_obs, direction = "less") +
theme_bw() +
labs(x = expression(hat(p)[treatment] - hat(p)[control]))
# Calculate the p-value using `get_p_value`
opp_perm %>%
get_p_value(obs_stat = diff_obs, direction = "less")
# A tibble: 1 × 1
p_value
<dbl>
1 0.0082
# or
opp_perm %>%
summarize(p_value = mean(stat <= diff_obs))
# A tibble: 1 × 1
p_value
<dbl>
1 0.0082
The small p-value indicates that the observed data are inconsistent with the null hypothesis. We should reject the null claim and conclude that financial advice does affect the likelihood of purchase.
opportunity <- opportunity %>%
mutate(decisionN = ifelse(decision == "buyDVD", 1, 0))
head(opportunity)
decision group decisionN
1 buyDVD control 1
2 buyDVD control 1
3 buyDVD control 1
4 buyDVD control 1
5 buyDVD control 1
6 buyDVD control 1
set.seed(16)
reps <- 10^4
stat <- numeric(reps)
for(i in 1:reps){
stat[i] <- diff(tapply(opportunity$decisionN,
sample(opportunity$group), mean))
}
hist()
.hist(stat, breaks = "Scott",
xlab = expression(hat(p)[treatmeant] - hat(p)[control]),
col = "pink", main = "Simulation-Based Null Distribution")
abline(v = diff_obs, col = "red")
ggplot2
functions.DF <- data.frame(stat = stat)
ggplot(data = DF, aes(x = stat)) +
geom_histogram(aes(fill = stat >= diff_obs), binwidth = .03, alpha = 0.5, color = "black") +
guides(fill = "none") +
scale_fill_manual(values = c("purple", "red")) +
labs(x = expression(hat(p)[treatment] - hat(p)[control]),
y = "",
title = "Simulation-Based Null Distribution") +
theme_bw()
(pvalue <- mean(stat <= diff_obs))
[1] 0.0096