Using R Markdown allows us to write both text and code in the same document. Use R code chunks to insert code:
```{r}
# Some code
x <- 1:10
```
Use inline R to write answers inline using the following format: `r R_CODE`. For example, to compute the mean of the values 1, 3, 5, and 7, one might can use `r mean(c(1, 3, 5, 7))`. The mean of 1, 3, 5, and 7 is 4.
To use functions in packages such as psych, one must either specify the package by prepending the function with the package name and two colons or load the package using the command library(PackageName).
Consider using the function describe on the mtcars data set.
describe(mtcars) # psych has not been loaded!
Error in describe(mtcars): could not find function "describe"
psych::describe(mtcars)
vars n mean sd median trimmed mad min max range skew
mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.40 33.90 23.50 0.61
cyl 2 32 6.19 1.79 6.00 6.23 2.97 4.00 8.00 4.00 -0.17
disp 3 32 230.72 123.94 196.30 222.52 140.48 71.10 472.00 400.90 0.38
hp 4 32 146.69 68.56 123.00 141.19 77.10 52.00 335.00 283.00 0.73
drat 5 32 3.60 0.53 3.70 3.58 0.70 2.76 4.93 2.17 0.27
wt 6 32 3.22 0.98 3.33 3.15 0.77 1.51 5.42 3.91 0.42
qsec 7 32 17.85 1.79 17.71 17.83 1.42 14.50 22.90 8.40 0.37
vs 8 32 0.44 0.50 0.00 0.42 0.00 0.00 1.00 1.00 0.24
am 9 32 0.41 0.50 0.00 0.38 0.00 0.00 1.00 1.00 0.36
gear 10 32 3.69 0.74 4.00 3.62 1.48 3.00 5.00 2.00 0.53
carb 11 32 2.81 1.62 2.00 2.65 1.48 1.00 8.00 7.00 1.05
kurtosis se
mpg -0.37 1.07
cyl -1.76 0.32
disp -1.21 21.91
hp -0.14 12.12
drat -0.71 0.09
wt -0.02 0.17
qsec 0.34 0.32
vs -2.00 0.09
am -1.92 0.09
gear -1.07 0.13
carb 1.26 0.29
qsechist(mtcars$qsec, col = "blue", freq = FALSE,
main = "Histogram of time to travel quarter mile",
xlab = "time in seconds")
Mean <- mean(mtcars$qsec)
Mean
[1] 17.84875
SD <- sd(mtcars$qsec)
SD
[1] 1.786943
The distribution of qsec is unimodal and symmetric with a mean of 17.85 seconds and a standard deviation of 1.79 seconds.
hist(mtcars$qsec, col = "blue", freq = FALSE,
main = "Histogram of time to travel quarter mile",
xlab = "time in seconds", xlim = c(13, 23))
curve(dnorm(x, Mean, SD), 13, 23, col = "purple", add = TRUE, lwd = 3)
ggplot2library(ggplot2)
ggplot(data = mtcars, aes(x = qsec, ..density..)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black") +
xlim(Mean - 3.5*SD, Mean + 3.5*SD) +
labs(x = "Time in seconds") +
geom_density(fill = "red", alpha = 0.4) +
stat_function(fun = dnorm, args = list(mean = Mean, sd = SD),
inherit.aes = FALSE, size = 2, color = "purple") +
theme_bw()
Hypotheses — State the null and alternative hypotheses.
Test Statistic
Rejection Region Calculations
Statistical Conclusion
English Conclusion
A bottled water company acquires its water from two independent sources, X and Y. The company suspects that the sodium content in the water from source X is less than the sodium content from source Y. An independent agency measures the sodium content in 20 samples from source X and 10 samples from source Y and stores them in data frame WATER of the PASWR2 package. Is there statistical evidence to suggest the average sodium content in the water from source X is less than the average sodium content in Y?
Solution: To solve this problem, start by verifying the reasonableness of the normality assumption.
library(PASWR2) # load the PASWR2 package
library(ggplot2) # load the ggplot2 package
library(lsr) # load the lsr package
library(DescTools) # load the DescTools package
boxplot(sodium ~ source, data = WATER)
ggplot(data = WATER, aes(x = source, y = sodium)) +
geom_boxplot() +
theme_bw()
LeveneTest(sodium ~ source, data = WATER)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 10.033 0.003697 **
28
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(data = WATER, aes(sample = sodium, color = source)) +
stat_qq() +
theme_bw()
Hypotheses — \(H_0: \mu_X - \mu_Y = 0\) versus \(H_1: \mu_X - \mu_Y <0\)
Test Statistic —The test statistic is \(\bar{X} - \bar{Y}\). The standardized test statistic under the assumptioon that \(H_0\) is true and its approximate distribution are
\[\frac{\left[(\bar{X}-\bar{Y} - \delta_0) \right]}{\sqrt{\frac{S_X^2}{n_x}+\frac{S_Y^2}{n_Y}}} \overset{\bullet}{\sim} t_{\nu}\]
TR <- t.test(sodium ~ source, data = WATER, alternative = "less")
TR
Welch Two Sample t-test
data: sodium by source
t = -1.8589, df = 22.069, p-value = 0.03822
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -0.3665724
sample estimates:
mean in group x mean in group y
76.4 81.2
Statistical Conclusion — Since the p-value is 0.0382165, reject the null hypothesis.
English Conclusion — There is evidence to suggest the average sodium content for source X is less than the average sodium content for source Y.
CohenD(WATER$x, WATER$y, na.rm = TRUE)
[1] -0.5205894
attr(,"magnitude")
[1] "medium"
cohensD(formula = sodium ~ source, data = WATER)
[1] 0.5205894
library(dplyr)
NDF <- WATER %>%
group_by(source) %>%
summarize(Mean = mean(sodium), VAR = var(sodium), n = n())
NDF
# A tibble: 2 x 4
source Mean VAR n
<fctr> <dbl> <dbl> <int>
1 x 76.4 122.778947 20
2 y 81.2 5.288889 10
sp <- sqrt((122.77*19 + 5.28*9)/(20 + 10 - 2))
sp
[1] 9.219835
C_D <- (76.4 - 81.2)/sp
C_D
[1] -0.5206167