R
with the RStudio
Integrated Development Editor—Ideas, Pointers, and TeasersThis article provides ideas, pointers, and teasers for using R
(R Core Team (2017)) with the RStudio
(RStudio Team (2016)) integrated development editor (IDE). R
is an interactive programming and graphics environment (“R FAQ” 2017) that is the lingua franca for research statisticians; however, R
is not just for research statisticians! Data analysts (Vance (2009)) and an ever growing number of scientists from outside of the statistical sciences (Tippmann (2015)) are using R
as their computational engine of choice. Because R
is free, it has become a very attractive tool to use in high school mathematics and statistics courses as well. The remainder of the article assumes the user has already installed both R
and RStudio
.
R
inside RStudio
R
can be run as a stand alone; however, it is more convenient to run R
inside an IDE, and one of more popular and user friendly IDEs is the RStudio
IDE. When the RStudio
IDE is launched on a MAC, it appears as shown in Figure 2.1.
The left pane in Figure 2.1 is known as the Console pane. The user may enter commands at the R prompt (>
) or use an R script
to maintain a record of all commands sent to R
.
R
as a calculatorBasic arithmetic may be performed at the R
command prompt much like using a calculator. The command prompt is shown below; but, the reader only needs to enter commands after the R
prompt, not type the prompt. The pound or hash symbol (#
) is used to comment code in R
.
> 2 + 3 # add two numbers
[1] 5
> (5 - 3)/3 # subtraction and division
[1] 0.6666667
> 5^3 # raise 5 to the third power
[1] 125
> 2 * pi # 2 times Pi
[1] 6.283185
There are thousands of mathematical and statistical functions available in R
. A few R
functions that all work in a similar fashion by including arguments inside parentheses are shown below.
> sqrt(49) # square root of 49
[1] 7
> exp(2) # e raised to the second power
[1] 7.389056
> sin(pi/2) # sin of Pi/2
[1] 1
Complete the free DataCamp online Introduction to R to learn more about R
.
In this example, the exponential probability density function defined in (2.1) with \(\lambda = 1/10\)
\[\begin{equation} f(x) = \begin{cases} \lambda e^{-\lambda x} & x \geq 0\\ 0 & x < 0 \end{cases} \tag{2.1} \end{equation}\]is used to compute numerically the expected value and the variance of \(X\). The expected value of a continuous random variable \(X\) is defined as
\[E(X) = \int_{-\infty}^{\infty}x\cdot f(x)\, dx,\] while the variance of a random variable \(X\) is defined as
\[\text{Var}(X) = \int_{-\infty}^{\infty}(x - \mu_X)^2 \cdot f(x)\, dx.\]
The integrate()
function is used first to verify the function f
is a valid probability density function. To create a function in R
, provide a name (f
in the example below) for the function, use the keyword function(x)
, then define the function inside curly braces ({}
). To learn more about writing functions, consider the DataCamp course Writing Functions in R.
> f <- function(x){1/10*exp(-x/10)}
> integrate(f, 0, Inf) # Show f integrates to 1
1 with absolute error < 0.00011
\[\int_{0}^{\infty}\tfrac{1}{10}e^{-\tfrac{x}{10}} \, dx = 1\]
> xfx <- function(x){x * 1/10*exp(-x/10)}
> EX <- integrate(xfx, 0, Inf)$value # E(X)
> EX
[1] 10
\[E(X) = \int_{0}^{\infty}x \cdot \tfrac{1}{10}e^{-\tfrac{x}{10}} \, dx = 10\]
> vx <- function(x){(x - EX)^2 * 1/10*exp(-x/10)}
> VX <- integrate(vx, 0, Inf)$value
> VX
[1] 100
\[\text{Var}(X) = \int_{0}^{\infty}(x - 10)^2 \cdot \tfrac{1}{10}e^{-\tfrac{x}{10}} \, dx = 100\]
Note: R
has already defined the exponential probability density function (dexp
). The function f
was used to show the reader how to define a function. If one is actually working with an exponential probability density function, it is best to use the functions R
has already defined.
> integrate(dexp, rate= 1/10, lower = 0, upper = Inf) # show valid pdf
1 with absolute error < 0.00011
> integrate(dexp, rate = 1/10, 1, 2) # P(1 < X < 2)
0.08610666 with absolute error < 9.6e-16
> # Or
> a2 <- pexp(2, 1/10) - pexp(1, 1/10)
> a2
[1] 0.08610666
\[\int_1^2\tfrac{1}{10}e^{-\tfrac{x}{10}} = 0.0861067\]
To enter data as a vector, one may use the c()
function as shown next.
> weight <- c(60, 43, 54, 46, 51) # weight of children in pounds
> mean(weight) # mean weight
[1] 50.8
> sd(weight) # standard deviation of weight
[1] 6.685806
> IQR(weight) # Interquartile range of weight
[1] 8
> fivenum(weight) # Tukey five number summary
[1] 43 46 51 54 60
To reduce the chance of a typographical error, it is best to use data that has already been recorded. R
has a numerous data sets that may be accessed. To see a list of data sets available type data()
at the R
prompt. Rectangular data, where the variables are possibly different data types, is stored in a structure known as a data frame. Consider the data frame ChickWeight
.
> head(ChickWeight) # Show the first 6 rows of ChickWeight
Grouped Data: weight ~ Time | Chick
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
> str(ChickWeight) # Show structure of ChickWeight
Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 578 obs. of 4 variables:
$ weight: num 42 51 59 64 76 93 106 125 149 171 ...
$ Time : num 0 2 4 6 8 10 12 14 16 18 ...
$ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
$ Diet : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "formula")=Class 'formula' language weight ~ Time | Chick
.. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
- attr(*, "outer")=Class 'formula' language ~Diet
.. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
- attr(*, "labels")=List of 2
..$ x: chr "Time"
..$ y: chr "Body weight"
- attr(*, "units")=List of 2
..$ x: chr "(days)"
..$ y: chr "(gm)"
> summary(ChickWeight) # Compute summary of ChickWeight
weight Time Chick Diet
Min. : 35.0 Min. : 0.00 13 : 12 1:220
1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
Median :103.0 Median :10.00 20 : 12 3:120
Mean :121.8 Mean :10.72 10 : 12 4:118
3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
Max. :373.0 Max. :21.00 19 : 12
(Other):506
To read data from an external file, consider using read.table()
or one of its wrappers. In the next example, the function read.csv()
, a wrapper for read.table()
when reading comma separated files, is used to read a *.csv
file from a URL.
> BW <- read.csv("http://www1.appstate.edu/~arnholta/Data/Beerwings.csv")
> str(BW)
'data.frame': 30 obs. of 4 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Hotwings: int 4 5 5 6 7 7 7 8 8 8 ...
$ Beer : int 24 0 12 12 12 12 24 24 0 12 ...
$ Gender : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 2 1 2 2 ...
To learn more about importing data into R
with other file types, consider taking the DataCamp course Importing Data in R.
In this example, consider the data stored in the data frame ChickWeight
that records the weight, time, chick, and diet from an experiment on early growth of chicks. What does the distribution of weight
look like? Does diet play a role in weight gain? The package ggplot2
(Wickham and Chang (2016)) is used to create side by side boxplots of weight by diet (See Figure 3.1). To learn more about using ggplot2
, consider the DataCamp course Data Visualization with ggplot2. Note that the boxplot for Diet 1 is noticeably wider than the other boxplots because there are more chicks in Diet 1 than the other Diets (See Table 3.1).
> library(tidyverse) # load tidyverse collection of packages
> ggplot(data = ChickWeight, aes(x = Diet, y = weight, fill = Diet)) +
+ geom_boxplot(notch = TRUE, varwidth = TRUE) +
+ theme_bw() +
+ guides(fill = FALSE) +
+ labs(x = "Diet", y = "Weight (g)") +
+ coord_flip()
Another representation of the distribution of weight
by Diet
is shown in Figure 3.2 with density plots.
> ggplot(data = ChickWeight, aes(x = weight, fill = Diet)) +
+ geom_density() +
+ facet_grid(Diet ~ .) +
+ theme_bw() +
+ labs(y = "", x = "Weight (g)")
Each density plot for the four diets shown in Figure 3.2 is skewed to the right. Next, appropriate measures of center and spread are computed using a tidyverse approach. To learn more about the tidyverse approach to analyzing data, consider the DataCamp course Introduction to the Tidyverse.
> CW <- ChickWeight %>%
+ group_by(Diet) %>%
+ summarize(Center = median(weight), Spread = IQR(weight), n = n())
> CW
# A tibble: 4 x 4
Diet Center Spread n
<fctr> <dbl> <dbl> <int>
1 1 88.0 78.75 220
2 2 104.5 97.50 120
3 3 125.5 131.25 120
4 4 129.5 113.50 118
Diet | Center | Spread | n |
---|---|---|---|
1 | 88.0 | 78.75 | 220 |
2 | 104.5 | 97.50 | 120 |
3 | 125.5 | 131.25 | 120 |
4 | 129.5 | 113.50 | 118 |
Consider the data frame VIT2005
from the R
package PASWR2
by Arnholt (2016). The package PASWR2
must be installed before it can be loaded. To install PASWR2
, type install.packages("PASWR2")
at the R
prompt. To load a package, use the command library(PackageName)
. You only need to install a package once; but, the package must be loaded each R
session to access functions and data sets in the package. The data frame VIT2005
has values for several variables including total price in euros (totalprice
) and area in square meters (area
) for apartments in Vittoria, Spain. The code below creates the scatter plot of totalprice
versus area
and superimposes a least squares line over the points as shown in Figure 4.1.
> library(PASWR2)
> ggplot(data = VIT2005, aes(x = area, y = totalprice)) +
+ geom_point(alpha = 0.5, color = "darkgreen") +
+ theme_bw() +
+ geom_smooth(method = "lm") +
+ labs(x = "Area (square meters)", y = "Total Price (euros)")
> slr_mod <- lm(totalprice ~ area, data = VIT2005)
> summary(slr_mod)
Call:
lm(formula = totalprice ~ area, data = VIT2005)
Residuals:
Min 1Q Median 3Q Max
-156126 -21564 -2155 19493 120674
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40822.4 12170.1 3.354 0.00094 ***
area 2704.8 133.6 20.243 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 40810 on 216 degrees of freedom
Multiple R-squared: 0.6548, Adjusted R-squared: 0.6532
F-statistic: 409.8 on 1 and 216 DF, p-value: < 2.2e-16
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 40822.416 | 12170.0915 | 3.354323 | 0.0009396 |
area | 2704.751 | 133.6157 | 20.242761 | 0.0000000 |
The code below creates a scatterplot of totalprice
versus area
while mapping the variable toilets
to the color aesthetic. Figure 4.2 shows the resulting scatterplot along with three different least squares lines.
> ggplot(data = VIT2005, aes(x = area, y = totalprice, color = as.factor(toilets))) +
+ geom_point(alpha = 0.5) +
+ theme_bw() +
+ geom_smooth(method = "lm", se = FALSE) +
+ geom_smooth(method = "lm", aes(group = 1), se = FALSE, linetype = "dashed") +
+ labs(color = "Number of\n Toilets") +
+ scale_color_manual(values = c("purple", "darkred")) +
+ labs(x = "Area (square meters)", y = "Total Price (euros)")
DataCamp had 92 premium content courses as of the time this article was written. While the premium content courses are not free for general public use, qualified faculty can request access to DataCamp premium content courses for their classes. The author has used over fifteen of the premium content courses with his students in various classes. To learn more about academic access to premium content courses see DataCamp for the Classroom.
If you are going to use RStudio
with a class, many problems can be averted by providing the class access to some form of pre-installed/configured RStudio
. There are many ways one can deploy RStudio
. The author uses a virtual machine to host RStudio Server Pro where all students have access to RStudio
via a web browser. Using RStudio Server Pro places all students in the class on a level playing field, assuming the students have internet access. Faculty can request RStudio
waive the $10,000/year license fee for the RStudio Server Pro when it is used for teaching. Individuals can also set up their own cloud server for as little as $5/month by following Dean Attali’s directions on How to get your very own RStudio Server and Shiny Server with DigitalOcean.
Arnholt, Alan T. 2016. PASWR2: Probability and Statistics with R, Second Edition. https://CRAN.R-project.org/package=PASWR2.
R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
RStudio Team. 2016. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, Inc. http://www.rstudio.com/.
Tippmann, Sylvia. 2015. “Programming Tools: Adventures with R.” Nature News 517 (7532): 109. doi:10.1038/517109a.
Vance, Ashlee. 2009. “R, the Software, Finds Fans in Data Analysts.” The New York Times, January. http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=all.
Wickham, Hadley, and Winston Chang. 2016. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.