1 Introduction

This article provides ideas, pointers, and teasers for using R (R Core Team (2017)) with the RStudio (RStudio Team (2016)) integrated development editor (IDE). R is an interactive programming and graphics environment (“R FAQ” 2017) that is the lingua franca for research statisticians; however, R is not just for research statisticians! Data analysts (Vance (2009)) and an ever growing number of scientists from outside of the statistical sciences (Tippmann (2015)) are using R as their computational engine of choice. Because R is free, it has become a very attractive tool to use in high school mathematics and statistics courses as well. The remainder of the article assumes the user has already installed both R and RStudio.

2 Using R inside RStudio

R can be run as a stand alone; however, it is more convenient to run R inside an IDE, and one of more popular and user friendly IDEs is the RStudio IDE. When the RStudio IDE is launched on a MAC, it appears as shown in Figure 2.1.

RStudio with three panes

Figure 2.1: RStudio with three panes

The left pane in Figure 2.1 is known as the Console pane. The user may enter commands at the R prompt (>) or use an R script to maintain a record of all commands sent to R.

2.1 Using R as a calculator

Basic arithmetic may be performed at the R command prompt much like using a calculator. The command prompt is shown below; but, the reader only needs to enter commands after the R prompt, not type the prompt. The pound or hash symbol (#) is used to comment code in R.

> 2 + 3         # add two numbers
[1] 5
> (5 - 3)/3     # subtraction and division
[1] 0.6666667
> 5^3           # raise 5 to the third power
[1] 125
> 2 * pi        # 2 times Pi  
[1] 6.283185

There are thousands of mathematical and statistical functions available in R. A few R functions that all work in a similar fashion by including arguments inside parentheses are shown below.

> sqrt(49)    # square root of 49
[1] 7
> exp(2)      # e raised to the second power
[1] 7.389056
> sin(pi/2)   # sin of Pi/2
[1] 1

Complete the free DataCamp online Introduction to R to learn more about R.

2.2 Writing your own function

In this example, the exponential probability density function defined in (2.1) with \(\lambda = 1/10\)

\[\begin{equation} f(x) = \begin{cases} \lambda e^{-\lambda x} & x \geq 0\\ 0 & x < 0 \end{cases} \tag{2.1} \end{equation}\]

is used to compute numerically the expected value and the variance of \(X\). The expected value of a continuous random variable \(X\) is defined as

\[E(X) = \int_{-\infty}^{\infty}x\cdot f(x)\, dx,\] while the variance of a random variable \(X\) is defined as

\[\text{Var}(X) = \int_{-\infty}^{\infty}(x - \mu_X)^2 \cdot f(x)\, dx.\]

The integrate() function is used first to verify the function f is a valid probability density function. To create a function in R, provide a name (f in the example below) for the function, use the keyword function(x), then define the function inside curly braces ({}). To learn more about writing functions, consider the DataCamp course Writing Functions in R.

> f <- function(x){1/10*exp(-x/10)}
> integrate(f, 0, Inf)      # Show f integrates to 1
1 with absolute error < 0.00011

\[\int_{0}^{\infty}\tfrac{1}{10}e^{-\tfrac{x}{10}} \, dx = 1\]

> xfx <- function(x){x * 1/10*exp(-x/10)}
> EX <- integrate(xfx, 0, Inf)$value    # E(X)
> EX
[1] 10

\[E(X) = \int_{0}^{\infty}x \cdot \tfrac{1}{10}e^{-\tfrac{x}{10}} \, dx = 10\]

> vx <- function(x){(x - EX)^2 * 1/10*exp(-x/10)}
> VX <- integrate(vx, 0, Inf)$value
> VX
[1] 100

\[\text{Var}(X) = \int_{0}^{\infty}(x - 10)^2 \cdot \tfrac{1}{10}e^{-\tfrac{x}{10}} \, dx = 100\]

Note: R has already defined the exponential probability density function (dexp). The function f was used to show the reader how to define a function. If one is actually working with an exponential probability density function, it is best to use the functions R has already defined.

> integrate(dexp, rate= 1/10, lower = 0, upper = Inf)  # show valid pdf
1 with absolute error < 0.00011
> integrate(dexp, rate = 1/10, 1, 2)    # P(1 < X < 2)
0.08610666 with absolute error < 9.6e-16
> # Or
> a2 <- pexp(2, 1/10) - pexp(1, 1/10)
> a2
[1] 0.08610666

\[\int_1^2\tfrac{1}{10}e^{-\tfrac{x}{10}} = 0.0861067\]

2.3 Data

To enter data as a vector, one may use the c() function as shown next.

> weight <- c(60, 43, 54, 46, 51)  # weight of children in pounds
> mean(weight)                     # mean weight
[1] 50.8
> sd(weight)                       # standard deviation of weight
[1] 6.685806
> IQR(weight)                      # Interquartile range of weight
[1] 8
> fivenum(weight)                  # Tukey five number summary
[1] 43 46 51 54 60

To reduce the chance of a typographical error, it is best to use data that has already been recorded. R has a numerous data sets that may be accessed. To see a list of data sets available type data() at the R prompt. Rectangular data, where the variables are possibly different data types, is stored in a structure known as a data frame. Consider the data frame ChickWeight.

> head(ChickWeight)      # Show the first 6 rows of ChickWeight
Grouped Data: weight ~ Time | Chick
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1
> str(ChickWeight)       # Show structure of ChickWeight
Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':  578 obs. of  4 variables:
 $ weight: num  42 51 59 64 76 93 106 125 149 171 ...
 $ Time  : num  0 2 4 6 8 10 12 14 16 18 ...
 $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
 $ Diet  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "formula")=Class 'formula'  language weight ~ Time | Chick
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "outer")=Class 'formula'  language ~Diet
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "labels")=List of 2
  ..$ x: chr "Time"
  ..$ y: chr "Body weight"
 - attr(*, "units")=List of 2
  ..$ x: chr "(days)"
  ..$ y: chr "(gm)"
> summary(ChickWeight)   # Compute summary of ChickWeight
     weight           Time           Chick     Diet   
 Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
 1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
 Median :103.0   Median :10.00   20     : 12   3:120  
 Mean   :121.8   Mean   :10.72   10     : 12   4:118  
 3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
 Max.   :373.0   Max.   :21.00   19     : 12          
                                 (Other):506          

To read data from an external file, consider using read.table() or one of its wrappers. In the next example, the function read.csv(), a wrapper for read.table() when reading comma separated files, is used to read a *.csv file from a URL.

> BW <- read.csv("http://www1.appstate.edu/~arnholta/Data/Beerwings.csv")
> str(BW)
'data.frame':   30 obs. of  4 variables:
 $ ID      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Hotwings: int  4 5 5 6 7 7 7 8 8 8 ...
 $ Beer    : int  24 0 12 12 12 12 24 24 0 12 ...
 $ Gender  : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 2 1 2 2 ...

To learn more about importing data into R with other file types, consider taking the DataCamp course Importing Data in R.

3 Exploratory Data Analysis

In this example, consider the data stored in the data frame ChickWeight that records the weight, time, chick, and diet from an experiment on early growth of chicks. What does the distribution of weight look like? Does diet play a role in weight gain? The package ggplot2 (Wickham and Chang (2016)) is used to create side by side boxplots of weight by diet (See Figure 3.1). To learn more about using ggplot2, consider the DataCamp course Data Visualization with ggplot2. Note that the boxplot for Diet 1 is noticeably wider than the other boxplots because there are more chicks in Diet 1 than the other Diets (See Table 3.1).

> library(tidyverse)  # load tidyverse collection of packages
> ggplot(data = ChickWeight, aes(x = Diet, y = weight, fill = Diet)) + 
+   geom_boxplot(notch = TRUE, varwidth = TRUE) + 
+   theme_bw() + 
+   guides(fill = FALSE) + 
+   labs(x = "Diet", y = "Weight (g)") + 
+   coord_flip()
Boxplot of weight gain by diet

Figure 3.1: Boxplot of weight gain by diet

Another representation of the distribution of weight by Diet is shown in Figure 3.2 with density plots.

> ggplot(data = ChickWeight, aes(x = weight, fill = Diet)) + 
+   geom_density() + 
+   facet_grid(Diet ~ .) + 
+   theme_bw() + 
+   labs(y = "", x = "Weight (g)")
Density plots of weight gain by diet

Figure 3.2: Density plots of weight gain by diet

Each density plot for the four diets shown in Figure 3.2 is skewed to the right. Next, appropriate measures of center and spread are computed using a tidyverse approach. To learn more about the tidyverse approach to analyzing data, consider the DataCamp course Introduction to the Tidyverse.

> CW <- ChickWeight %>% 
+   group_by(Diet) %>% 
+   summarize(Center = median(weight), Spread = IQR(weight), n = n())
> CW
# A tibble: 4 x 4
    Diet Center Spread     n
  <fctr>  <dbl>  <dbl> <int>
1      1   88.0  78.75   220
2      2  104.5  97.50   120
3      3  125.5 131.25   120
4      4  129.5 113.50   118
Table 3.1: Summary statistics of weight grouped by Diet
Diet Center Spread n
1 88.0 78.75 220
2 104.5 97.50 120
3 125.5 131.25 120
4 129.5 113.50 118

4 Simple Linear Regression

Consider the data frame VIT2005 from the R package PASWR2 by Arnholt (2016). The package PASWR2 must be installed before it can be loaded. To install PASWR2, type install.packages("PASWR2") at the R prompt. To load a package, use the command library(PackageName). You only need to install a package once; but, the package must be loaded each R session to access functions and data sets in the package. The data frame VIT2005 has values for several variables including total price in euros (totalprice) and area in square meters (area) for apartments in Vittoria, Spain. The code below creates the scatter plot of totalprice versus area and superimposes a least squares line over the points as shown in Figure 4.1.

> library(PASWR2)
> ggplot(data = VIT2005, aes(x = area, y = totalprice)) + 
+   geom_point(alpha = 0.5, color = "darkgreen") + 
+   theme_bw() + 
+   geom_smooth(method = "lm") + 
+   labs(x = "Area (square meters)", y = "Total Price (euros)")
Scatterplot of `totalprice` versus `area`

Figure 4.1: Scatterplot of totalprice versus area

> slr_mod <- lm(totalprice ~ area, data = VIT2005)
> summary(slr_mod)

Call:
lm(formula = totalprice ~ area, data = VIT2005)

Residuals:
    Min      1Q  Median      3Q     Max 
-156126  -21564   -2155   19493  120674 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  40822.4    12170.1   3.354  0.00094 ***
area          2704.8      133.6  20.243  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40810 on 216 degrees of freedom
Multiple R-squared:  0.6548,    Adjusted R-squared:  0.6532 
F-statistic: 409.8 on 1 and 216 DF,  p-value: < 2.2e-16
Table 4.1: Coefficients from regressing totalprice on area
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40822.416 12170.0915 3.354323 0.0009396
area 2704.751 133.6157 20.242761 0.0000000

The code below creates a scatterplot of totalprice versus area while mapping the variable toilets to the color aesthetic. Figure 4.2 shows the resulting scatterplot along with three different least squares lines.

> ggplot(data = VIT2005, aes(x = area, y = totalprice, color = as.factor(toilets))) + 
+   geom_point(alpha = 0.5) + 
+   theme_bw() + 
+   geom_smooth(method = "lm", se = FALSE) + 
+   geom_smooth(method = "lm", aes(group = 1), se = FALSE, linetype = "dashed") +
+   labs(color = "Number of\n Toilets") + 
+   scale_color_manual(values = c("purple", "darkred")) +
+   labs(x = "Area (square meters)", y = "Total Price (euros)")
Scatterplot of `totalprice` versus `area` with points colored according to the number of bathrooms --- the purple line is the least squares line fit to apartments with one toilet; the dark red line is the least squares line fit to apartments with two toilets; and the dashed blue line is the least squares line for all apartments

Figure 4.2: Scatterplot of totalprice versus area with points colored according to the number of bathrooms — the purple line is the least squares line fit to apartments with one toilet; the dark red line is the least squares line fit to apartments with two toilets; and the dashed blue line is the least squares line for all apartments

5 DataCamp

DataCamp had 92 premium content courses as of the time this article was written. While the premium content courses are not free for general public use, qualified faculty can request access to DataCamp premium content courses for their classes. The author has used over fifteen of the premium content courses with his students in various classes. To learn more about academic access to premium content courses see DataCamp for the Classroom.

6 RStudio

If you are going to use RStudio with a class, many problems can be averted by providing the class access to some form of pre-installed/configured RStudio. There are many ways one can deploy RStudio. The author uses a virtual machine to host RStudio Server Pro where all students have access to RStudio via a web browser. Using RStudio Server Pro places all students in the class on a level playing field, assuming the students have internet access. Faculty can request RStudio waive the $10,000/year license fee for the RStudio Server Pro when it is used for teaching. Individuals can also set up their own cloud server for as little as $5/month by following Dean Attali’s directions on How to get your very own RStudio Server and Shiny Server with DigitalOcean.

References

Arnholt, Alan T. 2016. PASWR2: Probability and Statistics with R, Second Edition. https://CRAN.R-project.org/package=PASWR2.

R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

“R FAQ.” 2017. Accessed November 22. https://cran.r-project.org/doc/FAQ/R-FAQ.html#What-is-R_003f.

RStudio Team. 2016. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, Inc. http://www.rstudio.com/.

Tippmann, Sylvia. 2015. “Programming Tools: Adventures with R.” Nature News 517 (7532): 109. doi:10.1038/517109a.

Vance, Ashlee. 2009. “R, the Software, Finds Fans in Data Analysts.” The New York Times, January. http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=all.

Wickham, Hadley, and Winston Chang. 2016. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.