Reading In Data

Author

Alan T. Arnholt

Published

Last modified on August 15, 2023 10:17:53 Eastern Daylight Time

Data

This section presents a data set that shows how different data types should be read into R as well as several functions that are useful for working with different types of R objects. Consider the data stored as a CSV file at

https://raw.githubusercontent.com/alanarnholt/Data/master/POPLAR3.CSV

The following description of the data is from Minitab 15:

MINITAB

In an effort to maximize yield, researchers designed an experiment to determine how two factors, Site and Treatment, influence the Weight of four-year-old poplar clones. They planted trees on two sites: Site 1 is a moist site with rich soil, and Site 2 is a dry, sandy site. They applied four different treatments to the trees: Treatment 1 was the control (no treatment); Treatment 2 used fertilizer; Treatment 3 used irrigation; and Treatment 4 used both fertilizer and irrigation. To account for a variety of weather conditions, the researchers replicated the data by planting half the trees in Year 1, and the other half in Year 2.

Base R

The data from POPLAR3.CSV is read into the data frame poplar using the read.csv() function. The first five rows of the data frame are then displayed by calling head() with the argument n = 5, which overrides the default of n = 6 rows; knitr::kable() formats the result as a table.

R Code
site <- "https://raw.githubusercontent.com/alanarnholt/Data/master/POPLAR3.CSV"
poplar <- read.csv(file = url(site))
knitr::kable(head(poplar, n = 5))  # show first five rows
site year treatment diameter height weight age
1 1 1 2.23 3.76 0.17 3
1 1 1 2.12 3.15 0.15 3
1 1 1 1.06 1.85 0.02 3
1 1 1 2.12 3.64 0.16 3
1 1 1 2.99 4.64 0.37 3

When dealing with imported data sets, it is always good to examine their contents using functions such as str() and summary(), which show the structure and provide appropriate summaries, respectively, for different types of objects.

R Code
str(poplar)
'data.frame':   298 obs. of  7 variables:
 $ site     : int  1 1 1 1 1 1 1 1 1 2 ...
 $ year     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
 $ diameter : num  2.23 2.12 1.06 2.12 2.99 4.01 2.41 2.75 2.2 4.09 ...
 $ height   : num  3.76 3.15 1.85 3.64 4.64 5.25 4.07 4.72 4.17 5.73 ...
 $ weight   : num  0.17 0.15 0.02 0.16 0.37 0.73 0.22 0.3 0.19 0.78 ...
 $ age      : int  3 3 3 3 3 3 3 3 3 3 ...
summary(poplar)
      site           year        treatment        diameter      
 Min.   :1.00   Min.   :1.00   Min.   :1.000   Min.   :-99.000  
 1st Qu.:1.00   1st Qu.:1.00   1st Qu.:2.000   1st Qu.:  3.605  
 Median :2.00   Median :2.00   Median :2.500   Median :  5.175  
 Mean   :1.51   Mean   :1.51   Mean   :2.503   Mean   :  3.862  
 3rd Qu.:2.00   3rd Qu.:2.00   3rd Qu.:3.750   3rd Qu.:  6.230  
 Max.   :2.00   Max.   :2.00   Max.   :4.000   Max.   :  8.260  
     height            weight             age       
 Min.   :-99.000   Min.   :-99.000   Min.   :3.000  
 1st Qu.:  5.495   1st Qu.:  0.605   1st Qu.:3.000  
 Median :  6.910   Median :  1.640   Median :4.000  
 Mean   :  5.902   Mean   :  1.099   Mean   :3.507  
 3rd Qu.:  8.750   3rd Qu.:  3.435   3rd Qu.:4.000  
 Max.   : 10.900   Max.   :  6.930   Max.   :4.000  

From typing str(poplar) at the R prompt, one can see that all seven variables are either integer or numeric. From the description, the variables Site and Treatment are factors. Further investigation into the experiment reveals that Year and Age are factors as well. Recall that factors are an extension of vectors designed for storing categorical information. The results of summary(poplar) indicate the minimum values for Diameter, Height, and Weight are all -99, which does not make sense unless one is told that a value of -99 for these variables represents a missing value.

Once one understands that the variables Site, Year, Treatment, and Age are factors and that the value -99 has been used to represent missing values for the variables Diameter, Height, and Weight, appropriate arguments to read.csv() can be entered. The data is now read into the object poplarC using na.strings = "-99" to store the missing values as NA. The argument colClasses= requires a vector that indicates the desired class of each column, given in the order in which the columns appear in the file.

R Code
poplarC <- read.csv(file = url(site), na.strings = "-99", 
                    colClasses = c(rep("factor", 3), 
                                   rep("numeric", 3), 
                                   "factor"))
str(poplarC)
'data.frame':   298 obs. of  7 variables:
 $ site     : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 2 ...
 $ year     : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ treatment: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ diameter : num  2.23 2.12 1.06 2.12 2.99 4.01 2.41 2.75 2.2 4.09 ...
 $ height   : num  3.76 3.15 1.85 3.64 4.64 5.25 4.07 4.72 4.17 5.73 ...
 $ weight   : num  0.17 0.15 0.02 0.16 0.37 0.73 0.22 0.3 0.19 0.78 ...
 $ age      : Factor w/ 2 levels "3","4": 1 1 1 1 1 1 1 1 1 1 ...
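
The effect of na.strings= and colClasses= can also be checked on a small example. The following sketch is not part of the original analysis: it reads a made-up three-row CSV supplied through the text= argument of read.csv() (the values and the object names csv_text, toy, and na_counts are invented for illustration) and then counts the NA values in each column.

```r
# Made-up data standing in for POPLAR3.CSV; -99 marks a missing measurement
csv_text <- "site,treatment,diameter,weight
1,1,2.23,0.17
1,2,-99,-99
2,4,4.09,0.78"

# Read the text, converting -99 to NA and fixing the class of each column
toy <- read.csv(text = csv_text, na.strings = "-99",
                colClasses = c("factor", "factor", "numeric", "numeric"))

# Number of missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Class of each column after import
print(sapply(toy, class))
```

Counting the NA values right after import is a quick way to confirm that the missing-value code was recognized in the intended columns.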

In the event different values (999, 99, 9999) are used to represent missing values for different variables (var1, var2, var3) in a data set, the argument na.strings= can no longer solve the problem directly. Although one can pass a vector of the form na.strings = c("999", "99", "9999"), this replaces every value equal to 999, 99, or 9999 with NA, regardless of the column in which it appears. If the first variable has a legitimate value of 99, then it too would be replaced with an NA value. One general solution is to read the data set into a data frame (DF), to assign the data frame to a different name (df) so that the cleaned up data set is not confused with the original data, and to use filtering to assign NAs to values of var1, var2, and var3 that have entries of 999, 99, and 9999, respectively.

R Code
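# Illustrative only: assumes the file at site has columns var1, var2, and var3
DF <- read.table(file = url(site), header = TRUE)
df <- DF  # work on a copy so the original data stay intact
# %in% returns FALSE (not NA) for entries that are already missing
df$var1[df$var1 %in% 999]  <- NA
df$var2[df$var2 %in% 99]   <- NA
df$var3[df$var3 %in% 9999] <- NA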
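
The column-by-column replacement described above can be tried on a small, self-contained example. This is a minimal sketch under stated assumptions: the data frame raw, its values, and the named vector sentinels are all hypothetical, and the loop is only one of several ways to apply a different missing-value code to each column.

```r
# Hypothetical data: 999, 99, and 9999 mark missing values in var1, var2, var3
raw <- data.frame(var1 = c(99, 999, 5),
                  var2 = c(99, 3, NA),
                  var3 = c(9999, 999, 1))

# One missing-value code per column, matched by column name
sentinels <- c(var1 = 999, var2 = 99, var3 = 9999)

clean <- raw  # keep the original data untouched
for (v in names(sentinels)) {
  # %in% treats existing NA entries as non-matches, so the index has no NAs
  clean[[v]][clean[[v]] %in% sentinels[[v]]] <- NA
}
print(clean)
```

Note that the legitimate 99 in var1 and the legitimate 999 in var3 are left alone, which is exactly what a single na.strings= vector cannot guarantee.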

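Once a variable has its class changed from int to factor, labeling the levels of the factor is straightforward. To facilitate analysis of the poplarC data, descriptive labels are assigned to the levels of Site and Treatment. The labels for site replace the original levels, while the labeled treatment is stored in a new column, Treatment, so the data frame now has eight variables.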

R Code
poplarC$site <- factor(poplarC$site, labels = c("Moist", "Dry"))
TreatmentLevels <- c("Control", "Fertilizer", "Irrigation", "FertIrriga")
poplarC$Treatment <- factor(poplarC$treatment, labels = TreatmentLevels)
str(poplarC)
'data.frame':   298 obs. of  8 variables:
 $ site     : Factor w/ 2 levels "Moist","Dry": 1 1 1 1 1 1 1 1 1 2 ...
 $ year     : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ treatment: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ diameter : num  2.23 2.12 1.06 2.12 2.99 4.01 2.41 2.75 2.2 4.09 ...
 $ height   : num  3.76 3.15 1.85 3.64 4.64 5.25 4.07 4.72 4.17 5.73 ...
 $ weight   : num  0.17 0.15 0.02 0.16 0.37 0.73 0.22 0.3 0.19 0.78 ...
 $ age      : Factor w/ 2 levels "3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ Treatment: Factor w/ 4 levels "Control","Fertilizer",..: 1 1 1 1 1 1 1 1 1 1 ...
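
Because labels= are matched to the factor levels by position, it is worth confirming that each code received the intended label. The sketch below uses a short made-up vector of treatment codes (the objects codes and labeled are hypothetical, not taken from poplarC) and cross-tabulates the original codes against the new labels.

```r
# Made-up treatment codes stored as a factor, as in poplarC$treatment
codes <- factor(c("1", "2", "3", "4", "1"))

# Labels are assigned in the order of the levels "1", "2", "3", "4"
TreatmentLevels <- c("Control", "Fertilizer", "Irrigation", "FertIrriga")
labeled <- factor(codes, labels = TreatmentLevels)

# Each row should have counts in exactly one column
print(table(codes, labeled))
```

A cross-tabulation with nonzero counts only where a code meets its intended label indicates that the labels were assigned as planned.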

Using readr functions

R Code
library(readr)  # provides read_csv()
poplarR <- read_csv(file = url(site))
head(poplarR, n = 5)  # show first five rows
# A tibble: 5 × 7
   site  year treatment diameter height weight   age
  <dbl> <dbl>     <dbl>    <dbl>  <dbl>  <dbl> <dbl>
1     1     1         1     2.23   3.76   0.17     3
2     1     1         1     2.12   3.15   0.15     3
3     1     1         1     1.06   1.85   0.02     3
4     1     1         1     2.12   3.64   0.16     3
5     1     1         1     2.99   4.64   0.37     3
summary(poplarR)
      site           year        treatment        diameter      
 Min.   :1.00   Min.   :1.00   Min.   :1.000   Min.   :-99.000  
 1st Qu.:1.00   1st Qu.:1.00   1st Qu.:2.000   1st Qu.:  3.605  
 Median :2.00   Median :2.00   Median :2.500   Median :  5.175  
 Mean   :1.51   Mean   :1.51   Mean   :2.503   Mean   :  3.862  
 3rd Qu.:2.00   3rd Qu.:2.00   3rd Qu.:3.750   3rd Qu.:  6.230  
 Max.   :2.00   Max.   :2.00   Max.   :4.000   Max.   :  8.260  
     height            weight             age       
 Min.   :-99.000   Min.   :-99.000   Min.   :3.000  
 1st Qu.:  5.495   1st Qu.:  0.605   1st Qu.:3.000  
 Median :  6.910   Median :  1.640   Median :4.000  
 Mean   :  5.902   Mean   :  1.099   Mean   :3.507  
 3rd Qu.:  8.750   3rd Qu.:  3.435   3rd Qu.:4.000  
 Max.   : 10.900   Max.   :  6.930   Max.   :4.000  
# Read the file again, treating -99 as missing
poplarR1 <- read_csv(file = url(site), na = "-99")
summary(poplarR1)
      site           year        treatment        diameter         height      
 Min.   :1.00   Min.   :1.00   Min.   :1.000   Min.   :1.030   Min.   : 1.150  
 1st Qu.:1.00   1st Qu.:1.00   1st Qu.:2.000   1st Qu.:3.675   1st Qu.: 5.530  
 Median :2.00   Median :2.00   Median :2.500   Median :5.200   Median : 6.950  
 Mean   :1.51   Mean   :1.51   Mean   :2.503   Mean   :4.909   Mean   : 6.969  
 3rd Qu.:2.00   3rd Qu.:2.00   3rd Qu.:3.750   3rd Qu.:6.235   3rd Qu.: 8.785  
 Max.   :2.00   Max.   :2.00   Max.   :4.000   Max.   :8.260   Max.   :10.900  
                                               NA's   :3       NA's   :3       
     weight           age       
 Min.   :0.010   Min.   :3.000  
 1st Qu.:0.635   1st Qu.:3.000  
 Median :1.680   Median :4.000  
 Mean   :2.117   Mean   :3.507  
 3rd Qu.:3.470   3rd Qu.:4.000  
 Max.   :6.930   Max.   :4.000  
 NA's   :3                      
str(poplarR1)
spc_tbl_ [298 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ site     : num [1:298] 1 1 1 1 1 1 1 1 1 2 ...
 $ year     : num [1:298] 1 1 1 1 1 1 1 1 1 1 ...
 $ treatment: num [1:298] 1 1 1 1 1 1 1 1 1 1 ...
 $ diameter : num [1:298] 2.23 2.12 1.06 2.12 2.99 4.01 2.41 2.75 2.2 4.09 ...
 $ height   : num [1:298] 3.76 3.15 1.85 3.64 4.64 5.25 4.07 4.72 4.17 5.73 ...
 $ weight   : num [1:298] 0.17 0.15 0.02 0.16 0.37 0.73 0.22 0.3 0.19 0.78 ...
 $ age      : num [1:298] 3 3 3 3 3 3 3 3 3 3 ...
 - attr(*, "spec")=
  .. cols(
  ..   site = col_double(),
  ..   year = col_double(),
  ..   treatment = col_double(),
  ..   diameter = col_double(),
  ..   height = col_double(),
  ..   weight = col_double(),
  ..   age = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
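
The readr output above leaves site, year, treatment, and age stored as doubles. One possible way to obtain factors at read time is to pass a column specification through the col_types argument of read_csv(). The sketch below assumes readr 2.0 or later (for literal data wrapped in I()) and uses a made-up inline CSV; the names csv_text and toyR are hypothetical.

```r
library(readr)

# Made-up data standing in for POPLAR3.CSV; -99 marks a missing measurement
csv_text <- "site,treatment,diameter\n1,1,2.23\n1,2,-99\n2,4,4.09\n"

# Declare the class of each column and the missing-value code in one call
toyR <- read_csv(I(csv_text), na = "-99",
                 col_types = cols(site = col_factor(),
                                  treatment = col_factor(),
                                  diameter = col_double()))

str(toyR)
```

With no levels supplied, col_factor() takes the levels from the values found in the data, so the factor levels here are the codes themselves rather than descriptive labels.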

Using data.table function fread()

R Code
library(data.table)  # provides fread()
poplarF <- fread(input = site, na.strings = "-99")
str(poplarF)
Classes 'data.table' and 'data.frame':  298 obs. of  7 variables:
 $ site     : int  1 1 1 1 1 1 1 1 1 2 ...
 $ year     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
 $ diameter : num  2.23 2.12 1.06 2.12 2.99 4.01 2.41 2.75 2.2 4.09 ...
 $ height   : num  3.76 3.15 1.85 3.64 4.64 5.25 4.07 4.72 4.17 5.73 ...
 $ weight   : num  0.17 0.15 0.02 0.16 0.37 0.73 0.22 0.3 0.19 0.78 ...
 $ age      : int  3 3 3 3 3 3 3 3 3 3 ...
 - attr(*, ".internal.selfref")=<externalptr> 
summary(poplarF)
      site           year        treatment        diameter         height      
 Min.   :1.00   Min.   :1.00   Min.   :1.000   Min.   :1.030   Min.   : 1.150  
 1st Qu.:1.00   1st Qu.:1.00   1st Qu.:2.000   1st Qu.:3.675   1st Qu.: 5.530  
 Median :2.00   Median :2.00   Median :2.500   Median :5.200   Median : 6.950  
 Mean   :1.51   Mean   :1.51   Mean   :2.503   Mean   :4.909   Mean   : 6.969  
 3rd Qu.:2.00   3rd Qu.:2.00   3rd Qu.:3.750   3rd Qu.:6.235   3rd Qu.: 8.785  
 Max.   :2.00   Max.   :2.00   Max.   :4.000   Max.   :8.260   Max.   :10.900  
                                               NA's   :3       NA's   :3       
     weight           age       
 Min.   :0.010   Min.   :3.000  
 1st Qu.:0.635   1st Qu.:3.000  
 Median :1.680   Median :4.000  
 Mean   :2.117   Mean   :3.507  
 3rd Qu.:3.470   3rd Qu.:4.000  
 Max.   :6.930   Max.   :4.000  
 NA's   :3                      

Graphing Now

Basic scatterplot:

R Code
library(ggplot2)  # provides ggplot() and the geoms used below
ggplot(data = poplarC, 
       mapping = aes(x = diameter, y = height)) + 
  geom_point() 

# Color the points by treatment and add a smooth trend for each treatment
ggplot(data = poplarC, 
       mapping = aes(x = diameter, y = height, color = treatment)) + 
  geom_point() +
  theme_bw() + 
  geom_smooth(se = FALSE)

# Same plot, split into panels by age (columns) and site (rows)
ggplot(data = poplarC, 
       mapping = aes(x = diameter, y = height, color = treatment)) + 
  geom_point() +
  theme_bw() + 
  geom_smooth(se = FALSE) + 
  facet_grid(cols = vars(age), rows = vars(site))