7 Data Management4

Examining frequency distributions for each of your variables is the key to further guiding the decision making involved in quantitative research.


A random sample of 1,200 U.S. college students were asked the following questions as part of a larger survey: “What is your perception of your own body? Do you feel that you are overweight, underweight, or about right?” The following table shows part of the data (5 of the 1200 observations);

Student 25 Overweight
Student 26 About Right
Student 27 Underweight
Student 28 About Right
Student 29 About Right

Here is some information that would be interesting to get from these data:

  • What percentage of the sampled students fall into each category?
  • How are students divided across the three body image categories?

Are they equally divided? If not, do the percentages follow some other kind of pattern?

There is no way that we can answer these questions by looking at the raw data, which are in the form of a long list of 1,200 responses and thus not very useful. However, both these questions will be easily answered once we summarize and look at the frequency distribution of the variable BodyImage (i.e., once we summarize how often each of the categories occurs).

In order to summarize the distribution of a categorical variable, we ask our statistical software program to create a table of the different values (categories) the variable takes, how many times each value occurs (count), and, more importantly, how often each value occurs (percentages). Here is the table (i.e. frequency distribution) for our example:

Table 7.1: Body Image Distribution
About Right 855 71.3%
Overweight 235 19.6%
Underweight 110 9.2%
Total 1200 100%

Please watch the Chapter 07 video.

Data Management with R

During the class session, we will begin to work through how to make decisions about data management and how to put those decisions into action.

An understanding of basic operations to be used with your statistical software is a good place to start. In R, logical operators include ! (not), & and && (logical AND), | and || (logical OR). Relational operators in R include < (less than), > (greater than), <= (less than or equal), == (vector equality), and != (not equal).


1. Need to identify missing data

Often, you must define the response categories that represent missing data. For example, if the number 9 is used to represent a missing value, you must either designate in your program that this value represents missingness or else you must recode the variable into a missing data character that your statistical software recognizes. If you do not, the 9 will be treated as a real/meaningful value and will be included in each of your analyses.

> title_of_data_set$VAR1[title_of_data_set$VAR1 == 9] <- NA

2. Need to recode responses to “no” based on skip patterns

There are a number of skip outs in some data sets. For example, if we ask someone whether or not they have ever used marijuana, and they say “no”, it would not make sense to ask them more detailed questions about their marijuana use (e.g. quantity, frequency, onset, impairment, etc.). When analyzing more detailed questions regarding marijuana (e.g. have you ever smoked marijuana daily for a month or more?), those individuals that never used the substance may show up as missing data. Since they have never used marijuana, we can assume that their answer is “no”, they have never smoked marijuana daily. This would need to be explicitly recoded. Note that we commonly code a no as 0 and a yes as 1.

> title_of_data_set$VAR1[is.na(title_of_data_set$VAR1)] <- 0

3. Need to collapse response categories

If a variable has many response categories, it can be difficult to interpret the statistical analyses in which it is used. Alternatively, there may be too few subjects or observations identified by one or more response categories to allow for a successful analysis. In these cases, you would need to collapse across categories. Consider the variable S1Q6A from the data frame NESARC which has 14 levels that record the highest level of education of the participant. To collapse the categories into a dichotomous variable that indicates the presence of a high school degree, use the ifelse function. The levels 1, 2, 3, 4, 5, 6, and 7 of the variable S1Q6A correspond to education levels less than completing high school.

> library(PDS)
> NESARC$HS_DEGREE <- factor(ifelse(NESARC$S1Q6A %in% c("1", "2", "3", "4", "5", "6", "7"), "No", "Yes"))
   No   Yes 
 7849 35244 

4. Need to aggregate variables

In many cases, you will want to combine multiple variables into one. Consider creating create a new variable DepressLife which is Yes if the variable MAJORLIFE or DYSLIFE is a 1 (data frame NESARC).

> NESARC$DepressLife <- factor(ifelse( (NESARC$MAJORDEPLIFE == 1 | NESARC$DYSLIFE == 1), "Yes", "No"))
> summary(NESARC$DepressLife)
   No   Yes 
34894  8199 

5. Need to create continuous variables

If you are working with a number of items that represent a single construct, it may be useful to create a composite variable/score. For example, I want to use a list of nicotine dependence symptoms meant to address the presence or absence of nicotine dependence (e.g. tolerance, withdrawal, craving, etc.). Rather than using a dichotomous variable (i.e. nicotine dependence present/absent), I want to examine the construct as a dimensional scale (i.e. number of nicotine dependence symptoms). In this case, I would want to recode each symptom variable so that yes=1 and no=0 and then sum the items so that they represent one composite score.

> nd_sum <- title_of_data_set$nd_symptom1 + title_of_data_set$nd_symptom2 + title_of_data_set$nd_symptom3
> title_of_data_set$nd_sum <- nd_sum

6. Labeling variable responses/values

Given that nominal and ordinal variables have, or are given numeric response values (i.e. dummy codes), it can be useful to label those values so that the labels are displayed in your output.

> levels(title_of_data_set$VARIABLE) <- c("value", "value")

7. Need to further subset the sample

When using large data sets, it is often necessary to subset the data so that you are including only those observations that can assist in answering your particular research question. In these cases, you may want to select your own sample from within the survey’s sampling frame. For example, if you are interested in identifying demographic predictors of depression among Type II diabetes patients, you would plan to subset the data to subjects endorsing Type II Diabetes.

> title_of_subsetted_data <- title_of_data_set["diabetes2" == 1, ]
> # OR using dplyr
> library(dplyr)
> title_of_subsetted_data <- filter(title_of_data_set, "diabetes2" == 1)

Three different approaches to subsetting data will be illustrated. The first approach is to use the dplyr function filter; the second approach is to use indices; and the third approach is to use the function subset. Consider creating a subset of the NESARC data set where a person indicates

  1. He/she has smoked over 100 cigarettes (S3AQ1A == 1)
  2. He/she has smoked in the past year (CHECK321 == 1)
  3. He/she has typically smoked every day over the last year (S3AQ3B1 == 1)
  4. He/she is less than or equal to 25 years old (AGE <= 25)

The first approach uses the filter function with the %>% function. Although it is not a requirement, the data frame NESARC is converted to a data frame tbl per the advice given in the dplyr vignette.

NESARCsub1 <- tbl_df(NESARC) %>%
  filter(S3AQ1A == 1 & CHECK321 == 1 & S3AQ3B1 == 1 & AGE <= 25)
[1] 1320 3010

The second approach uses standard indexing.

                       NESARC$S3AQ3B1 == 1 & NESARC$AGE <= 25, ]
[1] 1320 3010

The third approach uses the subset function.

NESARCsub3 <- subset(NESARC, subset = S3AQ1A == 1 & CHECK321 == 1 & 
                       S3AQ3B1 == 1 & AGE <= 25)
[1] 1320 3010

NOTE: Often, you will need to create groups or sub-samples from the data set for the purpose of making comparisons. It is important to be certain that the groups that you would like to compare are of adequate size and number. For example, if you were interested in comparing complications of depression in parents who had lost a child through miscarriage vs. parents who had lost a child in the first year of life, it would be important to have large enough groups of each. It would not be appropriate to attempt to compare 5000 observations in the miscarriage group to only 9 observations in the first year group.

Extended Example

Use the data frame NESARC and create a new variable (NumberNicotineSymptoms) that is the sum of all of the nicotine dependence symptoms where a person indicates

  1. He/she has smoked over 100 cigarettes (S3AQ1A == 1)
  2. He/she has smoked in the past year (CHECK321 == 1)
  3. He/she has typically smoked every day over the last year (S3AQ3B1 == 1)
  4. He/she is less than or equal to 25 years old (AGE <= 25)

The following code selects the 67 variables that deal with nicotine dependence using the select and contains functions from dplyr. The ifelse function is used in the myfix function to convert values of 2 and 9 to 0.

DF <- tbl_df(NESARC) %>%
  filter(S3AQ1A ==1 & S3AQ3B1 == 1 & CHECK321 == 1 & AGE <= 25) %>% 
myfix <- function(x){ifelse(x %in% c(2, 9), 0, ifelse(x == 1, 1, NA))}
DF2 <- as.data.frame(apply(DF, 2, myfix))
DF2$NumberNicotineSymptoms <- apply(DF2, 1, sum, na.rm = TRUE)
nesarc <- tbl_df(NESARC) %>% 
  filter(S3AQ1A ==1 & S3AQ3B1 == 1 & CHECK321 == 1 & AGE <= 25) %>% 
  rename(Ethnicity = ETHRACE2A, Age = AGE, MajorDepression = MAJORDEPLIFE, 
         Sex = SEX, TobaccoDependence = TAB12MDX, DailyCigsSmoked = S3AQ3C1,
         AlcoholAD = ALCABDEPP12DX) %>% 
  select(Ethnicity, Age, MajorDepression, TobaccoDependence, DailyCigsSmoked, Sex, AlcoholAD)
nesarc <- data.frame(nesarc, NumberNicotineSymptoms = DF2$NumberNicotineSymptoms)
nesarc <- tbl_df(nesarc)
# Code 99 properly
nesarc$DailyCigsSmoked[nesarc$DailyCigsSmoked == 99] <- NA
# Create smoking categories
nesarc$DCScat <- cut(nesarc$DailyCigsSmoked, breaks = c(0, 5, 10, 15, 20, 98), include.lowest = FALSE)
# Label factors
nesarc$Ethnicity <- factor(nesarc$Ethnicity, 
                           labels = c("Caucasian", "African American", 
                                      "Native American", "Asian", "Hispanic"))
nesarc$TobaccoDependence <- factor(nesarc$TobaccoDependence, 
                                   labels = c("No Nicotine Dependence", 
                                              "Nicotine Dependence"))
nesarc$Sex <- factor(nesarc$Sex, labels =c("Female", "Male"))
nesarc$MajorDepression <- factor(nesarc$MajorDepression, 
                                 labels =c("No Depression", "Yes Depression"))
nesarc$AlcoholAD <- factor(nesarc$AlcoholAD, labels = c("No Alcohol", "Alcohol Abuse", "Alcohol Dependence", "Alcohol Abuse and Dependence"))
[1] 1320    9

See the PDS vignette for additional examples.

Data Management Assignment

Push to your private repository on GitHub:

  1. A program that manages your data.
  2. Output that displays 3 of your secondary (i.e. data managed) variables as frequency tables.