7 Data Management4
Examining frequency distributions for each of your variables is the key to further guiding the decision making involved in quantitative research.
EXAMPLE:
A random sample of 1,200 U.S. college students were asked the following questions as part of a larger survey: “What is your perception of your own body? Do you feel that you are overweight, underweight, or about right?” The following table shows part of the data (5 of the 1200 observations);
STUDENT | BODY IMAGE |
---|---|
Student 25 | Overweight |
Student 26 | About Right |
Student 27 | Underweight |
Student 28 | About Right |
Student 29 | About Right |
Here is some information that would be interesting to get from these data:
- What percentage of the sampled students fall into each category?
- How are students divided across the three body image categories?
Are they equally divided? If not, do the percentages follow some other kind of pattern?
There is no way that we can answer these questions by looking at the raw data, which are in the form of a long list of 1,200 responses and thus not very useful. However, both these questions will be easily answered once we summarize and look at the frequency distribution of the variable BodyImage
(i.e., once we summarize how often each of the categories occurs).
In order to summarize the distribution of a categorical variable, we ask our statistical software program to create a table of the different values (categories) the variable takes, how many times each value occurs (count), and, more importantly, how often each value occurs (percentages). Here is the table (i.e. frequency distribution) for our example:
CATEGORY | COUNT | PERCENTAGE |
---|---|---|
About Right | 855 | 71.3% |
Overweight | 235 | 19.6% |
Underweight | 110 | 9.2% |
Total | 1200 | 100% |
Please watch the Chapter 07 video below.
Data Management with R
During the class session, we will begin to work through how to make decisions about data management and how to put those decisions into action.
An understanding of basic operations to be used with your statistical software is a good place to start. In R
, logical operators include !
(not), &
and &&
(logical AND), |
and ||
(logical OR). Relational operators in R
include <
(less than), >
(greater than), <=
(less than or equal), ==
(vector equality), and !=
(not equal).
Examples:
1. Need to identify missing data
Often, you must define the response categories that represent missing data. For example, if the number 9 is used to represent a missing value, you must either designate in your program that this value represents missingness or else you must recode the variable into a missing data character that your statistical software recognizes. If you do not, the 9 will be treated as a real/meaningful value and will be included in each of your analyses.
> title_of_data_set$VAR1[title_of_data_set$VAR1 == 9] <- NA
2. Need to recode responses to “no” based on skip patterns
There are a number of skip outs in some data sets. For example, if we ask someone whether or not they have ever used marijuana, and they say “no”, it would not make sense to ask them more detailed questions about their marijuana use (e.g. quantity, frequency, onset, impairment, etc.). When analyzing more detailed questions regarding marijuana (e.g. have you ever smoked marijuana daily for a month or more?), those individuals that never used the substance may show up as missing data. Since they have never used marijuana, we can assume that their answer is “no”, they have never smoked marijuana daily. This would need to be explicitly recoded. Note that we commonly code a no as 0 and a yes as 1.
> title_of_data_set$VAR1[is.na(title_of_data_set$VAR1)] <- 0
3. Need to collapse response categories
If a variable has many response categories, it can be difficult to interpret the statistical analyses in which it is used. Alternatively, there may be too few subjects or observations identified by one or more response categories to allow for a successful analysis. In these cases, you would need to collapse across categories. Consider the variable S1Q6A
from the data frame NESARC
which has 14 levels that record the highest level of education of the participant. To collapse the categories into a dichotomous variable that indicates the presence of a high school degree, use the ifelse
function. The levels 1
, 2
, 3
, 4
, 5
, 6
, and 7
of the variable S1Q6A
correspond to education levels less than completing high school.
> library(PDS)
> NESARC$HS_DEGREE <- factor(ifelse(NESARC$S1Q6A %in% c("1", "2", "3", "4", "5", "6", "7"), "No", "Yes"))
> summary(NESARC$HS_DEGREE)
No Yes
7849 35244
4. Need to aggregate variables
In many cases, you will want to combine multiple variables into one. Consider creating create a new variable DepressLife
which is Yes
if the variable MAJORLIFE
or DYSLIFE
is a 1 (data frame NESARC
).
> NESARC$DepressLife <- factor(ifelse( (NESARC$MAJORDEPLIFE == 1 | NESARC$DYSLIFE == 1), "Yes", "No"))
> summary(NESARC$DepressLife)
No Yes
34894 8199
5. Need to create continuous variables
If you are working with a number of items that represent a single construct, it may be useful to create a composite variable/score. For example, I want to use a list of nicotine dependence symptoms meant to address the presence or absence of nicotine dependence (e.g. tolerance, withdrawal, craving, etc.). Rather than using a dichotomous variable (i.e. nicotine dependence present/absent), I want to examine the construct as a dimensional scale (i.e. number of nicotine dependence symptoms). In this case, I would want to recode each symptom variable so that yes=1
and no=0
and then sum the items so that they represent one composite score.
> nd_sum <- title_of_data_set$nd_symptom1 + title_of_data_set$nd_symptom2 + title_of_data_set$nd_symptom3
> title_of_data_set$nd_sum <- nd_sum
6. Labeling variable responses/values
Given that nominal and ordinal variables have, or are given numeric response values (i.e. dummy codes), it can be useful to label those values so that the labels are displayed in your output.
> levels(title_of_data_set$VARIABLE) <- c("value", "value")
7. Need to further subset the sample
When using large data sets, it is often necessary to subset the data so that you are including only those observations that can assist in answering your particular research question. In these cases, you may want to select your own sample from within the survey’s sampling frame. For example, if you are interested in identifying demographic predictors of depression among Type II diabetes patients, you would plan to subset the data to subjects endorsing Type II Diabetes.
> title_of_subsetted_data <- title_of_data_set["diabetes2" == 1, ]
> # OR using dplyr
> library(dplyr)
> title_of_subsetted_data <- filter(title_of_data_set, "diabetes2" == 1)
Three different approaches to subsetting data will be illustrated. The first approach is to use the dplyr
function filter
; the second approach is to use indices; and the third approach is to use the function subset
. Consider creating a subset of the NESARC
data set where a person indicates
- He/she has smoked over 100 cigarettes (
S3AQ1A == 1
) - He/she has smoked in the past year (
CHECK321 == 1
) - He/she has typically smoked every day over the last year (
S3AQ3B1 == 1
) - He/she is less than or equal to 25 years old (
AGE <= 25
)
The first approach uses the filter
function with the %>%
function. Although it is not a requirement, the data frame NESARC
is converted to a data frame tbl
per the advice given in the dplyr
vignette.
library(PDS)
library(dplyr)
NESARCsub1 <- tbl_df(NESARC) %>%
filter(S3AQ1A == 1 & CHECK321 == 1 & S3AQ3B1 == 1 & AGE <= 25)
dim(NESARCsub1)
[1] 1320 3010
The second approach uses standard indexing.
NESARCsub2 <- NESARC[NESARC$S3AQ1A == 1 & NESARC$CHECK321 == 1 &
NESARC$S3AQ3B1 == 1 & NESARC$AGE <= 25, ]
dim(NESARCsub2)
[1] 1320 3010
The third approach uses the subset
function.
NESARCsub3 <- subset(NESARC, subset = S3AQ1A == 1 & CHECK321 == 1 &
S3AQ3B1 == 1 & AGE <= 25)
dim(NESARCsub3)
[1] 1320 3010
NOTE: Often, you will need to create groups or sub-samples from the data set for the purpose of making comparisons. It is important to be certain that the groups that you would like to compare are of adequate size and number. For example, if you were interested in comparing complications of depression in parents who had lost a child through miscarriage vs. parents who had lost a child in the first year of life, it would be important to have large enough groups of each. It would not be appropriate to attempt to compare 5000 observations in the miscarriage group to only 9 observations in the first year group.
Extended Example
Use the data frame NESARC
and create a new variable (NumberNicotineSymptoms
) that is the sum of all of the nicotine dependence symptoms where a person indicates
- He/she has smoked over 100 cigarettes (
S3AQ1A == 1
) - He/she has smoked in the past year (
CHECK321 == 1
) - He/she has typically smoked every day over the last year (
S3AQ3B1 == 1
) - He/she is less than or equal to 25 years old (
AGE <= 25
)
The following code selects the 67 variables that deal with nicotine dependence using the select
and contains
functions from dplyr
. The ifelse
function is used in the myfix
function to convert values of 2 and 9 to 0.
library(PDS)
library(dplyr)
DF <- tbl_df(NESARC) %>%
filter(S3AQ1A ==1 & S3AQ3B1 == 1 & CHECK321 == 1 & AGE <= 25) %>%
select(contains("S3AQ8"))
myfix <- function(x){ifelse(x %in% c(2, 9), 0, ifelse(x == 1, 1, NA))}
DF2 <- as.data.frame(apply(DF, 2, myfix))
DF2$NumberNicotineSymptoms <- apply(DF2, 1, sum, na.rm = TRUE)
nesarc <- tbl_df(NESARC) %>%
filter(S3AQ1A ==1 & S3AQ3B1 == 1 & CHECK321 == 1 & AGE <= 25) %>%
rename(Ethnicity = ETHRACE2A, Age = AGE, MajorDepression = MAJORDEPLIFE,
Sex = SEX, TobaccoDependence = TAB12MDX, DailyCigsSmoked = S3AQ3C1,
AlcoholAD = ALCABDEPP12DX) %>%
select(Ethnicity, Age, MajorDepression, TobaccoDependence, DailyCigsSmoked, Sex, AlcoholAD)
nesarc <- data.frame(nesarc, NumberNicotineSymptoms = DF2$NumberNicotineSymptoms)
nesarc <- tbl_df(nesarc)
# Code 99 properly
nesarc$DailyCigsSmoked[nesarc$DailyCigsSmoked == 99] <- NA
# Create smoking categories
nesarc$DCScat <- cut(nesarc$DailyCigsSmoked, breaks = c(0, 5, 10, 15, 20, 98), include.lowest = FALSE)
# Label factors
nesarc$Ethnicity <- factor(nesarc$Ethnicity,
labels = c("Caucasian", "African American",
"Native American", "Asian", "Hispanic"))
nesarc$TobaccoDependence <- factor(nesarc$TobaccoDependence,
labels = c("No Nicotine Dependence",
"Nicotine Dependence"))
nesarc$Sex <- factor(nesarc$Sex, labels =c("Female", "Male"))
nesarc$MajorDepression <- factor(nesarc$MajorDepression,
labels =c("No Depression", "Yes Depression"))
nesarc$AlcoholAD <- factor(nesarc$AlcoholAD, labels = c("No Alcohol", "Alcohol Abuse", "Alcohol Dependence", "Alcohol Abuse and Dependence"))
#
dim(nesarc)
[1] 1320 9
See the PDS
vignette for additional examples.
Data Management Assignment
Push to your private repository on GitHub:
- A program that manages your data.
- Output that displays 3 of your secondary (i.e. data managed) variables as frequency tables.