8 Graphing: One Variable at a Time

One Categorical Variable


Consider the data frame EPIDURALF from the PASWR2 package, which records intermediate results from a study to determine whether the traditional sitting position or the hamstring stretch position is superior for administering epidural anesthesia to pregnant women in labor, as measured by the number of obstructive (needle to bone) contacts. Four physicians took part in the study. To summarize the number of patients treated by each physician, we can use the function xtabs.

library(PASWR2)
xtabs(~doctor, data = EPIDURALF)

doctor
  A   B   C   D 
 61 115  93  73

A barplot of the number of patients treated by each physician (doctor) using ggplot2 is constructed below.

library(ggplot2)
ggplot(data = EPIDURALF, aes(x = doctor)) +
  geom_bar(fill = "lightblue") + 
  theme_bw()

Here is some information that would be interesting to get from these data:

What percentage of the patients were treated by each physician?

prop.table(xtabs(~doctor, data = EPIDURALF))

doctor
        A         B         C         D 
0.1783626 0.3362573 0.2719298 0.2134503
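The output of prop.table is a set of proportions, while the question asks for percentages. A minimal sketch of the conversion follows; to keep it self-contained, the counts are typed in from the xtabs output shown earlier rather than read from EPIDURALF, and the names counts and percentages are chosen here for illustration.

```r
# Patient counts per physician, copied from the xtabs output above
counts <- c(A = 61, B = 115, C = 93, D = 73)

# prop.table() divides each count by the total (342);
# multiplying by 100 and rounding gives percentages
percentages <- round(100 * prop.table(counts), 1)
percentages
#    A    B    C    D 
# 17.8 33.6 27.2 21.3
```

With the package loaded, the same result should follow from round(100 * prop.table(xtabs(~doctor, data = EPIDURALF)), 1).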

How are patients divided across physicians? Are they equally divided? If not, do the percentages follow some other kind of pattern?

One Quantitative Variable

We have explored the distribution of a categorical variable using a bar chart supplemented by numerical measures (percent of observations in each category). In this section, we will learn how to display the distribution of a quantitative variable.

To display data from one quantitative variable graphically, we typically use the histogram.

Example⁵

Break the following range of values into intervals and count how many observations fall into each interval.

Exam Grades

Here are the exam grades of 15 students: 88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73

We first need to break the range of values into intervals (also called “bins” or “classes”). Since our dataset consists of exam scores, it makes sense to choose intervals that correspond to the typical range of a letter grade, 10 points wide: 40-50, 50-60, …, 90-100. Each interval includes its lower endpoint but not its upper endpoint, so a grade of 60 is counted in [60,70) rather than [50,60). Counting how many of the 15 observations fall in each interval gives the following table:

SCORE	COUNT
[40,50)	1
[50,60)	2
[60,70)	4
[70,80)	5
[80,90)	2
[90,100)	1
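The counts in the table can be reproduced in R. The sketch below is one possible approach using the base functions cut and table; the object names grades, breaks, bins, and counts are illustrative and do not come from the original text.

```r
# Exam grades of the 15 students, copied from the text above
grades <- c(88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73)

# Break points every 10 points from 40 to 100
breaks <- seq(40, 100, by = 10)

# right = FALSE makes each interval closed on the left and open on the right,
# i.e. [40,50), [50,60), ..., [90,100), matching the table
bins <- cut(grades, breaks = breaks, right = FALSE)

# Count the observations that fall in each interval
counts <- table(bins)
counts
```

Setting right = FALSE matters here: with the default right = TRUE, a grade of 60 would be counted in (50,60] instead of [60,70).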

To construct the histogram from this table, we plot the intervals on the \(X\)-axis and the number of observations in each interval (the frequency of the interval) on the \(Y\)-axis, with each frequency represented by the height of a rectangle drawn above its interval.
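The construction just described can be sketched with the base R function hist. This is only one way to draw the plot; the title, axis labels, and fill color are choices made here for illustration.

```r
# Exam grades of the 15 students, copied from the text above
grades <- c(88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73)

# breaks sets the interval boundaries; right = FALSE closes intervals on the left
# so that a grade of 60 is counted in the 60-70 bar
h <- hist(grades,
          breaks = seq(40, 100, by = 10),
          right  = FALSE,
          col    = "lightblue",
          main   = "Exam Grades",
          xlab   = "Score",
          ylab   = "Frequency")

# hist() also returns the bar heights, which can be checked against the table
h$counts
```

To stay with ggplot2 as in the bar plot earlier in the chapter, a comparable plot can be built with geom_histogram, passing the same breaks and closed = "left".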

Interpreting the Histogram

Once the distribution has been displayed graphically, we can describe the overall pattern of the distribution and mention any striking deviations from that pattern. More specifically, we should consider the following features of the distribution:

Shape
Center
Spread
Outliers

We will get a sense of the overall pattern of the data from the histogram’s center, spread, and shape, while outliers will highlight deviations from that pattern.

Shape

When describing the shape of a distribution, we should consider:

Symmetry/skewness of the distribution.
Peakedness (modality)—the number of peaks (modes) the distribution has.

We distinguish between:

⁵ https://oli.cmu.edu/jcourse/workbook/activity/page?context=434b848b80020ca600e242e2129b8f6e