One Qualitative Predictor

Author

Most material from Probability and Statistics with R, Second Edition

Published

Last modified on September 17, 2024 08:54:45 Eastern Daylight Time

Qualitative Predictors

The simplest situation where dummy variables might be used in a regression model is when the qualitative predictor has only two levels. The regression model for a single quantitative predictor \((x_1)\) and a dummy variable \((D_1)\) is written

\[\begin{equation} Y = \beta_0 + \beta_1 x_1 + \beta_2 D_1 + \beta_3 x_1 D_1 + \varepsilon \end{equation}\] where \[\begin{equation*} D_1 = \begin{cases} 0 &\text{for the first level}\\ 1 &\text{for the second level.} \end{cases} \end{equation*}\]

The model above when \(D_1\) has two levels will yield one of four possible scenarios, as shown in Figure 1. This type of model requires the user to answer three basic questions:

Are the lines the same?
Are the slopes the same?
Are the intercepts the same?

Figure 1: Four possible results for a single dummy variable with two levels. Graph I has the intercept and the slope the same for both levels of the dummy variable. Graph II has the two lines with the same slope, but different intercepts. Graph III shows the two fitted lines with the same intercept but different slopes. Graph IV shows the two lines with different intercepts and different slopes.

To address whether the lines are the same, the null hypothesis \(H_0: \beta_2 = \beta_3 = 0\) must be tested. One way to perform the test is to use the general linear test statistic based on the full model and the reduced model \(Y = \beta_0 + \beta_1 x_1 + \varepsilon\). If the null hypothesis is not rejected, the interpretation is that there is one line present (the intercept and the slope are the same for both levels of the dummy variable). This is the case for graph I of Figure 1. If the null hypothesis is rejected, either the slopes, the intercepts, or possibly both the slope and the intercept are different for the different levels of the dummy variable, as seen in graphs II, III, and IV of Figure 1, respectively.

To answer whether the slopes are the same, the null hypothesis \(H_0: \beta_3 = 0\) must be tested. If the null hypothesis is not rejected, the two lines have the same slope, but different intercepts, as shown in graph II of Figure 1. The two parallel lines that result when \(\beta_3 = 0\) are

\[\begin{equation*} Y = \beta_0 + \beta_1 x_1 + \varepsilon \text{ for } (D_1 = 0) \quad\text{and}\quad Y = (\beta_0 +\beta_2) + \beta_1 x_1 + \varepsilon \text{ for } (D_1 = 1). \end{equation*}\]

When \(H_0: \beta_3=0\) is rejected, one concludes that the two fitted lines are not parallel as in graphs III and IV of Figure 1.

To answer whether the intercepts are the same, the null hypothesis \(H_0: \beta_2 = 0\) for the full model must be tested. The reduced model for this test is \(Y = \beta_0 + \beta_1 x_1 + \beta_3 x_1 D_1 + \varepsilon\). If the null hypothesis is not rejected, the two fitted lines have the same intercept but different slopes:

\[\begin{equation*} Y = \beta_0 + \beta_1 x_1 + \varepsilon \text{ for } (D_1 = 0) \quad\text{and}\quad Y = \beta_0 + (\beta_1 +\beta_3) x_1 + \varepsilon \text{ for } (D_1 = 1). \end{equation*}\]

Graph III of Figure 1 represents this situation. If the null hypothesis is rejected, one concludes that the two lines have different intercepts, as in graphs II and IV of Figure 1.

Example

Suppose a realtor wants to model the appraised price of an apartment as a function of the predictors living area (in m\(^2\)) and the presence or absence of elevators. Consider the data frameVIT2005, which contains data about apartments in Vitoria, Spain, including totalprice, area, and elevator, which are the appraised apartment value in Euros, living space in square meters, and the absence or presence of at least one elevator in the building, respectively. The realtor first wants to know if there is any relationship between appraised price \((Y)\) and living area \((x_1)\). Next, the realtor wants to know how adding a dummy variable for whether or not an elevator is present changes the relationship: Are the lines the same? Are the slopes the same? Are the intercepts the same?

Solution (is there a realationship between `totalprice` and `area`?):

R Code

Code

library(tidyverse)
library(PASWR2)
VIT2005 <- VIT2005 |> 
  mutate(elevator = factor(elevator, labels = c("No", "Yes")))
mod_simple <- lm(totalprice ~ area, data = VIT2005)
summary(mod_simple)


Call:
lm(formula = totalprice ~ area, data = VIT2005)

Residuals:
    Min      1Q  Median      3Q     Max 
-156126  -21564   -2155   19493  120674 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  40822.4    12170.1   3.354  0.00094 ***
area          2704.8      133.6  20.243  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40810 on 216 degrees of freedom
Multiple R-squared:  0.6548,    Adjusted R-squared:  0.6532 
F-statistic: 409.8 on 1 and 216 DF,  p-value: < 2.2e-16

Code

library(moderndive)
get_regression_table(mod_simple) |> 
  kable()

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
intercept	40822.416	12170.092	3.354	0.001	16835.075	64809.757
area	2704.751	133.616	20.243	0.000	2441.393	2968.109

Code

ggplot(data = VIT2005, aes(x = area, y = totalprice)) +
  geom_point() +
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_y_continuous(labels = label_currency(prefix = "€")) +
  scale_x_continuous(labels = label_number(suffix = " square meters")) +
  labs(x = "Apartment Area", y = "Appraised Price")

Figure 2: Scatterplot of `totalprice` versus `area` with the fitted regression line superimposed from `mod_simple`

A linear regression model of the form

\[\begin{equation} Y= \beta_0 +\beta_1 x_1 + \varepsilon \end{equation}\]

is fit yielding

\[\begin{equation*} \widehat{Y}_i = 40,822 + 2,705 x_{i1} \end{equation*}\]

and a scatterplot of totalprice versus area with the fitted regression line superimposed over the scatterplot is shown in Figure 2.

Based on Figure 2, there appears to be a linear relationship between appraised price and living area. Further, this relationship is statistically significant, as the p-value for testing \(H_0: \beta_1=0\) versus \(H_1: \beta_1 \neq 0\) is less than \(2 \times 10^{-16}\).

Solution (does adding a dummy variable (`elevator`) change the relationship?):

The regression model including the dummy variable for elevator is written \[\begin{equation} Y = \beta_0 + \beta_1 x_1 + \beta_2 D_1 + \beta_3 x_1 D_1 + \varepsilon \end{equation}\] where \[\begin{equation*} D_1 = \begin{cases} 0 &\text{when a building has no elevators}\\ 1 &\text{when a building has at least one elevator.} \end{cases} \end{equation*}\]

To determine if the lines are the same (which means that the linear relationship between appraised price and living area is the same for apartments with and without elevators), the hypotheses are

\[\begin{equation*} H_0: \beta_2 = \beta_3 = 0 \text{ versus } H_1: \text{at least one } \beta_i \neq 0 \text{ for } i=2, 3. \end{equation*}\]

R Code

Code

mod_full <- lm(totalprice ~ area + elevator + area:elevator, data = VIT2005)
anova(mod_simple, mod_full)  # compare models

Analysis of Variance Table

Model 1: totalprice ~ area
Model 2: totalprice ~ area + elevator + area:elevator
  Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
1    216 3.5970e+11                                  
2    214 3.0267e+11  2 5.704e+10 20.165 9.478e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this problem, one may conclude that at least one of \(\beta_2\) and \(\beta_3\) is not zero since the p-value = \(9.4780144\times 10^{-9}\). In other words, the lines have either different intercepts, different slopes, or different intercepts and slopes.

To see if the lines have the same slopes (which means that the presence of an elevator adds constant value over all possible living areas), the hypotheses are \[\begin{equation*} H_0: \beta_3 = 0 \text{ versus } H_1: \beta_3 \neq 0. \end{equation*}\]

R Code

Code

anova(mod_full)

Analysis of Variance Table

Response: totalprice
               Df     Sum Sq    Mean Sq  F value    Pr(>F)    
area            1 6.8239e+11 6.8239e+11 482.4846 < 2.2e-16 ***
elevator        1 4.5308e+10 4.5308e+10  32.0352  4.83e-08 ***
area:elevator   1 1.1732e+10 1.1732e+10   8.2949   0.00438 ** 
Residuals     214 3.0267e+11 1.4143e+09                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the p-value = \(0.0043797\), it may be concluded that \(\beta_3 \neq 0\), which implies that the lines are not parallel.

To test for equal intercepts (which means that appraised price with and without elevators starts at the same value), the hypotheses to be evaluated are \[\begin{equation*} H_0: \beta_2 = 0 \text{ versus }H_1: \beta_2 \neq 0. \end{equation*}\]

R Code

Code

mod_full <- lm(totalprice ~ area + elevator + area:elevator, data = VIT2005)
mod_inter <- lm(totalprice ~ area + area:elevator, data = VIT2005)
# compare models
anova(mod_inter, mod_full)

Analysis of Variance Table

Model 1: totalprice ~ area + area:elevator
Model 2: totalprice ~ area + elevator + area:elevator
  Res.Df        RSS Df  Sum of Sq      F Pr(>F)
1    215 3.0624e+11                            
2    214 3.0267e+11  1 3576497188 2.5288 0.1133

Since the p-value for testing the null hypothesis is 0.1132635, one fails to reject \(H_0\) and should conclude that the two lines have the same intercept but different slopes.

R Code

Code

summary(mod_inter)


Call:
lm(formula = totalprice ~ area + area:elevator, data = VIT2005)

Residuals:
    Min      1Q  Median      3Q     Max 
-125093  -21762   -2201   18117  112252 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      71352.08   12309.18   5.797 2.39e-08 ***
area              1897.94     180.59  10.510  < 2e-16 ***
area:elevatorYes   553.99      90.42   6.127 4.23e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 37740 on 215 degrees of freedom
Multiple R-squared:  0.7061,    Adjusted R-squared:  0.7034 
F-statistic: 258.3 on 2 and 215 DF,  p-value: < 2.2e-16

Code

coef(mod_inter)

     (Intercept)             area area:elevatorYes 
      71352.0844        1897.9368         553.9856

Code

b0 <- coef(mod_inter)[1]
b1NO <- coef(mod_inter)[2]
b1YES <- coef(mod_inter)[2] + coef(mod_inter)[3]
c(b0, b1NO, b1YES)

(Intercept)        area        area 
  71352.084    1897.937    2451.922

The fitted model is \(\widehat{Y}_i = 71,352 + 1,898x_{i1} + 554x_{i1}D_{i1}\), and the fitted regression lines for the two values of \(D_1\) are shown in Figure 3. The fitted model using the same intercept with different slopes has an \(R^2_a\) of 0.7033949, a modest improvement over the model without the variable elevator, which had an \(R^2_a\) value of 0.6532269.

Code

ggplot(data = VIT2005, aes(x = area, y = totalprice, color = elevator)) +
  geom_point(alpha = 0.5) +
  theme_bw() +
  geom_abline(intercept = b0,
              slope = b1NO,
              color = "red") +
  geom_abline(intercept = b0,
              slope = b1YES,
              color = "blue") +
  scale_color_manual(values = c("red", "blue")) +
  labs(x = "Apartment Area", 
       y = "Appraised Price",
       color = "Elevator") +
  scale_y_continuous(labels = label_currency(prefix = "€"),
                     limits = c(50000, 500000)) +
  scale_x_continuous(labels = label_number(suffix = " square meters"),
                     limits = c(10, 200))

Figure 3: Fitted regression lines for `mod_inter`

Diagnostics

R Code

Code

MDF <- get_regression_points(mod_inter)
ggplot(data = MDF, aes(x = totalprice_hat, y = residual)) +
  geom_point() +
  theme_bw() +
  labs(title = "Residuals versus Fitted Values") +
  geom_hline(yintercept = 0, lty = "dashed")

Code

ggplot(data = MDF, aes(x = residual)) +
  geom_histogram(fill = "lightblue", color = "blue") +
  theme_bw()

Code

ggplot(data = MDF, aes(sample = residual)) +
  geom_qq() +
  geom_qq_line() +
  theme_bw()

Example

Consider the MA_schools data frame from the moderndive package which contains 2017 data on Massachusetts public high schools provided by the Massachusetts Department of Education. Consider a model with SAT math scores (average_sat_math) modeled as a function of percentage of the high school’s student body that are economically disadvantaged (perc_disadvan) and the a categorical variable measuring school size (size).

Solution (is there a relationship between `average_sat_math` and `perc_disadvan`?):

R Code

Code

ggplot(data = MA_schools, 
       aes(x = perc_disadvan, y = average_sat_math)) + 
  geom_point() +
  theme_bw() + 
  geom_smooth(method = "lm", se = FALSE)

Code

mod_simple <- lm(average_sat_math ~ perc_disadvan, 
                 data = MA_schools)
summary(mod_simple)


Call:
lm(formula = average_sat_math ~ perc_disadvan, data = MA_schools)

Residuals:
   Min     1Q Median     3Q    Max 
-80.74 -21.26  -4.12  18.54 174.17 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   581.2811     3.2668   177.9   <2e-16 ***
perc_disadvan  -2.7798     0.1011   -27.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33.54 on 330 degrees of freedom
Multiple R-squared:  0.6962,    Adjusted R-squared:  0.6953 
F-statistic: 756.2 on 1 and 330 DF,  p-value: < 2.2e-16

Code

get_regression_table(mod_simple) |> 
  kable()

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
intercept	581.281	3.267	177.936	0	574.855	587.707
perc_disadvan	-2.780	0.101	-27.499	0	-2.979	-2.581

You complete the rest….

Solution (does adding a dummy variable `size` change the relationship?):

R Code

Code

ggplot(data = MA_schools, 
       aes(x = perc_disadvan, y = average_sat_math, color = size)) + 
  geom_point() +
  theme_bw() + 
  geom_smooth(method = "lm", se = FALSE) -> p1
ggplot(data = MA_schools, 
       aes(x = perc_disadvan, y = average_sat_math, color = size)) + 
  geom_point() +
  theme_bw() + 
  geom_parallel_slopes(se = FALSE) -> p2
library(patchwork)
p1 + p2

Code

mod_full <- lm(lm(average_sat_math ~ perc_disadvan + size + perc_disadvan:size, data = MA_schools))
anova(mod_simple, mod_full) |> 
  kable()

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
330	371190.9	NA	NA	NA	NA
326	367669.4	4	3521.459	0.7805895	0.5384385

Code

anova(mod_full) |> 
  kable()

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
perc_disadvan	1	850614.7701	850614.7701	754.2112099	0.0000000
size	2	3132.6259	1566.3130	1.3887965	0.2508467
perc_disadvan:size	2	388.8332	194.4166	0.1723826	0.8417338
Residuals	326	367669.4425	1127.8204	NA	NA

Simpson’s Paradox

R Code

Code

library(ISLR)
credit_paradox <- Credit |> 
  select(ID, debt = Balance, credit_limit = Limit, 
         credit_rating = Rating, income = Income, age = Age)
ggplot(data = credit_paradox, aes(x = credit_limit, y = debt)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  theme_bw() -> p1
ggplot(data = credit_paradox, aes(x = income, y = debt)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  theme_bw() -> p2
library(patchwork)
p1 + p2

Code

library(plotly)
p <- plot_ly(data = credit_paradox, z = ~debt, x = ~credit_limit, y = ~income) |> 
  add_markers()
p

Code

mod <- lm(debt ~ credit_limit + income, data = credit_paradox)
summary(mod)$coef

                 Estimate   Std. Error   t value      Pr(>|t|)
(Intercept)  -385.1792604 19.464801525 -19.78850  3.878764e-61
credit_limit    0.2643216  0.005879729  44.95471 7.717386e-158
income         -7.6633230  0.385072058 -19.90101  1.260933e-61

Code

x <- seq(min(credit_paradox$credit_limit), max(credit_paradox$credit_limit), length = 70)
y <- seq(min(credit_paradox$income), max(credit_paradox$income), length = 70) 
plane <- outer(x, y, function(a, b){coef(mod)[1] + coef(mod)[2]*a + coef(mod)[3]*b})
# draw the plane
p |> 
  add_surface(x = ~x, y = ~y, z = ~plane)

Code

qs <- quantile(credit_paradox$credit_limit, probs = seq(0, 1, .25))
# credit_paradox$credit_cats <- cut(credit_paradox$credit_limit, breaks = qs, include.lowest = TRUE)
############### Or above
credit_paradox <- credit_paradox |> 
  mutate(credit_cats = cut(credit_limit, breaks = qs, include.lowest = TRUE))
head(credit_paradox)

  ID debt credit_limit credit_rating  income age         credit_cats
1  1  333         3606           283  14.891  34 (3.09e+03,4.62e+03]
2  2  903         6645           483 106.025  82 (5.87e+03,1.39e+04]
3  3  580         7075           514 104.593  71 (5.87e+03,1.39e+04]
4  4  964         9504           681 148.924  36 (5.87e+03,1.39e+04]
5  5  331         4897           357  55.882  68 (4.62e+03,5.87e+03]
6  6 1151         8047           569  80.180  77 (5.87e+03,1.39e+04]

Code

ggplot(data = credit_paradox, aes(x = credit_limit)) +
  geom_density(fill = "pink", color = "black") + 
  geom_vline(xintercept = qs, color = "blue") + 
  theme_bw()

Code

credit_paradox |> 
  group_by(credit_cats) |> 
  summarize(n())

# A tibble: 4 × 2
  credit_cats         `n()`
  <fct>               <int>
1 [855,3.09e+03]        100
2 (3.09e+03,4.62e+03]   100
3 (4.62e+03,5.87e+03]   100
4 (5.87e+03,1.39e+04]   100

Code

p1 <- ggplot(data = credit_paradox, aes(x = income, y = debt)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) + 
  theme_bw() + 
  labs(y = "Credit card debt (in $)",
       x = "Income (in $1000)")
p2 <- ggplot(data = credit_paradox, aes(x = income, y = debt, color = credit_cats)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) + 
  theme_bw() + 
  labs(y = "Credit card debt (in $)",
       x = "Income (in $1000)",
       color = "Credit limit bracket")
p1 + p2

Relationship between credit card debt and income by credit limit bracket

Other Formats

Qualitative Predictors

Example

Solution (is there a realationship between totalprice and area?):

Solution (does adding a dummy variable (elevator) change the relationship?):

Diagnostics

Example

Solution (is there a relationship between average_sat_math and perc_disadvan?):

Solution (does adding a dummy variable size change the relationship?):

Simpson’s Paradox

Solution (is there a realationship between `totalprice` and `area`?):

Solution (does adding a dummy variable (`elevator`) change the relationship?):

Solution (is there a relationship between `average_sat_math` and `perc_disadvan`?):

Solution (does adding a dummy variable `size` change the relationship?):