This book is Work in Progress. I appreciate your feedback to make the book better.

Chapter 12 Interaction Models

12.1 Motivation

A sports doctor routinely measures the muscle percentages of his clients. He also asks them how many hours per week they typically spend on training. Our doctor suspects that clients who train more are also more muscled. Furthermore, he thinks that the effect of training on muscularity declines with age. In multiple regression analysis, this is known as a moderation or interaction effect illustrated in the figure below.

In mathematical terms an interaction is an additional multiplication term:

\[ muscle = training + age + training*age \]

12.2 Data & Sample

In the following we further explore the aforementioned income-age relationship and how the relationship itself may be influenced by gender and other characteristics.

library(haven)
master <- read_dta("https://github.com/MarcoKuehne/marcokuehne.github.io/blob/main/data/SOEP/practice_en/practice_dataset_eng.dta?raw=true")

# The data comes with Stata labels that do not work with all tidyverse commands
library(sjlabelled)
soep <- remove_all_labels(master)

# Rename German to English variable names
library(tidyverse)
soep <- soep %>% 
  rename("Age" = "alter",
         "Income" = "einkommenj1",
         "NACE2" = "branche",
         "Persons in HH" = "anz_pers", 
         "Num_Kids" = "anz_kind",
         "Education" = "bildung",
         "Health" = "gesund_org", 
         "Satisfaction" = "lebensz_org",
         "Emp" = "erwerb") 

# Modify some variables
soep <- soep %>% 
  mutate(Single = ifelse(`Persons in HH` == 1, 1, 0),
         Kids = ifelse(Num_Kids == 0, 0, 1),
         Female = sex) %>% 
  mutate(Industry = case_when(NACE2 %in% c(1,2,3) ~ "Agriculture",
                              NACE2 %in% c(5:9) ~ "Mining",
                              NACE2 %in% c(10:32) ~ "Manufacturing",
                              NACE2 %in% c(35:38) ~ "Energy",
                              NACE2 %in% c(41:43) ~ "Construction",
                              NACE2 %in% c(50,51,52,55) ~ "Trade",
                              NACE2 %in% c(60,61,62,63,64) ~ "Transport",
                              NACE2 %in% c(65,66,67) ~ "Banking",
                              NACE2 %in%  c(70,71,72,73,74,75,80,85,90,91,92,93,95,98,99) ~ "Services",
                              TRUE ~ "Other"))

# Round annual income to two digits
soep <- soep %>% mutate(Income = round(Income, 2))

# Build the estimation sample based on the topic
soep <- soep %>% 
  filter(Emp == 1) %>% 
  filter(Age <= 65) %>% 
  filter(Income > 1) %>% 
  filter(syear == 2019)

# Conduct a complete case analysis
soep <- soep %>% filter(complete.cases(.))

The datasummary() function from modelsummary package provides a good overview for clean, selected variables.

library(modelsummary)
soep %>%
  dplyr::select(Income, Age, Kids, Female, Satisfaction) %>%
  datasummary_skim()
Unique (#) Missing (%) Mean SD Min Median Max
Income 1211 0 42145.2 23488.8 915.5 37337.3 257886.2
Age 47 0 45.4 11.2 18.0 47.0 65.0
Kids 2 0 0.4 0.5 0.0 0.0 1.0
Female 2 0 0.4 0.5 0.0 0.0 1.0
Satisfaction 11 0 7.6 1.4 0.0 8.0 10.0

The average annual income in this sample is 42145. Notice that the income is left skewed whereas life satisfaction is skewed to the right. As expected, most people earn a low or average income and there are several extreme high incomes at the right. On the other hand, most people are on average pretty satisfied with their lifes in general. The mean of a dummy variables is the share of people with the characteristic coded with 1, i.e. about 40 % of the people have kids. In the following, binary dummy variables are coded as factors.

# Explicitly code factor variables
soep <- soep %>% 
  mutate(Female = factor(Female),
         Kids = factor(Kids))

12.3 Throwback Parallel Slopes

Simple regression is very similar to the correlation between two variables. A multiple regression with one continuous and one dummy variable is called a parallel slopes model. The data looks like this:

Conclusion from Parallel Slopes: The fundamental truth is that the effect of age on income is identical across gender (i.e. for males and females). Males and female are equipped (by nature, for some reason) with a different starting annual income on average. -- Convinced?

12.4 Regression with Moderators

In the following we consider all possible kinds of two-way interactions (between two variables). Higher-order interaction might be useful under special conditions, e.g. in a study where the effect of a new drug depends on the patient's age and gender.

An interaction can be coded implicitly or explicitly in R.

# implicit coding, that automatically extends
lm1 <- lm(mpg ~ vs*am, data=mtcars)
# explicit coding, each single term and their interaction
lm2 <- lm(mpg ~ vs + am + vs:am, data=mtcars)
# only use the interaction term?
lm3 <- lm(mpg ~ vs:am, data=mtcars)

modelsummary(title = 'Coding Interaction in R.',
             list("Version 1" = lm1, 
                  "Version 2" = lm2,
                  "Version 3" = lm3), 
             gof_omit = 'R2|AIC|BIC|RMSE|Log.Lik.|F')
Table 12.1: Coding Interaction in R.
 Version 1  Version 2  Version 3
(Intercept) 15.050 15.050 17.772
(1.002) (1.002) (0.826)
vs 5.693 5.693
(1.651) (1.651)
am 4.700 4.700
(1.736) (1.736)
vs × am 2.929 2.929 10.599
(2.541) (2.541) (1.766)
Num.Obs. 32 32 32

The single terms vs and am are sometimes called the main effects.

12.4.1 Dummy * Dummy

We estimate how income depends on having kids and being female.

interact1 <- lm(Income ~ Kids*Female, data=soep)
modelsummary(title = 'Interaction Model 1.',
             list("Income" = interact1), 
             gof_omit = 'R2|AIC|BIC|RMSE|Log.Lik.|F')
Table 12.2: Interaction Model 1.
 Income
(Intercept) 43476.056
(1128.318)
Kids1 4329.056
(1650.629)
Female1 -6131.395
(1727.116)
Kids1 × Female1 -8708.424
(2954.627)
Num.Obs. 1211

We plot the interaction with via cat_plot() from the interactions package. This interaction dedicated package makes visualization easy.

library(interactions)
cat_plot(interact1, pred = Kids, modx = Female)

This graph suggests that men earn more than women. Men with kids earn more than men without kids. The kids-effect is reversed for women, which report higher incomes when they do not have kids.

12.4.2 Categorical * Dummy

We estimate how income depends on working in different industries and being female.

interact2 <- lm(Income ~ Industry*Female, data=soep)

library(broom)
library(knitr)

# tidy(interact2) %>% 
#   select(term, estimate, std.error) %>% 
#   #filter(! is.na(estimate)) %>% 
#   kable(digits=3)

# modelsummary(title = 'Interaction Model 2.',
#              list("Income" = interact2), 
#              gof_omit = 'R2|AIC|BIC|RMSE|Log.Lik.|F')

parts = 2 ## multi-column parts to display alongside

tidy(interact2) |>
  dplyr::select(term, coefficient = estimate) |>
  mutate(part = rep(1:parts, each = ceiling(n()/parts), length.out = n()),
         row_index = rep(1:ceiling(n()/parts), length.out = n())
         ) |>
  split(~ part) |>
  Reduce(f = \(x, y) x |> select(-part) |> left_join(y, by = 'row_index')) |>
  dplyr::select(-c(row_index, part)) |>
  kable()
term.x coefficient.x term.y coefficient.y
(Intercept) 23513.113 Female1 -549.8783
IndustryBanking 46946.525 IndustryBanking:Female1 -3508.2902
IndustryConstruction 8443.728 IndustryConstruction:Female1 -4093.7365
IndustryEnergy 21340.939 IndustryEnergy:Female1 9357.9811
IndustryManufacturing 24884.761 IndustryManufacturing:Female1 -11453.9382
IndustryMining 31740.912 IndustryMining:Female1 -130.2967
IndustryOther 18604.883 IndustryOther:Female1 -8681.7468
IndustryServices 31579.784 IndustryServices:Female1 -11550.5649
IndustryTrade 14766.445 IndustryTrade:Female1 -8405.7213
IndustryTransport 39986.219 IndustryTransport:Female1 -12060.6618
# The function can be modified by ggplot commands.
cat_plot(interact2, pred = Industry, modx = Female) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

12.4.3 Continuous * Dummy

Note that R automatically adds main effects if you use multiplication operator:

Thus \[ height * female \]

translates to

\[ height + female + height*female \]

interact3 <- lm(Income ~ Age*Female, data=soep)
modelsummary(title = 'Interaction Model 2.',
             list("Income" = interact3), 
             gof_omit = 'R2|AIC|BIC|RMSE|Log.Lik.|F')
Table 12.3: Interaction Model 2.
 Income
(Intercept) 17367.198
(3438.645)
Age 612.892
(72.860)
Female1 4649.196
(5506.496)
Age × Female1 -296.045
(118.698)
Num.Obs. 1211

The visualization for non-parallel slopes looks like this:

interact_plot(interact3, pred = Age, modx = Female,
              main.title = "Each group has a different slope.")

Conclusion

The fundamental truth is that the effect of age on income is positive but not identical between males and females. Men and women at the exact same age earn different annual incomes and the gap widens over time. There is effect heterogeneity.

There are two more rather uncommon combination in which one of the main effects is dropped. Graphically, the result is two lines with same intercept and different slopes.

interact3b <- lm(Income ~ Age + Age:Female, data=soep)

interact_plot(interact3b, pred = Age, modx = Female) +
  scale_x_continuous(limits = c(0, 65)) +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE)


# modelsummary(title = 'Interaction Model 2.',
#              list("Income" = interact3b,
#                   "Income" = interact3c), 
#              gof_omit = 'R2|AIC|BIC|RMSE|Log.Lik.|F')

12.4.4 Continuous * Continuous

interact4 <- lm(Income ~ Age*Satisfaction, data=soep)
interact4
#> 
#> Call:
#> lm(formula = Income ~ Age * Satisfaction, data = soep)
#> 
#> Coefficients:
#>      (Intercept)               Age      Satisfaction  Age:Satisfaction  
#>         20923.43             59.48           -378.77             62.08

The interaction of two continuous variables is harder to interpret. There are conventions to help you choose the best values of the continuous moderator for plotting predicted values. But these conventions don't always work in every situation. For example, one convention suggested by Cohen and Cohen and popularized by Aiken and West is to use three values of the moderator: the mean, the value one standard deviation above, and the value one standard deviation below the mean. This is what interact_plot() does by default.

interact_plot(interact4, pred = Age, modx = Satisfaction)

This shows you that interaction between two continuous variables works basically the same way as for a categorical and continuous variable. An interaction says that there's not a fixed offset: you need to consider both values of x1 and x2 simultaneously in order to predict y.

You can see that even with just two continuous variables, coming up with good visualizations are hard. But that's reasonable: you shouldn't expect it will be easy to understand how three or more variables simultaneously interact! But again, we're saved a little because we're using models for exploration, and you can gradually build up your model over time. The model doesn't have to be perfect, it just has to help you reveal a little more about your data.

12.5 Model Comparison

How to decide if the model with interaction is better than the model without interaction?

library(gt)

multiple3 <- lm(Income ~ Age + Female, data=soep)
multiple4 <- lm(Income ~ Age + Satisfaction, data=soep)

modelsummary(title = 'Comparison',
             list("Income" = multiple3,
                  "Income" = interact3,
                  "Income" = multiple4,
                  "Income" = interact4), 
             gof_omit = 'R2|AIC|BIC|RMSE|Log.Lik.',
             output = "gt") %>% 
  # column labels
  tab_spanner(label = 'Interaction', columns = c(3,5)) 
Table 12.4: Comparison
Income Interaction Income
Income Income
(Intercept) 22487.063 17367.198 20923.425 -230.558
(2764.643) (3438.645) (13829.780) (4385.352)
Age 501.348 612.892 59.480 527.982
(57.643) (72.860) (296.198) (57.826)
Female1 -8670.282 4649.196
(1345.244) (5506.496)
Age × Female1 -296.045
(118.698)
Satisfaction -378.773 2423.054
(1794.752) (450.609)
Age × Satisfaction 62.082
(38.496)
Num.Obs. 1211 1211 1211 1211
F 62.338 43.812 37.987 55.606