This book is Work in Progress. I appreciate your feedback to make the book better.

Chapter 11 Linear Regression

11.1 What You Deserve Is What You Get

Do you believe that people get what they deserve? The statement "what you deserve is what you get" is a controversial one. It can be interpreted to mean that individuals are entirely responsible for their own outcomes and receive exactly what they deserve based on their efforts, abilities, and choices. This view assumes a meritocratic (performance-oriented) system in which everyone has an equal opportunity to succeed based on their merit, and rewards are distributed accordingly.

We investigate annual income and how it is determined. The famous Mincer Equation is a single-equation model that explains wage income as a function of schooling and experience. The equation suggests that higher levels of education and experience are positively associated with earnings, and the coefficients can be estimated with statistical methods to quantify the magnitude of these relationships. When everybody has access to schooling and equal opportunities on the labor market, better school outcomes and work experience may be determined by people's efforts, abilities, and choices.
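A common textbook formulation of the Mincer equation is:

\[\ln(\text{income}) = \beta_0 + \beta_1 \, \text{schooling} + \beta_2 \, \text{experience} + \beta_3 \, \text{experience}^2 + \varepsilon\]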

11.2 Data & Sample

We use SOEP practice data to analyse yearly income. The analysis is restricted to people who are full-time employed (Emp == 1) and of working age (Age <= 65) and who report an annual income from their main job of more than one Euro (Income > 1). We analyse the most recent cross-section of the data (syear == 2019). We drop a few cases with missing information and thus conduct a complete case analysis.

library(haven)
master <- read_dta("https://github.com/MarcoKuehne/marcokuehne.github.io/blob/main/data/SOEP/practice_en/practice_dataset_eng.dta?raw=true")

# The data comes with Stata labels that do not work with all tidyverse commands
library(sjlabelled)
soep <- remove_all_labels(master)

# Rename German to English variable names
library(tidyverse)
soep <- soep %>% 
  rename("Age" = "alter",
         "Income" = "einkommenj1",
         "NACE2" = "branche",
         "Persons in HH" = "anz_pers", 
         "Kids" = "anz_kind",
         "Education" = "bildung",
         "Health" = "gesund_org", 
         "Satisfaction" = "lebensz_org",
         "Emp" = "erwerb")

# Define the gender variable as a factor named Female (0 = male, 1 = female)
soep <- soep %>% mutate(Female = factor(sex))

# Round annual income to two digits
soep <- soep %>% mutate(Income = round(Income, 2))

# Build the estimation sample based on the topic
soep <- soep %>% 
  filter(Emp == 1) %>% 
  filter(Age <= 65) %>% 
  filter(Income > 1) %>% 
  filter(syear == 2019)

# Conduct a complete case analysis
soep <- soep %>% filter(complete.cases(.))

The estimation sample looks like this:

DT::datatable(soep[,-c(1:3,8,13:15)], rownames = FALSE)

The descriptive statistics of the sample look like this:

library(modelsummary)
datasummary_skim(soep[,-c(1,2,8,13:15)])
               Unique  Missing Pct.     Mean       SD     Min   Median       Max
sex                 2             0      0.4      0.5     0.0      0.0       1.0
Age                47             0     45.4     11.2    18.0     47.0      65.0
Persons in HH      10             0      2.9      1.4     1.0      3.0      10.0
Kids                8             0      0.7      1.1     0.0      0.0       8.0
Education          16             0     12.9      2.8     7.0     12.0      18.0
NACE2              77             0     57.0     25.9     1.0     56.0      97.0
Health              5             0      2.4      0.9     1.0      2.0       5.0
Satisfaction       11             0      7.6      1.4     0.0      8.0      10.0
Income           1211             0  42145.2  23488.8   915.5  37337.3  257886.2

Female      N      %
0         779   64.3
1         432   35.7

The data is clean and ready for analysis. There are no missing values. The working age ranges from 18 to 65 years. People have between 0 and 8 children. There are NACE codes for the job industry ranging from 1 to 97 in the data (theoretically from 1 to 99), but not all jobs are represented (77 unique values). Health status is measured on a scale from 1 to 5, whereas life satisfaction is measured on a scale from 0 to 10. The minimum annual income is about 915 Euro. We could have restricted the income variable to the minimum wage in Germany. Assume it is 12 €/h and that people in full-time employment work 8 h per day. Starting from 365 days a year, subtract 104 weekend days, 11 assumed public holidays and 30 days of vacation; 220 working days remain, earning a minimum yearly income of about 220*8*12 = 21120 Euro. In the data, 165 people report earning less than that.
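A minimal sketch of this back-of-envelope calculation in R, using the Income variable from the estimation sample:

# Minimum annual income at 12 €/h, 8 h/day and 220 working days
min_income <- 220 * 8 * 12
min_income
#> [1] 21120

# Number of people in the sample reporting less than that
sum(soep$Income < min_income)
#> [1] 165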

11.3 Data Visualization

We create a scatterplot of income (on the y-axis) versus age (on the x-axis). Data points are colored by gender (blue for men, red for women). A linear regression line is added per gender.
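One way to create such a plot with ggplot2 (a sketch; the colors and labels are choices, not taken from the original figure):

# Scatterplot of income versus age, colored by gender, one regression line per gender
ggplot(soep, aes(x = Age, y = Income, color = Female)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_manual(values = c("0" = "blue", "1" = "red"),
                     labels = c("Men", "Women"))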

11.4 Simplest Regression

The simplest regression or empty model does not contain any explanatory variable.

simplest <- lm(Income ~ 1, data=soep)

library(modelsummary)
modelsummary(title = 'Empty Model.',
             list("Income" = simplest), 
             gof_omit = 'R2|AIC|BIC|RMSE|Log.Lik.',
             coef_map = cm)
Empty Model.
Income
Constant 42145.215
(674.976)
Num.Obs. 1211

The graphs below illustrate the empty model that estimates an average annual income. The left panel shows the mean income for the entire sample (sometimes referred to as the grand mean); the right panel shows the data colored by gender (it can be considered two overlaid plots).
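The constant of the empty model is simply the sample mean of income, which can be checked directly:

# The empty model's constant equals the grand mean of annual income (about 42145.2)
coef(simplest)["(Intercept)"]
mean(soep$Income)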

11.5 Simple Regression

Simple regression suggests a one-to-one relationship between two variables. In this section we focus on the continuous outcome variable income. We relate income to three different variables: age, gender and industry. Age is another continuous variable, gender is a binary dummy, and industry is a categorical variable.

11.5.1 X is continuous

Income and age are continuous in principle, i.e. they can take any real number in an interval. In practice, age is reported in natural numbers (full years) due to measurement restrictions. In the real SOEP data, a more precise measure of age can be derived as the date difference between date of birth and date of interview, yielding a measure in day units.
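A sketch of this date-difference idea with two hypothetical dates (the practice data does not contain date of birth or interview date):

# Hypothetical dates: age measured in days, converted to (fractional) years
birth_date     <- as.Date("1980-06-15")
interview_date <- as.Date("2019-03-01")
as.numeric(difftime(interview_date, birth_date, units = "days")) / 365.25  # about 38.7 years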

In practice, a scale from 0 to 10 is often treated as continuous, although technically speaking a categorical or ordinal approach would match the nature of the measure better.

simple1 <- lm(Income ~ Age, data=soep)

modelsummary(title = 'Continuous Predictor.',
             list("Income" = simple1), 
             gof_omit = 'R2|AIC|BIC|RMSE|Log.Lik.',
             coef_map = cm)
Continuous Predictor.
Income
Age 524.554
(58.487)
Constant 18341.031
(2733.469)
Num.Obs. 1211

11.5.2 X is a dummy

A dummy or binary variable describes two groups.

In the SOEP case the gender variable is named Female and coded 0 for male and 1 for female. In practice, interpretation is eased by naming the variable after the category coded as 1 (here female), so that 1 stands for having the feature and 0 for not having it. Dummies sometimes come with other values, such as 1 and 2; such a variable can be recoded to 0 and 1, as sketched below. Last but not least, note that the SOEP and other studies still take a binary perspective on gender. We can imagine more gender groups, making it a categorical variable. There might be something like a continuous gender scale as well.15
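A sketch of such a recoding, using a small hypothetical 1/2-coded gender vector:

# Hypothetical: recode a gender variable coded 1 (male) / 2 (female) into a 0/1 dummy
gender <- c(1, 2, 2, 1)
ifelse(gender == 2, 1, 0)
#> [1] 0 1 1 0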

simple2 <- lm(Income ~ Female, data=soep)

modelsummary(title = 'Dummy Predictor.',
             list("Income" = simple2), 
             gof_omit = 'R2|AIC|BIC|RMSE|Log.Lik.',
             coef_map = cm)
Dummy Predictor.
Income
Female -9401.118
(1383.445)
Constant 45498.875
(826.288)
Num.Obs. 1211
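The dummy coefficients can be read as group means: the constant is the average income of men, and the constant plus the Female coefficient is the average income of women (45498.9 - 9401.1 = 36097.8). A quick check:

# Average annual income by gender
soep %>% 
  group_by(Female) %>% 
  summarise(mean_income = mean(Income))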

11.5.3 X is categorical

We investigate annual income across different industries.

NACE is the acronym used to designate the various statistical classifications of economic activities developed since 1970 in the European Union (EU). NACE provides the framework for collecting and presenting a large range of statistical data according to economic activity in the fields of economic statistics (e.g. business statistics, labour market, national accounts) and in other statistical domains.

In level 2 of NACE there are 88 divisions identified by two-digit numerical codes (01 to 99). In the following we build a categorical variable industry from the level-2 NACE information in the SOEP:

soep <- soep %>% 
  mutate(industry = case_when(NACE2 %in% c(1,2,3) ~ "Agriculture",
                              NACE2 %in% c(5:9) ~ "Mining",
                              NACE2 %in% c(10:32) ~ "Manufacturing",
                              NACE2 %in% c(35:38) ~ "Energy",
                              NACE2 %in% c(41:43) ~ "Construction",
                              NACE2 %in% c(50,51,52,55) ~ "Trade",
                              NACE2 %in% c(60,61,62,63,64) ~ "Transport",
                              NACE2 %in% c(65,66,67) ~ "Banking",
                              NACE2 %in% c(70,71,72,73,74,75,80,85,90,91,92,93,95,98,99) ~ "Services",
                                     TRUE ~ "Other"))

table(soep$industry)
#> 
#>   Agriculture       Banking  Construction        Energy Manufacturing 
#>            14             9            80            26           299 
#>        Mining         Other      Services         Trade     Transport 
#>             3           539           134            34            73
simple3 <- lm(Income ~ industry, data=soep)

modelsummary(title = 'Categorical Predictor.',
             list("Income" = simple3), 
             gof_omit = 'R2|AIC|BIC|RMSE|Log.Lik.',
             coef_map = cm)
Categorical Predictor.
Income
Banking 46123.264
(9569.929)
Construction 8057.920
(6489.108)
Energy 22774.586
(7425.221)
Manufacturing 22032.617
(6124.951)
Mining 31592.741
(14250.493)
Transport 36264.336
(6535.281)
Trade 11947.600
(7112.908)
Services 25698.418
(6291.360)
Other 14521.498
(6063.652)
Constant 23434.559
(5986.405)
Num.Obs. 1211

Note that with a categorical variable, one category is omitted from the output by default (the so-called base level or reference category), in this case Agriculture. All coefficients are comparisons with the average annual income in Agriculture. All coefficients are positive, indicating that average income in every other industry is higher than in Agriculture.
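The same logic can be checked against the industry means: the constant is the average income in Agriculture (about 23434.6), and each coefficient is the difference between an industry's mean income and the Agriculture mean.

# Mean annual income per industry; Agriculture is the base level
soep %>% 
  group_by(industry) %>% 
  summarise(mean_income = mean(Income)) %>% 
  arrange(desc(mean_income))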

11.5.4 X is categorical, is it?

Actually, categorical variables are split into multiple dummy variables during the estimation process.

Truly Dedicated

We demonstrate the equivalence between a linear regression with a categorical variable coded as a factor in R and the same regression with the categorical variable split into multiple dummy variables.

# All variables are internally coded as "double"
glimpse(mtcars)
#> Rows: 32
#> Columns: 11
#> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
#> $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
#> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
#> $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
#> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
#> $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
#> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
#> $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
#> $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
#> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

# Cars have either 4, 6 or 8 cylinders
table(mtcars$cyl)
#> 
#>  4  6  8 
#> 11  7 14

There are several ways to encode this categorical information before running a regression:

# Factor coding: Use built in R style 
mtcars <- mtcars %>% 
  mutate(cyl_fct = as_factor(cyl))

# Creating a dummy for each group 
mtcars <- mtcars %>% 
  mutate(cyl4 = ifelse(cyl == 4, 1, 0),
         cyl6 = ifelse(cyl == 6, 1, 0),
         cyl8 = ifelse(cyl == 8, 1, 0))

# One hot encoding: every level of the factor/categorical variable gets its own column/dummy

# Dummy coding: there are k-1 dummies, because one is redundant 

The following types of regression are possible. First, treat cyl like a continuous variable. This might or might not make sense: R then assumes the relationship holds for any number of cylinders, including values that do not occur in the data (e.g. 5 or 7).

# Treated like continuous
model1 <- lm(mpg ~ cyl, data=mtcars)

# Treated like categorical (not showing base category)
model2 <- lm(mpg ~ cyl_fct, data=mtcars)

# One hot coding (all included, but one is NA)
model3 <- lm(mpg ~ cyl4 + cyl6 + cyl8, data=mtcars)

# Dummy coding (same coefficients, no NA)
model4 <- lm(mpg ~ cyl4 + cyl6, data=mtcars)

modelsummary(title = 'Categorical Predictor.',
             list("MPG" = model1,
                  "MPG" = model2,
                  "MPG" = model3,
                  "MPG" = model4), 
             statistic = NULL,
             gof_omit = 'R2|AIC|BIC|RMSE|Log.Lik.|F')
Categorical Predictor.
              MPG       MPG       MPG       MPG
(Intercept)   37.885    26.664    15.100    15.100
cyl           -2.876
cyl_fct6                -6.921
cyl_fct8                -11.564
cyl4                              11.564    11.564
cyl6                              4.643     4.643
Num.Obs.      32        32        32        32
# Connection

# cyl_fct6 is 6.9 lower than the invisible base category (4 cylinders)
lm(mpg ~ cyl_fct, data=mtcars)
#> 
#> Call:
#> lm(formula = mpg ~ cyl_fct, data = mtcars)
#> 
#> Coefficients:
#> (Intercept)     cyl_fct6     cyl_fct8  
#>      26.664       -6.921      -11.564

# recalculate this as coefficient of cyl6 - cyl4 
lm(mpg ~ cyl4 + cyl6, data=mtcars)$coefficients[3]-lm(mpg ~ cyl4 + cyl6, data=mtcars)$coefficients[2]
#>      cyl6 
#> -6.920779

11.6 Parallel Slopes

Parallel slopes is a special case of multiple regression, where several input variables describe or explain an outcome. In particular, one of the variables is continuous (here age) and the other is a dummy variable (here gender). The resulting graph shows two parallel linear regression lines, shifted by the dummy variable.

11.6.1 X is continuous + dummy

In contrast to the first figure, we now consider all data at once and recognize gender as a factor influencing annual income. The dummy variable gender offsets the age-income relationship.

parallel1 <- lm(Income ~ Age + Female, data=soep)

modelsummary(title = 'Parallel Slopes.',
             list("Income" = parallel1), 
             gof_omit = 'R2|AIC|BIC|RMSE|Log.Lik.',
             coef_map = cm)
Parallel Slopes.
Income
Age 501.348
(57.643)
Female -8670.282
(1345.244)
Constant 22487.063
(2764.643)
Num.Obs. 1211
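A sketch of how the two parallel lines implied by this model could be drawn (assuming the factor coding from above, so the gender coefficient is named Female1):

# Draw the two parallel regression lines implied by parallel1
b <- coef(parallel1)

ggplot(soep, aes(x = Age, y = Income, color = Female)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = b["(Intercept)"], slope = b["Age"], color = "blue") +               # men
  geom_abline(intercept = b["(Intercept)"] + b["Female1"], slope = b["Age"], color = "red")   # women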

11.7 Model Comparison

multiple <- lm(Income ~ Age + Female + industry, data=soep)

modelsummary(title = 'Linear regression models compared.',
             list("Income" = simplest,
                  "Income" = simple1,
                  "Income" = simple2,
                  "Income" = simple3,
                  "Income" = parallel1,
                  "Income" = multiple), 
             
             gof_omit = 'RMSE|Log.Lik.|F|AIC|BIC',
             metrics = "all",
             coef_map = cm)
Linear regression models compared.
               Income      Income      Income       Income       Income      Income
Age                        524.554                               501.348     478.369
                           (58.487)                              (57.643)    (54.886)
Female                                 -9401.118                 -8670.282   -9164.068
                                       (1383.445)                (1345.244)  (1323.594)
Banking                                             46123.264                46653.149
                                                    (9569.929)               (9090.316)
Construction                                        8057.920                 8349.413
                                                    (6489.108)               (6164.261)
Energy                                              22774.586                23587.586
                                                    (7425.221)               (7053.065)
Manufacturing                                       22032.617                23113.979
                                                    (6124.951)               (5819.116)
Mining                                              31592.741                31652.597
                                                    (14250.493)              (13539.296)
Transport                                           36264.336                37942.568
                                                    (6535.281)               (6210.851)
Trade                                               11947.600                14909.762
                                                    (7112.908)               (6761.498)
Services                                            25698.418                28857.012
                                                    (6291.360)               (5993.555)
Other                                               14521.498                17854.146
                                                    (6063.652)               (5773.785)
Constant       42145.215   18341.031   45498.875    23434.559    22487.063   2670.410
               (674.976)   (2733.469)  (826.288)    (5986.405)   (2764.643)  (6232.343)
Num.Obs.       1211        1211        1211         1211         1211        1211
R2             0.000       0.062       0.037        0.097        0.094       0.187
R2 Adj.        0.000       0.062       0.036        0.091        0.092       0.180

Definition

R square (R2) is a statistical measure that represents the proportion of the variance in the dependent variable that is explained by the independent variable(s) in a regression model. It is also known as the coefficient of determination.

R2 measures how well the regression model fits the data. R2 can range from 0 to 1, where 0 indicates that none of the variance in the dependent variable is explained by the independent variable(s), and 1 indicates that all of the variance in the dependent variable is explained by the independent variable(s). A higher R2 indicates a better fit between the model and the data, meaning that more of the variability in the dependent variable can be explained by the independent variable(s) in the model. The formula for R2 is:

\[R2 = 1 - \frac{SSR}{SST} = 1 - \frac{\text{sum of squared residuals}}{\text{total sum of squares}} = 1 - \frac{\displaystyle\sum \left(y - \hat{y}\right)^2}{\displaystyle\sum \left(y - \overline{y}\right)^2} \]

A good R2 – In some fields like social sciences or economics, an R2 value of 0.3 or higher may be considered a good fit for a model. In other fields like physics or engineering, a higher R2 value of 0.7 or above may be necessary to demonstrate a good fit.

R2 inflation – In least squares regression, R2 is at least weakly increasing in the number of regressors in the model. Because adding regressors can only increase R2, R2 alone cannot be used as a meaningful comparison of models with very different numbers of independent variables.

Definition

Adjusted R2 is a modified version of R2 that takes into account the number of independent variables in a regression model.

\[\text{Adjusted R2} = 1 - \left[ (1 - R2) \cdot \frac{n - 1}{n - k - 1} \right] \tag{with intercept}\] \[\text{Adjusted R2} = 1 - \left[ (1 - R2) \cdot \frac{n - 1}{n - k} \right] \tag{no intercept}\]

Here \(n\) is the number of observations and \(k\) the number of independent variables (excluding the intercept).
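A minimal sketch verifying both measures on the parallel slopes model from above (k = 2 regressors, Age and Female):

# R2 and adjusted R2 by hand for the parallel slopes model
y   <- soep$Income
ssr <- sum(residuals(parallel1)^2)   # sum of squared residuals
sst <- sum((y - mean(y))^2)          # total sum of squares

r2 <- 1 - ssr / sst
n  <- nobs(parallel1)                # 1211
k  <- 2                              # Age and Female (excluding the intercept)
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

c(r2, summary(parallel1)$r.squared)          # both about 0.094
c(r2_adj, summary(parallel1)$adj.r.squared)  # both about 0.092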

11.8 Transform to Perform

While a lot of statistics and econometrics deals with linear relationships, we live in a very non-linear world. By taking the logarithm of your data, you can reduce the range of values and make it easier to see patterns and relationships. The log transformation is a common technique for dealing with skewed or highly variable data, such as income. Income data often have a long right tail: there are a few individuals with very high incomes, but many with lower incomes (left panel). When modeling income in a regression, this can cause issues because the skewed distribution is at odds with the normality assumption that underpins many regression techniques.

Remember some key points of log:

  • The log can only be calculated for numbers > 0.
  • The log of values between 0 and 1 is negative.
  • The log of values above 1 is positive.
  • The log of 1 is 0.

How to do logs in R?

# Logs in R
x <- c(0.1,0.5,1,2,5)
x_log <- log(x)

data.frame(x, x_log)
#>     x      x_log
#> 1 0.1 -2.3025851
#> 2 0.5 -0.6931472
#> 3 1.0  0.0000000
#> 4 2.0  0.6931472
#> 5 5.0  1.6094379

It's natural.

The default of log() is the natural logarithm. There are other logs as well, e.g. the base-10 logarithm via log(x, base = 10) or log10(x).

The above specification of income versus age as two non-transformed variables is also known as a linear-linear or level-level model. You can apply a log transformation on the left-hand side, the right-hand side, or both sides. Each log transformation changes the interpretation of the regression coefficient. We explore these options in the following.

We estimate the log-linear (log-level), the log-log and the linear-log model.

log_linear <- lm(log(Income) ~ Age + Female + industry, data=soep)
log_log <- lm(log(Income) ~ log(Age) + Female + industry, data=soep)
linear_log <- lm(Income ~ log(Age) + Female + industry, data=soep)
Linear regression with log transformations.
              Income      Log (Income)  Log (Income)  Income
Age           478.369     0.014
              (54.886)    (0.001)
Log (Age)                               0.624         20595.612
                                        (0.058)       (2226.835)
Num.Obs.      1211        1211          1211          1211
R2            0.187       0.177         0.189         0.193
R2 Adj.       0.180       0.169         0.182         0.186

Regression coefficients interpretation.

Here are the interpretations (a short sketch after the list reproduces the back-of-envelope numbers):

  • Linear-Linear: For each additional year of age, we expect 478 € more annual income on average. Easy peasy.
  • Log-Linear (semi-elasticity): For each additional year of age, we expect annual income to increase by \((e^{0.014} - 1) * 100 = 1.4\) % on average. Starting from the average annual income of 42145 € one more year in age increases income by 590 €.
  • Log-Log (elasticity): For each 1% increase in age, we expect annual income to increase by 0.62 % on average. For a 30-year-old earning the average annual income, a 1% increase in age means 0.3 years or 3.6 months, increasing annual income by 261 €.
  • Linear-Log: For each 1% increase in age, we expect annual income to increase by 206 €. For a 30-year-old, an additional 1% or 3.6 months is expected to increase annual income by 206 €.
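The numbers in these interpretations can be reproduced from the fitted models (a sketch; coefficient names follow the formulas above):

# Back-of-envelope checks for the interpretations above
b_loglin <- coef(log_linear)["Age"]        # about 0.014
(exp(b_loglin) - 1) * 100                  # about 1.4 % more income per extra year of age
mean(soep$Income) * (exp(b_loglin) - 1)    # about 590 € at the average income

b_loglog <- coef(log_log)["log(Age)"]      # about 0.62
mean(soep$Income) * b_loglog / 100         # about 261 € for a 1 % increase in age

b_linlog <- coef(linear_log)["log(Age)"]   # about 20596
b_linlog / 100                             # about 206 € for a 1 % increase in age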

Of course other mathematical transformations are possible and some are common, for example the square root or polynomial transformation (the power of two or three). Stutzer and Frey (2008) study the effect of commuting time (and commuting time squared) on life satisfaction (Tables 1 to 4).
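A sketch of what such specifications could look like in this setting (these exact models are not estimated above):

# A squared age term, in the spirit of Stutzer and Frey (2008)
quadratic <- lm(Income ~ Age + I(Age^2) + Female + industry, data = soep)

# A square-root transformation of the outcome
sqrt_model <- lm(sqrt(Income) ~ Age + Female + industry, data = soep)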


  1. The same consideration applies to handedness. Normally, people are classified as left-handers or right-handers. But you can make a case for ambidextrous people who use both hands equally well, as well as for a mixture of hand usage, making handedness a continuous concept.↩︎