This book is Work in Progress. I appreciate your feedback to make the book better.

Chapter 8 Missing Data

Well, missing data refers to the absence of values in a dataset where information is expected.

It can introduce bias and impact the validity of statistical analyses.

8.1 Types of Missing Data

Understanding the nature of missing data is crucial for selecting appropriate methods to handle it in statistical analyses.

  • Missing Completely at Random (MCAR)
  • Missing at Random (MAR)
  • Missing Not at Random (MNAR)

8.1.1 Missing Completely at Random (MCAR)

Missing data is completely random if the likelihood of missing a value is the same for all observations, regardless of the observed values.

Imagine you are conducting a survey on people's favorite colors, and some respondents accidentally skip the question due to distractions or oversights. If the likelihood of skipping the question is the same for all respondents, regardless of their actual favorite color, then the missing data is considered completely random (MCAR).

8.1.2 Missing at Random (MAR)

Missing data is at random if the likelihood of missing a value depends on observed information but not on the unobserved (missing) values.

In a study on income, respondents might be more likely to skip the income question if they are younger. However, once you know the respondent's age, the likelihood of missing the income information is the same for people with the same age. The missing data is at random because it depends on observed information (age) but not on the unobserved variable (income).

8.1.3 Missing Not at Random (MNAR)

Missing data is not at random if the likelihood of missing a value is related to the unobserved (missing) values themselves.

Consider a health survey where individuals with higher levels of stress are less likely to report their stress levels accurately. In this case, the missing data (unreported stress levels) is related to the unobserved variable (stress). This scenario is considered missing not at random (MNAR) because the likelihood of missing data depends on the unobserved variable.

8.2 Causes of Missing Data

-   Human error during data entry

-   Technical issues in data collection

-   Intentional non-response by participants

8.3 Causes of Missing Data

Intentional Non-response by Participants:

  • Example: Respondents may choose not to answer specific questions due to privacy concerns, sensitivity, or personal reasons.

  • Impact: Intentional non-response can introduce bias and affect the representativeness of the collected data.

8.4 Causes of Missing Data

Data Cleaning and Preprocessing Errors:

  • Example: Mistakes made during data cleaning or preprocessing stages can inadvertently introduce missing values.

  • Impact: Errors in data preparation can affect the accuracy of downstream analyses and interpretations.

8.5 Causes of Missing Data

Changes in Measurement Protocols:

  • Example: Modifications in measurement methods or instruments over time can lead to inconsistencies and missing data.

  • Impact: Changes in protocols may result in data incompatibility, making it challenging to analyze trends over different periods.

What happened in SOEP for firm size? See Paneldata.org or Documentation

8.6 Missing Data in R

In R, missing values are represented by the symbol NA (not available). Impossible values (e.g., dividing by zero) are represented by the symbol NaN (not a number). Unlike SAS, R uses the same symbol for character and numeric data.