This book is a work in progress. I appreciate your feedback to make the book better.

Chapter 3 Tabular Data

Tabular data is one of the most common types of data. Organizing information systematically into rows and columns has been a cornerstone of data-driven decision-making for centuries. From handwritten ledgers to digital spreadsheets, tables have served as the foundation for understanding, managing, and extracting insights from a wide range of datasets.

Because tabular data is easy to understand and easy to handle, approaches exist for transforming non-tabular data, such as text, into a tabular format. In the R community this is known as the tidy data principle: each variable must have its own column, each observation must have its own row, and each value must have its own cell.⁹

This chapter explores tabular data and its applications. We look at how to manipulate, analyze, and visualize data in this familiar format in order to uncover patterns and inform decisions. We then introduce a special subset of tabular data that adds a time dimension, and with it extra analytical depth: panel data.

Definition

Tidy data principles are:

Every column is a variable.
Every row is an observation.
Every cell is a single value.
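To make the three principles concrete, here is a minimal sketch in plain Python. The table, its values, and the helper `to_long` are hypothetical and exist only for illustration. The "wide" table breaks the first principle because its column names (`income_2023`, `income_2024`) hide a variable, the year. Reshaping turns each person-year into its own row.

```python
# Hypothetical "wide" table: one row per person, one income column per year.
# The year is hidden in the column names, so the table is not tidy.
wide = [
    {"id": 1, "income_2023": 2100, "income_2024": 2300},
    {"id": 2, "income_2023": 3400, "income_2024": 3350},
]


def to_long(rows, prefix="income_"):
    """Reshape so that every row is one (id, year) observation."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column.startswith(prefix):
                # The part of the column name after the prefix is the year.
                year = int(column[len(prefix):])
                long_rows.append({"id": row["id"], "year": year, "income": value})
    return long_rows


tidy = to_long(wide)
for row in tidy:
    print(row)
```

After reshaping, `id`, `year`, and `income` are each a column, every row is one observation, and every cell holds a single value.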

3.1 Types of Tabular Data

How can there be different types of tabular data when a table always consists of rows and columns?

3.1.1 Cross-section

Look at the following example. The data were collected in 2024. Each row represents a different person (or unit): six women of different ages and their respective incomes.

Cross-sectional data are collected by observing many subjects (such as individuals, firms, countries, or regions) at one point or period in time. They can answer questions about levels: "How many people are poor in 2023 in Germany?" and questions about differences: "How are men and women affected by poverty?".
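The two kinds of questions can be sketched with a small, made-up cross-section. All values below, the variable names, and the poverty threshold `POVERTY_LINE` are assumptions chosen for illustration, not real data or an official definition of poverty.

```python
# Hypothetical cross-section: one row per person, all observed in the same year.
people = [
    {"id": 1, "sex": "f", "age": 23, "income": 900},
    {"id": 2, "sex": "f", "age": 35, "income": 2600},
    {"id": 3, "sex": "f", "age": 51, "income": 3100},
    {"id": 4, "sex": "m", "age": 29, "income": 1000},
    {"id": 5, "sex": "m", "age": 44, "income": 800},
    {"id": 6, "sex": "m", "age": 62, "income": 2900},
]
POVERTY_LINE = 1200  # assumed threshold, for illustration only


def poverty_rate(rows):
    """Share of rows with income below the assumed poverty line."""
    return sum(r["income"] < POVERTY_LINE for r in rows) / len(rows)


# Level: how many people are poor?
n_poor = sum(r["income"] < POVERTY_LINE for r in people)

# Difference: how do poverty rates compare between groups?
rate_by_sex = {
    sex: poverty_rate([r for r in people if r["sex"] == sex])
    for sex in ("f", "m")
}

print(n_poor)
print(rate_by_sex)
```

Because every person appears exactly once, the table says nothing about how an individual's situation changes over time.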

3.1.2 Repeated cross-section

Cross-sectional survey data are data for a single point in time. Repeated cross-sectional data are created when a survey (or measurement) is administered to a new sample of interviewees at successive time points. For an annual survey, this means that respondents in one year will be different people from those in a prior year. Such data can either be analysed cross-sectionally, by looking at one survey year, or combined for analysis over time.

This type of data can answer questions about trends: "Has poverty increased or decreased?".
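A trend of this kind can be sketched by computing the same statistic separately for each survey year. The data, the identifiers, and the threshold `POVERTY_LINE` below are hypothetical. Note that no `id` appears in more than one year, which is what distinguishes a repeated cross-section from a panel.

```python
from collections import defaultdict

# Hypothetical repeated cross-section: a fresh sample each year (ids never repeat).
survey = [
    {"year": 2022, "id": 101, "income": 900},
    {"year": 2022, "id": 102, "income": 2500},
    {"year": 2022, "id": 103, "income": 1100},
    {"year": 2022, "id": 104, "income": 3000},
    {"year": 2023, "id": 201, "income": 1000},
    {"year": 2023, "id": 202, "income": 2700},
    {"year": 2023, "id": 203, "income": 1900},
    {"year": 2023, "id": 204, "income": 3200},
]
POVERTY_LINE = 1200  # assumed threshold, for illustration only

# Collect one poverty flag per respondent, grouped by survey year.
flags_by_year = defaultdict(list)
for row in survey:
    flags_by_year[row["year"]].append(row["income"] < POVERTY_LINE)

# Poverty rate per year: the share of respondents below the threshold.
rate_by_year = {
    year: sum(flags) / len(flags) for year, flags in sorted(flags_by_year.items())
}
print(rate_by_year)
```

In this made-up sample the rate falls from one year to the next, but the data cannot tell us whether particular individuals left poverty, because each year's respondents are different people.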

3.1.3 Time series

A time series is data on a single subject at multiple points in time. Most commonly, the data are collected at successive, equally spaced points in time, e.g. daily or annually. Annual series often come from survey studies, while more frequent series, e.g. daily ones, are typical of fields such as meteorology or finance. A time series is very frequently plotted as a run chart (a temporal line chart).

Time series data can answer questions about trends and seasonality: "Is there a seasonal component in unemployment?".
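One crude way to look for a seasonal component is to compare the average of each quarter with the overall average. The sketch below uses a made-up quarterly unemployment series. It deliberately ignores any trend, so treat it as an illustration of the idea rather than a proper seasonal decomposition.

```python
from statistics import mean

# Hypothetical quarterly unemployment rates (%) for one country:
# each entry is (year, quarter, rate).
series = [
    (2021, 1, 6.0), (2021, 2, 5.0), (2021, 3, 5.0), (2021, 4, 6.0),
    (2022, 1, 6.5), (2022, 2, 5.5), (2022, 3, 5.5), (2022, 4, 6.5),
]

# Average over the whole series.
overall = mean(rate for _, _, rate in series)

# Average deviation of each quarter from the overall mean.
seasonal = {
    q: mean(rate for _, quarter, rate in series if quarter == q) - overall
    for q in (1, 2, 3, 4)
}
print(seasonal)
```

Positive values mark quarters in which unemployment tends to lie above the series average, negative values quarters in which it tends to lie below.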

3.1.4 Panel data

Panel data are observations of the same subjects over time. Subjects can be people, households, firms, or countries. Panel data are a subset of longitudinal data. The key components are the two panel identifiers: the subject, here the person (id), and time (year). Every row is a person-year combination (the so-called long format).

With panel data we know the time-ordering of events. Panel data allow us to identify causal effects under weaker assumptions than cross-sectional data. Panel data can answer questions about change: "How many people went in and out of poverty?".
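The question about movements in and out of poverty can be sketched with a small panel in long format. The data, the threshold `POVERTY_LINE`, and the helper `transitions` are hypothetical and serve only to illustrate how the two panel identifiers (id, year) are used.

```python
# Hypothetical panel in long format: one row per person-year.
panel = [
    {"id": 1, "year": 2022, "income": 900},
    {"id": 1, "year": 2023, "income": 1500},
    {"id": 2, "year": 2022, "income": 2500},
    {"id": 2, "year": 2023, "income": 1000},
    {"id": 3, "year": 2022, "income": 1100},
    {"id": 3, "year": 2023, "income": 1150},
    {"id": 4, "year": 2022, "income": 3000},
    {"id": 4, "year": 2023, "income": 3100},
]
POVERTY_LINE = 1200  # assumed threshold, for illustration only

# Index poverty status by the two panel identifiers (id, year).
poor = {(r["id"], r["year"]): r["income"] < POVERTY_LINE for r in panel}


def transitions(poor, start, end):
    """Count poverty transitions between two years for people seen in both."""
    counts = {"entered": 0, "exited": 0, "stayed_poor": 0, "stayed_non_poor": 0}
    for pid in sorted({pid for pid, _ in poor}):
        before = poor.get((pid, start))
        after = poor.get((pid, end))
        if before is None or after is None:
            continue  # person not observed in both years
        if before and after:
            counts["stayed_poor"] += 1
        elif before:
            counts["exited"] += 1
        elif after:
            counts["entered"] += 1
        else:
            counts["stayed_non_poor"] += 1
    return counts


flows = transitions(poor, 2022, 2023)
print(flows)
```

In this made-up example half of the sample is poor in both years, so a cross-sectional comparison would show no change, yet one person entered poverty and one person left it. Only data that follow the same people over time can reveal these flows.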

⁹ Read more about tidy data.