Introduction
panelbuild provides tools for auditing, validating, and
preparing panel datasets before statistical analysis.
Panel datasets often contain duplicate unit-time observations, missing time periods, irregular gaps, and imbalance. These issues can affect fixed effects models, difference-in-differences designs, event studies, and other panel-data methods.
The goal of panelbuild is to help users identify these
issues before estimation.
Example panel dataset
panelbuild includes a small example dataset called
example_panel.
data(example_panel)
example_panel
#>   id year outcome treatment
#> 1  1 2020      10         0
#> 2  1 2021      12         1
#> 3  1 2021      13         1
#> 4  2 2020      20         0
#> 5  2 2022      25         1
#> 6  3 2020      30         0
#> 7  3 2021      31         0
#> 8  3 2022      32         1
#> 9  3 2023      33         1
The dataset intentionally includes:
- a duplicate unit-time observation
- missing unit-time cells
- an unbalanced panel structure
This makes it useful for demonstrating panel-data diagnostics.
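To make the first bullet concrete, the duplicate cell can be located by hand. The following is a minimal base-R sketch, not part of panelbuild: the data frame `panel` is retyped from the printed output above (so column types may differ from the packaged dataset), and the names `keys`, `is_dup`, and `dup_rows` are illustrative.

```r
# Example data retyped by hand from the printed output above (an assumption,
# not loaded from the package).
panel <- data.frame(
  id        = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  year      = c(2020, 2021, 2021, 2020, 2022, 2020, 2021, 2022, 2023),
  outcome   = c(10, 12, 13, 20, 25, 30, 31, 32, 33),
  treatment = c(0, 1, 1, 0, 1, 0, 0, 1, 1)
)

keys <- panel[, c("id", "year")]

# Flag every row whose id-year pair occurs more than once
# (scanning in both directions so that all copies are flagged).
is_dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
dup_rows <- panel[is_dup, ]
dup_rows
```

The two flagged rows share id 1 and year 2021 but carry different outcome values (12 and 13), which is why duplicates need a deliberate decision rather than silent removal.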
Audit the panel
The main function is audit_panel().
audit_panel(example_panel, id = id, time = year)
#> Panel audit
#>
#> Data: example_panel
#> Unit variable: id
#> Time variable: year
#>
#> Units: 3
#> Time periods: 4
#> Observed rows: 9
#> Observed id-time cells: 8
#> Expected id-time cells: 12
#> Missing id-time cells: 4
#> Duplicate id-time cells: 1
#> Balanced panel: No
This gives a quick overview of the panel structure, including whether the panel is balanced and whether there are missing or duplicate unit-time cells.
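The counts in the audit can be reproduced with simple arithmetic. The sketch below is base R only and rests on two assumptions: that the expected number of cells is the number of units times the number of distinct time periods observed anywhere in the data (which matches 3 × 4 = 12 above), and that "balanced" means no missing and no duplicate cells. It is not a description of how audit_panel() is implemented, and the data frame is retyped by hand from the printed example.

```r
# Example data retyped by hand from the printed output (an assumption).
panel <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  year = c(2020, 2021, 2021, 2020, 2022, 2020, 2021, 2022, 2023)
)

n_units   <- length(unique(panel$id))    # distinct units
n_periods <- length(unique(panel$year))  # distinct periods seen anywhere

# Distinct id-year combinations actually present in the data.
observed_cells <- nrow(unique(panel[, c("id", "year")]))

# Assumed definition: a full grid has one cell per unit per period.
expected_cells <- n_units * n_periods
missing_cells  <- expected_cells - observed_cells

# Cells that appear on more than one row.
cell_counts     <- table(paste(panel$id, panel$year))
duplicate_cells <- sum(cell_counts > 1)

# Assumed definition of balance: nothing missing, nothing duplicated.
balanced <- missing_cells == 0 && duplicate_cells == 0
```

Under these assumptions the nine rows collapse to eight distinct cells, four short of the twelve-cell grid, with one cell duplicated.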
Find duplicate observations
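Duplicate unit-time observations are a common problem in panel datasets. duplicate_summary() reports, for each affected unit, the number of duplicated id-time cells and the number of extra rows they contribute.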
duplicate_summary(example_panel, id = id, time = year)
#> # A tibble: 1 × 3
#>      id panelbuild_duplicate_cells panelbuild_duplicate_extra_rows
#>   <dbl>                      <int>                           <int>
#> 1     1                          1                               1
Summarize gaps
gap_summary() counts the missing time periods for each panel unit. In the example below, units with no missing periods (id 3) do not appear in the output.
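As a rough illustration of what a missing period means here, the sketch below lists, for each unit, the years that appear somewhere in the data but not for that unit. It is base R only and assumes gaps are measured against the full set of observed years (so trailing absences such as 2022 and 2023 for id 1 count as missing). The names `all_years` and `missing_by_id` are illustrative, and the data frame is retyped by hand from the printed example.

```r
# Example data retyped by hand from the printed output (an assumption).
panel <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  year = c(2020, 2021, 2021, 2020, 2022, 2020, 2021, 2022, 2023)
)

# Reference set: every year observed for any unit.
all_years <- sort(unique(panel$year))

# For each unit, the reference years it never reports.
missing_by_id <- lapply(
  split(panel$year, panel$id),
  function(years) setdiff(all_years, years)
)

# Number of missing periods per unit (units with zero gaps are kept here).
n_missing <- lengths(missing_by_id)
```

Here id 1 lacks 2022 and 2023, id 2 lacks 2021 and 2023, and id 3 is complete.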
gap_summary(example_panel, id = id, time = year)
#> # A tibble: 2 × 2
#>      id panelbuild_missing_periods
#>   <dbl>                      <int>
#> 1     1                          2
#> 2     2                          2
Flag row-level issues
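flag_panel_issues() adds row-level diagnostic columns to the data: a row identifier, the number of rows sharing each id-time cell, and a logical flag marking duplicated cells.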
flag_panel_issues(example_panel, id = id, time = year)
#> # A tibble: 9 × 7
#>      id  year outcome treatment panelbuild_row_id panelbuild_id_time_n
#>   <dbl> <dbl>   <dbl>     <dbl>             <int>                <int>
#> 1     1  2020      10         0                 1                    1
#> 2     1  2021      12         1                 2                    2
#> 3     1  2021      13         1                 3                    2
#> 4     2  2020      20         0                 4                    1
#> 5     2  2022      25         1                 5                    1
#> 6     3  2020      30         0                 6                    1
#> 7     3  2021      31         0                 7                    1
#> 8     3  2022      32         1                 8                    1
#> 9     3  2023      33         1                 9                    1
#> # ℹ 1 more variable: panelbuild_duplicate_cell <lgl>
Complete a panel grid
complete_panel() creates a complete unit-time grid. It
does not impute missing outcome values.
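Because complete_panel() requires unique unit-time cells, we first remove duplicate id-time observations from the example dataset. Note that dplyr::distinct() with .keep_all = TRUE keeps the first row of each id-year cell, so the 2021 row for id 1 with outcome 12 is retained and the row with outcome 13 is dropped. In real data, decide deliberately which duplicate to keep.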
example_panel_unique <- example_panel |>
dplyr::distinct(id, year, .keep_all = TRUE)
complete_panel(example_panel_unique, id = id, time = year)
#> # A tibble: 12 × 7
#>       id  year outcome treatment panelbuild_original_row panelbuild_completed_…¹
#>    <dbl> <dbl>   <dbl>     <dbl> <lgl>                   <lgl>
#>  1     1  2020      10         0 TRUE                    FALSE
#>  2     1  2021      12         1 TRUE                    FALSE
#>  3     1  2022      NA        NA FALSE                   TRUE
#>  4     1  2023      NA        NA FALSE                   TRUE
#>  5     2  2020      20         0 TRUE                    FALSE
#>  6     2  2021      NA        NA FALSE                   TRUE
#>  7     2  2022      25         1 TRUE                    FALSE
#>  8     2  2023      NA        NA FALSE                   TRUE
#>  9     3  2020      30         0 TRUE                    FALSE
#> 10     3  2021      31         0 TRUE                    FALSE
#> 11     3  2022      32         1 TRUE                    FALSE
#> 12     3  2023      33         1 TRUE                    FALSE
#> # ℹ abbreviated name: ¹panelbuild_completed_cell
#> # ℹ 1 more variable: panelbuild_audit_action <chr>
Typical workflow
A typical panelbuild workflow is:
library(panelbuild)
audit_panel(my_data, id = unit_id, time = year)
duplicate_summary(my_data, id = unit_id, time = year)
gap_summary(my_data, id = unit_id, time = year)
clean_data <- my_data |>
dplyr::distinct(unit_id, year, .keep_all = TRUE)
complete_panel(clean_data, id = unit_id, time = year)
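For intuition, the last two steps of this workflow (dropping duplicates, then completing the grid) can be approximated in base R. This is a sketch under stated assumptions, not the package's implementation: the data frame is retyped by hand from the example above, the first row of each duplicated cell is kept, and the marker column `original_row` is a hypothetical name chosen for this illustration.

```r
# Example data retyped by hand from the printed output (an assumption).
panel <- data.frame(
  id        = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  year      = c(2020, 2021, 2021, 2020, 2022, 2020, 2021, 2022, 2023),
  outcome   = c(10, 12, 13, 20, 25, 30, 31, 32, 33),
  treatment = c(0, 1, 1, 0, 1, 0, 0, 1, 1)
)

# Keep the first row of each id-year cell.
panel_unique <- panel[!duplicated(panel[, c("id", "year")]), ]

# Mark rows that come from the original data before adding empty cells.
panel_unique$original_row <- TRUE

# Full grid: every unit crossed with every observed year.
grid <- expand.grid(
  id   = sort(unique(panel_unique$id)),
  year = sort(unique(panel_unique$year))
)

# Left join the data onto the grid; unmatched cells get NA values.
completed <- merge(grid, panel_unique, by = c("id", "year"), all.x = TRUE)
completed$original_row[is.na(completed$original_row)] <- FALSE
completed <- completed[order(completed$id, completed$year), ]
```

The added cells carry NA for outcome and treatment. As with complete_panel(), nothing is imputed, so any filling of those values remains a separate modelling decision.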