Individual Submission Summary

Beyond Selection Bias and Missing Data: Recovering True Prevalences

Mon, August 10, 4:00 to 5:30pm, TBA

Abstract

Descriptive quantities such as prevalences and means are foundational to applied statistical research, yet the assumptions underlying the data-generating mechanisms that produce them are rarely interrogated. Missing data, both unit-missing (sample selection) and item-missing (nonresponse to specific items), can systematically distort these quantities, so that even simple descriptive estimates are susceptible to bias. In this paper, we formalize assumptions about missingness mechanisms using causal graphs for missing data (m-DAGs), unifying the treatment of unit- and item-missingness within a single graphical framework. We show how m-DAGs make the causal structure of missingness transparent and thus actionable, clarifying when conventional approaches such as complete-case analysis and multiple imputation succeed and when they fail. We run a Monte Carlo simulation study across five scenarios: Missing Completely at Random (MCAR), Missing at Random (MAR), sample selection, combined sample selection and item-missingness, and Missing Not at Random (MNAR). In each scenario, we evaluate the naive estimator (listwise deletion), multiple imputation, and debiased estimators that combine the law of total probability with external data. We demonstrate that multiple imputation, while effective under pure MAR conditions, fails when sample selection is present or when missingness depends on the outcome itself. For recoverable scenarios, reweighting estimators using aggregate-level auxiliary data (e.g., from censuses or administrative registers) eliminate bias. For MNAR scenarios, where standard methods cannot recover the true prevalence, we generalize Quantitative and Probabilistic Bias Analysis (QBA/PBA) to univariate prevalence estimation and demonstrate the value of small-scale validation studies targeting nonrespondents.
Our findings underscore that even purely descriptive research requires causal thinking: understanding why data are missing is essential to measuring what we intend to measure.
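
As an illustration of the reweighting idea for a recoverable scenario, the following is a minimal sketch, not the authors' estimator or simulation design. It assumes a hypothetical binary stratum X whose population shares are known from an external source (for example, a census), and a selection mechanism that depends on X only. Under that assumption the law of total probability gives P(Y=1) = sum over x of P(Y=1 | X=x, selected) * P(X=x). All names and numeric values (`simulate`, `poststratified_prevalence`, the probabilities) are invented for illustration.

```python
import random

def simulate(n=200_000, seed=0):
    """Draw a hypothetical population and keep only the selected units."""
    rng = random.Random(seed)
    p_x1 = 0.4                      # assumed population share of stratum X=1
    p_y_given_x = {0: 0.2, 1: 0.6}  # assumed outcome prevalence per stratum
    p_select = {0: 0.2, 1: 0.8}     # selection depends on X only (recoverable)
    sample = []
    for _ in range(n):
        x = 1 if rng.random() < p_x1 else 0
        y = 1 if rng.random() < p_y_given_x[x] else 0
        if rng.random() < p_select[x]:
            sample.append((x, y))
    return sample

def naive_prevalence(sample):
    """Complete-case estimate: mean of Y among selected units."""
    return sum(y for _, y in sample) / len(sample)

def poststratified_prevalence(sample, stratum_shares):
    """Weight within-stratum sample prevalences by external population shares."""
    estimate = 0.0
    for x, share in stratum_shares.items():
        ys = [y for sx, y in sample if sx == x]
        estimate += share * (sum(ys) / len(ys))
    return estimate

sample = simulate()
external_shares = {0: 0.6, 1: 0.4}  # stand-in for aggregate auxiliary data
true_prevalence = 0.6 * 0.2 + 0.4 * 0.6
naive = naive_prevalence(sample)
adjusted = poststratified_prevalence(sample, external_shares)
print(f"true={true_prevalence:.3f} naive={naive:.3f} adjusted={adjusted:.3f}")
```

In this toy setup the naive estimate overstates the prevalence because the high-prevalence stratum is oversampled, while the reweighted estimate lands close to the true value. If selection also depended on Y itself (the MNAR case), the within-stratum sample prevalences would be biased and this adjustment alone would not recover the truth.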

Author