Purpose
Despite the increased use of finite mixture modeling and substantial progress in machine learning imputation, a critical gap remains in the methodological literature: no study has integrated these two approaches and assessed their effectiveness under varying conditions. Accordingly, the primary objective of this study is to evaluate and compare the performance of machine learning-based imputation methods that have not previously been applied to growth mixture models (GMMs).
Theoretical Framework
Ignorable missingness assumes that missingness does not depend on the unobserved values themselves; it is categorized as Missing Completely At Random (MCAR), where missingness is independent of both observed and unobserved data, or Missing At Random (MAR), where missingness depends only on observed data. Enders and Gottschall (2011) noted that even under ignorable missingness, imputation can attenuate differences in latent class means and variances. Existing methods often produce biased results with small sample sizes and underestimate slope variances and intercept-slope covariances (Lee & Harring, 2023).
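To make the MCAR/MAR distinction concrete, a minimal sketch (not from the study; all variable names and probabilities are hypothetical) of inducing each mechanism in a toy set of four repeated measures:

```python
# Illustrative sketch: inducing MCAR vs. MAR missingness in toy
# longitudinal data. Sample size and missingness rates are assumptions.
import numpy as np

rng = np.random.default_rng(42)
n = 1000
y = rng.normal(size=(n, 4))  # four repeated measures per person

# MCAR: each wave-4 value is missing with fixed probability 0.2,
# independent of any data, observed or not.
mcar = y.copy()
mcar[rng.random(n) < 0.2, 3] = np.nan

# MAR: wave-4 missingness depends only on the *observed* wave-1 score
# (higher wave-1 scores -> higher dropout probability).
p_miss = 1 / (1 + np.exp(-(y[:, 0] - 1)))  # logistic in observed y1
mar = y.copy()
mar[rng.random(n) < p_miss, 3] = np.nan
```

Under MNAR (non-ignorable missingness), by contrast, `p_miss` would depend on the unobserved wave-4 values themselves.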
This study explores four imputation methods: mean imputation (MI), the partial variational autoencoder (PVAE), the MNAR partial VAE (MNAR-PVAE), and random forest imputation (MissForest). MI replaces each missing value with the mean of the available data for that variable. PVAE, a deep learning model, trains an encoder-decoder network to learn the data structure and generate imputed values. MNAR-PVAE extends PVAE by estimating the probability of missingness and optimizing a combined loss function. MissForest iteratively trains random forest models to predict missing values from the observed portions of the data (Stekhoven & Bühlmann, 2012). Based on this, we hypothesize that:
Machine learning-based imputations outperform mean imputation in growth mixture models.
MNAR-PVAE outperforms PVAE and MissForest in handling non-ignorable missing data in growth mixture models.
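For readers unfamiliar with these methods, a rough sketch of the two non-neural approaches, using scikit-learn's IterativeImputer with random forests as a stand-in for MissForest (an assumption; the study's actual implementation and settings may differ):

```python
# Illustrative sketch: mean imputation vs. an iterative random-forest
# imputation in the spirit of MissForest. All settings are assumptions.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] += 0.8 * X[:, 0]            # correlation the imputer can exploit
X_miss = X.copy()
X_miss[rng.random(200) < 0.25, 3] = np.nan

# Mean imputation: replace each missing value with the column mean,
# ignoring relationships among variables.
X_mean = SimpleImputer(strategy="mean").fit_transform(X_miss)

# MissForest-style: iteratively regress each incomplete variable on the
# others with a random forest until the imputations stabilize.
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5, random_state=0,
)
X_rf = rf_imputer.fit_transform(X_miss)
```

Because the random forest conditions on the other variables, it can recover the dependence on the first column that mean imputation discards.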
Methods/Data Sources
Our proposed project has two components: a Monte Carlo simulation and an empirical example.
Simulation: Table 1 outlines the simulated conditions. We will use Bayesian estimation with an inverse Wishart prior specification for the variances of the growth factors (Liu et al., 2016). We will examine the means and medians of the absolute relative bias and standard error bias rates for all parameter estimates, grouped by imputation method. To identify specific differences, we will conduct a decision tree analysis (Collier et al., 2022).
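The evaluation metrics can be sketched as follows for a single parameter, using made-up replication values and assumed (standard) definitions of relative bias and standard-error bias; the study's exact formulas may differ:

```python
# Illustrative sketch: absolute relative bias and SE bias for one
# parameter across simulation replications. Values are hypothetical.
import numpy as np

theta_true = 0.50                                      # population value
estimates = np.array([0.48, 0.53, 0.47, 0.51, 0.50])   # per replication
se_estimates = np.array([0.050, 0.047, 0.052, 0.049, 0.051])

# Absolute relative bias: |mean estimate - truth| / |truth|
abs_rel_bias = abs(estimates.mean() - theta_true) / abs(theta_true)

# SE bias: average estimated SE relative to the empirical SD of the
# estimates across replications.
emp_sd = estimates.std(ddof=1)
se_bias = (se_estimates.mean() - emp_sd) / emp_sd
```

These quantities would be averaged (means and medians) over all parameters within each imputation method, as described above.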
Example: Table 2 shows findings from our application of the imputation methods using GMMs on data from the Early Childhood Longitudinal Study–Kindergarten Cohort. Following Grimm et al. (2017), we performed a k-fold cross-validation procedure for class enumeration in the GMMs, illustrating the dataset's utility for methodological research.
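A simplified sketch of cross-validated class enumeration, using scikit-learn's GaussianMixture as a stand-in for a growth mixture model (an assumption for illustration only; a GMM models growth trajectories rather than raw multivariate means):

```python
# Illustrative sketch: choose the number of latent classes by k-fold
# cross-validated held-out log-likelihood. Data are toy two-class draws.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 4)),   # class 1
               rng.normal(3.0, 1.0, size=(150, 4))])  # class 2

def cv_loglik(X, n_classes, n_splits=5):
    """Mean held-out per-sample log-likelihood for a class count."""
    scores = []
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
        gm = GaussianMixture(n_components=n_classes, random_state=0)
        gm.fit(X[train])
        scores.append(gm.score(X[test]))  # avg log-likelihood on held-out fold
    return float(np.mean(scores))

# Enumerate classes: pick the count with the best held-out fit.
best_k = max(range(1, 5), key=lambda k: cv_loglik(X, k))
```

The one-class solution should fit the held-out folds worse than any multi-class solution here, mirroring how cross-validation guards against choosing too few classes.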
Scholarly Significance
We hypothesize that random forest models require less data than deep learning models to produce accurate imputations; specifically, for small sample sizes, MissForest may outperform both PVAE and MNAR-PVAE. In cases of Missing Not At Random (MNAR) data, we expect MNAR-PVAE to outperform PVAE because it explicitly models the missing data mechanism. These insights could improve the selection of imputation methods across data scenarios, enhancing the reliability of statistical analyses in complex datasets.