Paper Summary

A Comparison of Methods of Automated Indicator Selection for Latent Class Analysis

Fri, April 12, 4:55 to 6:25pm, Philadelphia Marriott Downtown, Floor: Level 4, Room 403

Abstract

Researchers choose the number of classes in latent class analysis (LCA) by comparing fit information across models with different numbers of classes (Nylund et al., 2007). The chosen number and configuration of latent classes are contingent upon the set of indicators included in the model. This is of little concern when only a small number of indicators is available. However, large numbers of potential indicators are widely available in system log data collected from virtual learning environments, such as learning management systems, intelligent tutoring systems, and massive open online courses (MOOCs). For example, a MOOC may record the number of videos started and completed and the number of pauses, rewinds, and skips. Additional indicators can be created by counting how many students completed different percentages of a video (e.g., 25%, 50%). All these indicators are potentially informative for class enumeration, but some may be redundant or irrelevant. A relevant indicator has unique population parameters that differ between classes. A redundant indicator has the same population parameters as an indicator already in the set of relevant indicators. An irrelevant indicator has the same population parameters across classes, and therefore cannot differentiate the classes.
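To make these definitions concrete, the following minimal sketch (not part of the study) simulates a two-class population with one binary indicator of each type; all parameter values are hypothetical and chosen only for illustration.

```python
# Hypothetical illustration of relevant, redundant, and irrelevant indicators
# in a two-class population with binary items.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
klass = rng.integers(0, 2, size=n)  # true class membership (0 or 1)

# Relevant indicator: response probabilities differ between classes.
p_relevant = np.where(klass == 0, 0.2, 0.8)
relevant = rng.binomial(1, p_relevant)

# Redundant indicator: same population parameters as the relevant indicator,
# so it adds no information beyond what `relevant` already provides.
redundant = rng.binomial(1, p_relevant)

# Irrelevant indicator: identical parameters across classes, so it cannot
# differentiate the classes.
irrelevant = rng.binomial(1, np.full(n, 0.5))
```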

Three approaches have been proposed for selecting indicators in LCA. Dean and Raftery (2010) proposed an algorithm that assesses each indicator's usefulness by iteratively comparing two models: one in which the indicator is useful for clustering and one in which it is not. Fop et al. (2017) proposed a swap-stepwise selection algorithm that adds swapping steps to the usual inclusion and exclusion steps. Marbac and Sedki (2017) proposed a two-step method that selects indicators and enumerates classes by maximizing the integrated complete-data likelihood (MICL) criterion. However, these three methods have not been extensively compared in situations where the number of indicators is large and the percentages of irrelevant and redundant indicators vary.
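As a rough illustration of the model comparison underlying the Dean and Raftery (2010) approach, the sketch below contrasts, via BIC, a model in which a binary indicator's distribution depends on class with one in which it does not. It simplifies heavily: class memberships are treated as known, whereas the actual algorithm estimates both models by maximum likelihood within an iterative selection procedure; the function name and setup are hypothetical.

```python
# Schematic BIC comparison for a single candidate indicator, assuming binary
# items and known class memberships (a strong simplification of the real method).
import numpy as np

def bernoulli_loglik(x, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def indicator_useful(x, klass, n_classes=2):
    n = len(x)
    # Model A: the indicator's distribution depends on class (clustering model),
    # with one Bernoulli parameter per class.
    ll_a = sum(bernoulli_loglik(x[klass == k], x[klass == k].mean())
               for k in range(n_classes))
    bic_a = ll_a - 0.5 * n_classes * np.log(n)
    # Model B: one common distribution, i.e., the indicator is irrelevant.
    bic_b = bernoulli_loglik(x, x.mean()) - 0.5 * np.log(n)
    return bic_a > bic_b  # prefer the clustering model => keep the indicator
```

Applied to the simulated items above, `indicator_useful(relevant, klass)` would typically return True, while `indicator_useful(irrelevant, klass)` would return False because the class-specific parameters do not repay their BIC penalty.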

We will conduct a Monte Carlo simulation to address three questions: (1) How frequently do the three methods identify the correct number of latent classes? (2) How frequently do the methods select relevant indicators? (3) How frequently do the methods exclude irrelevant and redundant indicators?

We will manipulate six conditions: 1) number of true classes in the population (2, 4); 2) number of relevant indicators (10, 50); 3) sample size (1,000, 4,000); 4) number of redundant indicators (0, 5, 10); 5) number of irrelevant indicators (5, 20, 100); and 6) scale of indicators (binary, 5-point Likert). One hundred datasets will be simulated for each of the 144 conditions and analyzed with the three variable selection methods. The outcomes are the percentages of correct class enumeration, of relevant indicators included, and of irrelevant and redundant indicators excluded. An applied example will also be provided using data from a virtual learning environment for Algebra. The results will inform applied researchers interested in using LCA with large-scale datasets.
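For reference, a short sketch of the design grid, assuming the six factors above are fully crossed; the 2 × 2 × 2 × 3 × 3 × 2 crossing yields the 144 conditions cited in the text.

```python
# Enumerate the fully crossed simulation design described above.
from itertools import product

conditions = list(product(
    [2, 4],                  # true number of classes
    [10, 50],                # relevant indicators
    [1000, 4000],            # sample size
    [0, 5, 10],              # redundant indicators
    [5, 20, 100],            # irrelevant indicators
    ["binary", "likert5"],   # indicator scale
))
assert len(conditions) == 144  # 100 simulated datasets per condition
```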
