Paper Summary
Using New Items Operationally in Adaptive Tests

Mon, May 1, 2:15 to 3:45pm, Henry B. Gonzalez Convention Center, Floor: River Level, Room 7A

Abstract

The NMAT-by-GMAC™ exam is a graduate business school admission test used in India. Before 2015, the NMAT exam was used exclusively by one school. It was based on the linear-on-the-fly testing (LOFT) design, and the item pool for each administration contained a large number of new items. To keep test difficulty consistent across forms, subject-matter experts (SMEs) assigned new items difficulty ratings such as “Easy,” “Medium,” or “Difficult.” The test blueprint specified the number of items by difficulty level within each content domain. New-item calibration and candidate scoring were conducted after the testing window closed, and form difficulty variations were adjusted through IRT post-equating. Until 2015, the LOFT administration worked effectively because the item pool was constructed to yield higher test information around the university’s cut score, thereby leading to higher decision consistency for admission decisions.
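For context on the decision-consistency claim above, here is a minimal sketch of how test information concentrated at a cut score maps onto the conditional standard error of measurement. It assumes a 3PL model and randomly drawn item parameters, neither of which the abstract specifies:

```python
import numpy as np

def item_info_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c))**2

# Illustrative 40-item form with randomly drawn parameters.
rng = np.random.default_rng(0)
a = rng.uniform(0.8, 2.0, 40)    # discrimination
b = rng.normal(0.0, 1.0, 40)     # difficulty
c = rng.uniform(0.10, 0.25, 40)  # pseudo-guessing

cut = 0.5  # hypothetical cut score on the theta scale
test_info = item_info_3pl(cut, a, b, c).sum()
sem_at_cut = 1.0 / np.sqrt(test_info)  # conditional SEM at the cut
print(f"Test information at cut: {test_info:.2f}, SEM: {sem_at_cut:.3f}")
```

Concentrating pool information near the cut pushes the conditional SEM down exactly where pass/fail decisions are made, which is what made the single-school LOFT design effective.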
In 2015, GMAC acquired the NMAT exam, and 17 additional universities in India began using NMAT-by-GMAC™ scores in their admissions processes. Because these schools do not necessarily seek candidates with the same academic/performance profile, the NMAT-by-GMAC™ cut scores may differ substantially across schools. Hence, the test needs to be redesigned to offer the same level of measurement precision across a much wider score range. An adaptive design for the NMAT-by-GMAC™ exam could satisfy the schools’ needs better than the current LOFT design. Two operational challenges must be overcome, however. First, new items do not have IRT parameters, so the difficulty rating method will not work for CAT. Second, with CAT administration, the observed response data for each item are often drawn from an ability range much narrower than the population’s, which can be a barrier to obtaining unbiased parameter estimates for new items. For the first challenge, the proposed solution is to impute parameters for new items using the parameter means of existing items in the same content domain with the same difficulty rating. These imputed values, along with previously calibrated parameters for linking items, will be used only to compute interim thetas and to select items adaptively, but the post-administration calibrated and scaled item parameters will be used in the final scoring. For the second challenge, the proposed solution is a hybrid design that combines a LOFT component with a CAT component and uses only the LOFT data for new-item calibration.
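To make the first proposed solution concrete, the following is a minimal sketch of the imputation step, assuming a 2PL parameterization and made-up pool data (the abstract does not state which IRT model is used). A new item receives the mean parameters of calibrated items sharing its content domain and SME difficulty rating, and those provisional values feed only interim theta estimation and adaptive item selection:

```python
import numpy as np
import pandas as pd

# Hypothetical calibrated pool; domains, ratings, and parameter values
# are illustrative only.
pool = pd.DataFrame({
    "domain": ["Quant", "Quant", "Quant", "Language", "Language"],
    "rating": ["Easy", "Medium", "Medium", "Easy", "Difficult"],
    "a": [1.1, 1.4, 1.2, 0.9, 1.6],
    "b": [-0.8, 0.3, 0.1, -1.1, 1.2],
})

# Impute provisional parameters for a new item as the mean of calibrated
# items in the same content-domain x difficulty-rating cell.
cell_means = pool.groupby(["domain", "rating"])[["a", "b"]].mean()

def impute_new_item(domain: str, rating: str) -> pd.Series:
    return cell_means.loc[(domain, rating)]

new_item = impute_new_item("Quant", "Medium")  # provisional a and b

# The imputed values drive interim theta updates and maximum-information
# item selection only; final scoring uses the post-administration
# calibrated and scaled parameters instead.
def info_2pl(theta: float, a: float, b: float) -> float:
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

theta_interim = 0.4
print(info_2pl(theta_interim, new_item["a"], new_item["b"]))
```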

The purpose of this study was to evaluate this hybrid design thoroughly with respect to parameter estimation accuracy, item usage, and test efficiency. It was compared against four other designs (LOFT, MST, Random+CAT, and CAT) across three exams (Language Skills, Quantitative Skills, and Logical Reasoning), two item pool sizes (600 and 720 items), three proportions of linking items (20%, 50%, and 80%), and two calibration strategies (all vs. half of the data). Findings from this study support the innovative features embedded in the proposed design. They also provide test developers and practitioners with a real-world example of an innovative approach that takes advantage of both the LOFT and CAT designs and addresses the challenges related to using new items in adaptive testing.
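The abstract does not state whether these factors were fully crossed; if they were, the simulation spans 180 conditions, as a quick enumeration shows:

```python
from itertools import product

designs = ["Hybrid", "LOFT", "MST", "Random+CAT", "CAT"]
exams   = ["Language Skills", "Quantitative Skills", "Logical Reasoning"]
pools   = [600, 720]            # item pool sizes
linking = [0.20, 0.50, 0.80]    # proportion of linking items
calib   = ["all data", "half data"]

conditions = list(product(designs, exams, pools, linking, calib))
print(len(conditions))  # 5 * 3 * 2 * 3 * 2 = 180
```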

Authors