Paper Summary

Towards Scalable Vocabulary Assessment: Pilot Evidence for Reliable AI-Generated Items

Fri, April 10, 9:45 to 11:15am PDT, Westin Bonaventure, Floor: Lobby Level, Beaudry A

Abstract

Objectives: This study addresses a persistent challenge in literacy development: the scalable and psychometrically valid assessment of vocabulary knowledge. The objective was to develop, pilot, and evaluate an AI-enabled item generation pipeline that efficiently produces high-quality vocabulary assessment items. The overarching goal is to enable broader and more informative vocabulary testing in diverse educational settings.

Theoretical Framework: Our work is situated within a developmental framework that views vocabulary growth as a critical component of reading comprehension and academic success. We draw on lexical dimensionality theory (Author et al., 2024) and psychometric models of word knowledge, integrating cognitive developmental theory with current advances in natural language processing and applications of large language models to assessment development.

Methods: We prompted GPT-4 to generate contextualized multiple-choice items for a set of 27 target words stratified across lexical dimensions. Each item included a sentence stem, a synonym as the correct option, and three distractors. Items were filtered through strict linguistic and psychometric criteria, which we document. We piloted the resulting item set with 238 secondary students and conducted both classical and item response theory (IRT) analyses to evaluate item quality, test dimensionality, and internal consistency.
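As an illustration of the generation step, the sketch below shows how a single item might be requested from GPT-4 via the OpenAI Python SDK; the prompt wording, JSON field names, and sampling settings are illustrative assumptions rather than the study's exact pipeline.

```python
# Illustrative sketch of one item-generation call (not the study's exact prompts).
# Assumes the OpenAI Python SDK (v1.x) with OPENAI_API_KEY set in the environment.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Write one multiple-choice vocabulary item for the target word '{word}'. "
    "Return JSON with keys: 'stem' (a sentence using the word in context), "
    "'key' (a one-word synonym of the target word), and "
    "'distractors' (a list of three plausible but incorrect options)."
)

def generate_item(word: str) -> dict:
    """Request a contextualized item: sentence stem, synonym key, three distractors."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(word=word)}],
        temperature=0.7,
    )
    return json.loads(response.choices[0].message.content)

# Draft item for one target word; drafts are then screened against linguistic
# and psychometric filtering criteria before piloting.
draft = generate_item("elegy")
```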

The pilot included diverse student groups (6th–11th grade) in a mid-sized rural district. Items were based on target words sampled to cover lexical features across frequency, complexity, and polysemy. Response data were analyzed for internal consistency, item difficulty, discrimination, and dimensionality. Exploratory factor analysis and IRT modeling (2PL) were used to investigate psychometric properties.
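The classical portion of these analyses can be summarized in a short sketch, assuming a 238 × 27 matrix of dichotomously scored responses; the file name and variable names below are illustrative.

```python
# Sketch of classical item analyses on a students-by-items 0/1 score matrix.
# The CSV file name is hypothetical; columns are assumed to be the 27 items.
import numpy as np
import pandas as pd

responses = pd.read_csv("scored_responses.csv")   # 238 rows x 27 item columns
k = responses.shape[1]
total = responses.sum(axis=1)

# Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of total)
alpha = (k / (k - 1)) * (1 - responses.var(ddof=1).sum() / total.var(ddof=1))

# Item difficulty: classical p value = proportion of correct responses per item
difficulty = responses.mean()

# Discrimination: corrected point-biserial (item vs. total score excluding the item)
discrimination = {
    item: np.corrcoef(responses[item], total - responses[item])[0, 1]
    for item in responses.columns
}

summary = pd.DataFrame({"p": difficulty, "r_pb": pd.Series(discrimination)})
print(f"Cronbach's alpha = {alpha:.2f}")
print(summary.round(2))
```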

Results: The 27-item test demonstrated excellent reliability (Cronbach's α = 0.92) and a normal score distribution. Items showed a broad range of difficulty (item p values, i.e., proportion correct, ranging from .17 to .79), and most had high discrimination (point-biserial correlations > .30). Factor analysis supported unidimensionality, and 2PL IRT modeling confirmed strong variation in item informativeness. We identified problematic items (e.g., elegy) and used item-level diagnostics to flag them for revision. Results suggest that AI-generated items can function comparably to traditional items in distinguishing vocabulary knowledge among students.
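For reference, the standard 2PL model specifies the probability of a correct response to item j for a student with latent ability θ, together with the item information function that quantifies an item's informativeness:

```latex
% Standard 2PL response function and item information (a_j: discrimination, b_j: difficulty).
P_j(\theta) = \frac{1}{1 + \exp\left[-a_j(\theta - b_j)\right]},
\qquad
I_j(\theta) = a_j^{2}\, P_j(\theta)\left[1 - P_j(\theta)\right].
```

Items with larger a_j contribute more information near θ = b_j, which is the sense in which informativeness varies across items.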

Scientific Significance: This study contributes to both developmental theory and assessment practice. First, it provides empirical evidence that AI-generated items can yield reliable, valid assessments at scale. Second, it supports a shift in vocabulary research toward within-person variability, leveraging dense sampling across lexical features. Finally, it lays the foundation for a scalable, equitable assessment embedded within an online, automated screening tool. A 1,000-item follow-up study launching in Fall 2025 will further test generalizability, differential item functioning (DIF), and dimensionality, offering a publicly available item bank aligned with instructional needs.
