1. Purpose
Traditional item development for Computerized Adaptive Testing (CAT; Chang, 2015) is costly and slow. While generative AI automates item generation, it intensifies validation challenges, especially for overlooked incidental content. Existing automated methods (e.g., BLEU, cosine similarity) lack semantic depth and are not robust to wording changes (Xu et al., 2024). This research introduces an LLM-powered framework for Automated Item Content Analysis (AICA) that quantifies structural and semantic similarity to identify incidental redundancies. By integrating this analysis into CAT with similarity-based constraints, we enhance item bank diversity, improve content balance, and enable intelligent exposure control, thereby boosting test engagement and psychometric quality.
2. Framework
Incidental content similarity is conceptualized through two novel components: (1) structural decomposition (S_1), capturing surface-level features such as item format, sentence structure, and lexical patterns; and (2) semantic relatedness (S_2), capturing conceptual depth such as overlapping themes, lexical semantics, and emotional tone. Pairwise similarity between items i and j is quantified as S_ij = w_1 S_1,ij + w_2 S_2,ij, where w_1 + w_2 = 1 and the weights w_1, w_2 are domain-adjusted. This composite metric enables systematic redundancy detection for practical CAT design.
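To make the composite metric concrete, here is a minimal Python sketch, assuming the two LLM-scored matrices S_1 and S_2 are already available as symmetric NumPy arrays with entries in [0, 1]; the function name and toy values are illustrative, not part of the study's code.

```python
import numpy as np

# Illustrative sketch: combine the structural (S1) and semantic (S2)
# LLM-scored similarity matrices into the composite metric. Both are
# assumed to be symmetric n x n matrices with entries in [0, 1].
def composite_similarity(S1: np.ndarray, S2: np.ndarray,
                         w1: float = 0.3, w2: float = 0.7) -> np.ndarray:
    assert abs(w1 + w2 - 1.0) < 1e-9, "weights must sum to 1"
    return w1 * S1 + w2 * S2

# Toy example with 3 items: a pair scoring 0.6 structurally and 0.8
# semantically receives a composite score of 0.3*0.6 + 0.7*0.8 = 0.74.
S1 = np.array([[1.0, 0.6, 0.2],
               [0.6, 1.0, 0.4],
               [0.2, 0.4, 1.0]])
S2 = np.array([[1.0, 0.8, 0.1],
               [0.8, 1.0, 0.3],
               [0.1, 0.3, 1.0]])
S = composite_similarity(S1, S2)  # S[0, 1] == 0.74
```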
3. Method
We designed two prompt templates (using Claude Sonnet 4) to quantify incidental content similarity via structural decomposition and semantic relatedness analyses (Table 2.4). The resulting similarity matrices were combined into a final metric and converted to distances for hierarchical clustering. The framework was validated through (a) human evaluation (agreement ratings plus rank correlations: Kendall's τ, Spearman's ρ) and (b) CAT simulations (2,000 examinees × 50 replications) comparing cluster-constrained maximum Fisher information (CMFI) selection against random and MFI selection, evaluating RMSE, bias, and cluster diversity. The constrained item selection was implemented using the two-phase item selection strategy of Cheng et al. (2007) together with the maximum priority index (MPI) method of Cheng and Chang (2009).
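The similarity-to-distance conversion and clustering step could look like the sketch below. It assumes the common transform D = 1 - S and SciPy's average-linkage agglomerative clustering; the abstract does not specify the exact transform or linkage criterion, so both are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Sketch of the clustering step, taking S as the composite similarity
# matrix. D = 1 - S is one common similarity-to-distance conversion;
# the study's exact choice may differ.
def cluster_items(S: np.ndarray, n_clusters: int = 4) -> np.ndarray:
    D = 1.0 - S
    np.fill_diagonal(D, 0.0)
    condensed = squareform(D, checks=False)   # condensed distance vector
    Z = linkage(condensed, method="average")  # hierarchical clustering
    # labels[i] is the cluster assignment (1..n_clusters) of item i
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```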
4. Data Sources
This study evaluates the framework using 36 Likert-type items from the Experiences in Close Relationships scale (ECR; Kilmen & Bulut, 2025), which assesses romantic attachment. Response data from 51,491 multinational participants were sourced via the Open-Source Psychometric Project. Structural and semantic weights (e.g., w_1 = 0.3, w_2 = 0.7) were applied to quantify incidental similarity.
5. Preliminary Results
Exploratory analysis revealed consistent similarity distributions and dimensional alignment (Figures 2.5-2.6), with LLM chain-of-thought rationales validating edge-case judgments. Human evaluation used randomized item pairs, and an internal pilot showed acceptable consistency; a broader human evaluation is planned to validate reliability and generalizability. For the CAT simulation, a graded response model was fit to 38,933 cleaned responses. Items were clustered into four groups via the LLM-based similarity metrics, ensuring high intra-cluster homogeneity and inter-cluster heterogeneity. Compared to unconstrained methods, the cluster-constrained selection method maintained ability estimation accuracy while presenting more varied item content, enhancing the test experience (Table 2.5).
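As a simplified illustration of cluster-constrained selection (not the full two-phase/MPI procedure of Cheng et al., 2007 and Cheng & Chang, 2009), the following sketch picks the most informative remaining item among clusters still under a hypothetical per-cluster quota; the quota mechanism and all names here are assumptions for exposition.

```python
import numpy as np

# Illustrative constrained-selection step. Given Fisher information for
# each item at the current theta estimate (info), the items' cluster
# labels, the set of already administered items, and a per-cluster quota,
# select the most informative eligible item.
def select_item(info: np.ndarray, labels: np.ndarray,
                administered: set, quota: dict) -> int | None:
    counts = {c: 0 for c in quota}
    for j in administered:
        counts[labels[j]] += 1
    best, best_info = None, -np.inf
    for j in range(len(info)):
        if j in administered:
            continue
        if counts[labels[j]] >= quota[labels[j]]:
            continue  # cluster quota reached; skip to preserve diversity
        if info[j] > best_info:
            best, best_info = j, info[j]
    return best
```

Unconstrained MFI selection corresponds to the special case where every quota is at least the test length, so the cluster check never filters an item.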
6. Significance
This study pioneers the use of LLMs for quantifying incidental content similarity, addressing a critical gap in automated item validation. By integrating semantic-structural analysis into operational CAT via cluster constraints, we achieve greater content diversity without compromising psychometric accuracy. This framework not only mitigates test-taker fatigue but also establishes a scalable quality assurance protocol for AI-generated item banks, advancing adaptive testing toward greater engagement and validity.