Paper Summary

Comparative Analysis on Item Difficulty Predictions Between National and International Assessment Programs

Sun, April 12, 9:45 to 11:15am PDT, InterContinental Los Angeles Downtown, Floor: 5th Floor, Boyle Heights

Abstract

Advances in item difficulty modeling (IDM) are creating new opportunities to link observable item features, such as linguistic complexity, cognitive demand, and content domain, to psychometric outcomes like item difficulty and discrimination. Large-scale assessments like the National Assessment of Educational Progress (NAEP) and the Trends in International Mathematics and Science Study (TIMSS) are designed to monitor student achievement and inform educational policy. While these programs share commonalities in mathematics content structure, they differ in purpose, population coverage, and item frameworks. Cross-contextual comparisons of IDM can enhance our understanding of test design, validity, and construct representation across systems.

Traditional IDM approaches typically rely on expert-coded features, which, while valuable, are time-intensive and may be limited in consistency and scalability. Recent advances in natural language processing (NLP) have enabled automated extraction of item features. However, many NLP-based models rely heavily on surface-level or lexical features, often failing to capture the deeper cognitive reasoning processes that influence how students approach and solve mathematical problems. To address this gap, this study uses a large language model (LLM), GPT-4.1, to extract cognitively meaningful features from items and examine their ability to predict ordinal item difficulty. Two research questions guide the study:
1. To what extent can LLM-extracted cognitive features predict item difficulty across NAEP and TIMSS mathematics items?
2. How do feature importance patterns differ between national and international assessment contexts?

The analysis uses restricted-use mathematics items from NAEP (2017, 2022) and TIMSS (2015, 2019) for Grades 4 and 8. Item difficulty is modeled as an ordinal variable with five levels (from Very Easy to Very Difficult). Cognitive features to be extracted include the number of unknowns, computation steps, relational definitions, contextual framing (abstract vs. real-world), and equation type. These features are derived using GPT-4.1 via OpenAI's API and the ellmer package in R. Five machine learning algorithms—Naive Bayes, SVM, CART, Random Forest, and XGBoost—are trained using an 80/20 train-test split, with 5-fold cross-validation and hyperparameter tuning via grid search. Performance is assessed using both standard and extended accuracy metrics, the latter accounting for adjacent-category predictions in ordinal classification (Štěpánek et al., 2023).
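As a rough illustration of the feature-extraction step, the sketch below requests a structured set of cognitive features for a single item from GPT-4.1 through the ellmer package. The prompt wording, feature names, and example item are placeholders rather than the study's actual coding rubric, and ellmer's structured-output interface (chat_openai(), type_object(), chat_structured()) is assumed to behave as currently documented.

```r
# Sketch only: structured cognitive-feature extraction for one item via ellmer.
# Assumes ellmer's structured-output interface (chat_openai, type_object,
# chat_structured) and an OPENAI_API_KEY in the environment; prompt wording and
# feature names are illustrative, not the study's actual coding rubric.
library(ellmer)

chat <- chat_openai(
  model = "gpt-4.1",
  system_prompt = "You code mathematics assessment items for cognitive features."
)

# The structured record we want back for each item.
feature_spec <- type_object(
  n_unknowns      = type_integer("Number of unknown quantities the solver must find."),
  n_steps         = type_integer("Number of computation steps in a typical solution."),
  relational_defs = type_integer("Count of relational definitions (e.g., 'twice as many as')."),
  context         = type_string("Contextual framing: 'abstract' or 'real_world'."),
  equation_type   = type_string("Type of equation or expression involved, if any.")
)

item_text <- "A bakery sells muffins in boxes of 6. If Mia buys 4 boxes, how many muffins does she have?"

features <- chat$chat_structured(
  paste("Extract the cognitive features of this item:", item_text),
  type = feature_spec
)
str(features)  # named list of extracted features for this item
```

In practice this call would be looped over the full item pool and the results bound into a feature matrix for modeling.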
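The model-training setup can be sketched for one of the five learners. The snippet below fits a Random Forest with caret using an 80/20 split, 5-fold cross-validation, and a small grid; the data frame name, its columns, and the grid values are hypothetical stand-ins for the study's actual feature matrix and tuning ranges.

```r
# Sketch only: one of the five learners (Random Forest) fitted with caret,
# using an 80/20 split, 5-fold cross-validation, and a small tuning grid.
# Assumes a hypothetical data frame `items` whose columns are the extracted
# features plus a 5-level factor `difficulty`.
library(caret)

set.seed(42)
idx      <- createDataPartition(items$difficulty, p = 0.80, list = FALSE)
train_df <- items[idx, ]
test_df  <- items[-idx, ]

ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
grid <- expand.grid(mtry = c(2, 3, 5))            # illustrative tuning grid

rf_fit <- train(difficulty ~ ., data = train_df,
                method = "rf", trControl = ctrl, tuneGrid = grid)

pred <- predict(rf_fit, newdata = test_df)
mean(pred == test_df$difficulty)                  # standard accuracy on the hold-out set
```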
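The distinction between standard and extended accuracy can also be made concrete. The helper below assumes the common adjacent-category convention, counting a prediction as correct when it falls within one level of the true difficulty; the intermediate level labels are placeholders, since the abstract only names the endpoints.

```r
# Sketch only: standard vs. extended accuracy for 5-level ordinal predictions,
# where "extended" credits a prediction within one category of the true level.
# Intermediate level labels are hypothetical placeholders.
difficulty_levels <- c("Very Easy", "Easy", "Medium", "Difficult", "Very Difficult")

standard_accuracy <- function(truth, pred) mean(truth == pred)

extended_accuracy <- function(truth, pred, levels = difficulty_levels) {
  t_rank <- match(truth, levels)     # ordinal rank of the true level
  p_rank <- match(pred, levels)      # ordinal rank of the predicted level
  mean(abs(t_rank - p_rank) <= 1)    # exact or adjacent counts as correct
}

# Toy example: two exact matches, one adjacent miss, one two-level miss.
truth <- c("Easy", "Medium", "Difficult", "Very Easy")
pred  <- c("Easy", "Medium", "Medium",    "Medium")
standard_accuracy(truth, pred)  # 0.50
extended_accuracy(truth, pred)  # 0.75
```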

We expect GPT-4.1 to generate cognitively grounded features that support moderate to high predictive accuracy, with ensemble methods such as Random Forest and XGBoost performing best. Extended accuracy is expected to exceed standard accuracy, since near-miss predictions in adjacent difficulty categories are credited as correct. While some features may vary in predictive power across contexts, we expect shared cognitive constructs to generalize well across NAEP and TIMSS.

This study is designed to demonstrate that large language models can extract meaningful cognitive features that predict item difficulty in a generalizable way, supporting the use of LLMs for improving item development and advancing comparability across national and international assessments. By combining machine learning and ordinal classification approaches, the study provides a scalable, evidence-based framework for enhancing fairness, efficiency, and validity in educational measurement. These findings could inform future assessment policy by offering innovative tools to ensure alignment between test design and cognitive complexity across educational systems.

Authors