Text-Based Approaches to Classification of Item Difficulty Levels

Fri, April 10, 1:45 to 3:15pm PDT, InterContinental Los Angeles Downtown, Floor: 5th Floor, Wilshire Grand Ballroom I

Abstract

1. Purpose
The estimation of item difficulty is crucial in large-scale assessment development. Item difficulty is a key parameter for evaluating item quality, ensuring that items can distinguish test-takers of varying abilities. A well-distributed range of item difficulties is essential for item banks, especially for computerized adaptive testing (CAT) based on item response theory. Information about the difficulty of new items is also important for constructing parallel operational test forms. However, current practices often assign new items to field-test forms without empirical difficulty data, potentially resulting in uneven test experiences for students. Recent studies have explored text-based approaches to item difficulty modeling, with some promising results (AlKhuzaey et al., 2023; Hsu et al., 2018; Li et al., 2025; Yaneva et al., 2019). However, estimation errors remain a challenge when using these methods.
2. Framework
Item difficulty is typically estimated on a continuous scale, such as logits or p-values. This study instead focuses on classifying item difficulty levels to support the construction of parallel field-test forms. The dataset consists of items from the College Board SAT practice item bank, including both item texts and difficulty labels. The source of the difficulty labels is not documented; for the purposes of this demonstration, they are treated as valid.
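To make the continuous-versus-ordinal distinction concrete, the sketch below shows one hypothetical way continuous difficulty estimates (here, p-values, i.e., proportions of correct responses) could be binned into ordinal levels. The cut points and level names are illustrative assumptions, not values taken from the SAT item bank.

    # Hypothetical binning of continuous difficulty into ordinal levels.
    # Cut points and labels are illustrative; they do not come from the paper.
    import pandas as pd

    p_values = pd.Series([0.85, 0.62, 0.44, 0.31, 0.12])  # proportion correct
    # A higher p-value means more test-takers answered correctly (an easier
    # item), so the lowest bin maps to "hard".
    levels = pd.cut(p_values, bins=[0.0, 0.4, 0.7, 1.0],
                    labels=["hard", "medium", "easy"])
    print(levels.tolist())  # ['easy', 'medium', 'medium', 'hard', 'hard']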
3. Method
Several pre-trained Transformer encoder models are fine-tuned to classify items into difficulty levels. These include models from the BERT family (bert-base-uncased, bert-large-uncased), RoBERTa (roberta-base, roberta-large), ALBERT (albert-base-v2), DeBERTa-v3 (deberta-v3-base, deberta-v3-large), and ELECTRA (electra-base-discriminator, electra-small-discriminator). Because item difficulty levels are ordinal, quadratic weighted kappa (QWK) is used as the evaluation metric. The dataset is split into training, validation, and test sets for model training and hyperparameter tuning, and the best-performing model is selected based on QWK.
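As a concrete illustration of this pipeline, the sketch below fine-tunes one candidate encoder for difficulty-level classification and reports QWK on a held-out set. It uses the Hugging Face transformers and datasets libraries together with scikit-learn's kappa implementation; the toy data, number of levels, and hyperparameters are placeholders, not details from the study.

    import numpy as np
    from datasets import Dataset
    from sklearn.metrics import cohen_kappa_score
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              DataCollatorWithPadding, Trainer, TrainingArguments)

    MODEL_NAME = "roberta-base"  # any encoder from the candidate list
    NUM_LEVELS = 3               # hypothetical number of difficulty levels

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=NUM_LEVELS)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    # Toy in-memory data standing in for the SAT practice item bank.
    train_ds = Dataset.from_dict(
        {"text": ["Example item stem %d ..." % i for i in range(8)],
         "label": [0, 1, 2, 1, 0, 2, 1, 0]}).map(tokenize, batched=True)
    val_ds = train_ds  # placeholder; use a genuine held-out split in practice

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        # QWK penalizes predictions farther from the true ordinal level more.
        return {"qwk": cohen_kappa_score(labels, preds, weights="quadratic")}

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="difficulty-clf",
                               per_device_train_batch_size=8,
                               num_train_epochs=3),
        train_dataset=train_ds,
        eval_dataset=val_ds,
        data_collator=DataCollatorWithPadding(tokenizer),
        compute_metrics=compute_metrics)
    trainer.train()
    print(trainer.evaluate())  # reports "eval_qwk"

In practice, model selection would repeat this loop over each candidate encoder and keep the checkpoint with the highest validation QWK.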
4. Results and Significance
It is expected that RoBERTa and DeBERTa will outperform other models, consistent with their success in related item quality tasks (Xu et al., 2025). This study demonstrates the application of recent encoder-based language models for classifying item difficulty and provides empirical evidence on their accuracy for supporting parallel test form construction.

Authors