Paper Summary

Comparing Rating Quality between GenAI and Human Raters for ELL Writing Assessment: An MFRM Analysis

Sat, April 11, 11:45am to 1:15pm PDT, InterContinental Los Angeles Downtown, Floor: 6th Floor, Mission

Abstract

This study investigated systematic differences between human and GenAI raters in ELL writing assessment using Many-Faceted Rasch Measurement (MFRM). Analysis of 140 essays from the ICNALE corpus, rated by 10 human and 10 GenAI raters across 10 analytic criteria, revealed stark contrasts: human raters showed substantial underfit (infit MSQ = 1.14-2.44), whereas GenAI raters showed extreme overfit (infit MSQ = 0.26-0.31). Despite high inter-rater agreement (κ = 0.75-0.83), PCA revealed violations of unidimensionality (eigenvalue = 38.47), suggesting the construct may be multidimensional. Human raters exhibited variable severity, in contrast with the GenAI raters' consistent leniency. Although the DIF analysis showed significant between-group differences, their practical impact was minimal. The findings highlight fundamental measurement challenges and have implications for GenAI's potential to reduce systematic biases in ELL writing assessment.
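
For context on the infit mean-square (MSQ) values reported above, the following is a minimal illustrative sketch, not taken from the paper, of how infit and outfit mean-squares are computed for a single rater from the observed scores, model-expected scores, and model variances of a fitted Rasch model; the function and variable names are assumptions used only for illustration.

    import numpy as np

    def rasch_fit_statistics(observed, expected, variance):
        # observed: ratings the rater actually assigned (1-D array)
        # expected: model-expected scores for those ratings (1-D array)
        # variance: model variance of each expected score (1-D array)
        residual = observed - expected
        z_squared = residual ** 2 / variance              # squared standardized residuals
        infit = np.sum(residual ** 2) / np.sum(variance)  # information-weighted mean-square
        outfit = np.mean(z_squared)                       # unweighted, outlier-sensitive mean-square
        return infit, outfit

Values near 1.0 indicate ratings that fit the model; values well below 1.0 (as reported for the GenAI raters) indicate overfit, i.e. ratings more predictable than the model expects, while values well above 1.0 (as for the human raters) indicate underfit, i.e. unmodeled noise.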

Authors