This study investigated systematic differences between human and GenAI raters in ELL writing assessment using Many-Faceted Rasch Measurement. Analysis of 140 essays from the ICNALE corpus, rated by 10 human and 10 GenAI raters across 10 analytic criteria, revealed stark contrasts: human raters showed substantial underfit (infit MSQ = 1.14-2.44), whereas GenAI raters showed extreme overfit (infit MSQ = 0.26-0.31). Despite high inter-rater agreement (κ = 0.75-0.83), PCA revealed violations of unidimensionality (eigenvalue = 38.47), suggesting the construct may be multidimensional. Human raters exhibited variable severity, in contrast with the GenAI raters' consistent leniency. While the DIF analysis showed significant between-group differences, their practical impact was minimal. The findings highlighted fundamental measurement challenges, with implications for GenAI's potential to reduce systematic biases in ELL writing assessment.
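As context for the fit and agreement statistics cited above, the sketch below shows how an infit mean-square and a weighted kappa are typically computed once a many-facet Rasch model has been estimated. The abstract does not name its software or data layout, so everything here is illustrative: the ratings, expected scores, and variances are hypothetical, and the `infit_msq` helper is an assumed name rather than part of the study's analysis.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def infit_msq(observed, expected, variance):
    """Information-weighted (infit) mean-square fit statistic.

    `observed` holds the ratings attributed to one facet element (e.g. one
    rater); `expected` and `variance` are the model-expected scores and
    score variances from a fitted many-facet Rasch model.
    Values near 1.0 indicate good fit; values well below 1.0 indicate
    overfit (muted, overly predictable ratings); values well above 1.0
    indicate underfit (noisy ratings).
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variance = np.asarray(variance, dtype=float)
    residuals = observed - expected
    # Infit MSQ = sum of squared residuals / sum of model variances.
    return residuals.dot(residuals) / variance.sum()

# Illustrative data only (not from the study): ratings on a 0-4 scale for a
# handful of essay-by-criterion observations scored by two raters.
human_ratings = np.array([3, 2, 4, 1, 3, 2, 4, 3])
genai_ratings = np.array([3, 3, 4, 2, 3, 3, 4, 3])

# Hypothetical model expectations and variances for the human rater's
# observations, as they might be exported from FACETS, TAM, or similar.
expected = np.array([2.8, 2.4, 3.6, 1.5, 2.9, 2.3, 3.7, 2.9])
variance = np.array([0.6, 0.7, 0.5, 0.8, 0.6, 0.7, 0.5, 0.6])

print("infit MSQ:", round(infit_msq(human_ratings, expected, variance), 2))

# Quadratic-weighted kappa between a human and a GenAI rater, the kind of
# agreement statistic summarized above (kappa = .75-.83 in the abstract).
print("weighted kappa:",
      round(cohen_kappa_score(human_ratings, genai_ratings,
                              weights="quadratic"), 2))
```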