Paper Summary
Judging the Judges: A Rasch Analysis of Human and Large Language Model Raters

Sat, April 11, 11:45am to 1:15pm PDT, InterContinental Los Angeles Downtown, 6th Floor, Mission

Abstract

Large language models (LLMs) have shown promising results in automated writing evaluation, including score assignment and feedback comments. Their performance, however, is typically evaluated only through agreement with human scoring. This research examines the psychometric properties of rater effects in LLMs using Many-Facet Rasch Measurement. Drawing on an English Language Arts assessment dataset, the analysis shows that the LLM mirrored the overall severity of the human expert and achieved good score agreement. However, the feedback it recommended diverged from the human rater's. Moreover, a domain-by-rater interaction effect emerged: the LLM was relatively lenient on idea development but more severe on language conventions. This finding suggests that even when globally aligned with a human rater, LLMs can exhibit subtle variations in scoring.
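For readers unfamiliar with the method, the Many-Facet Rasch Measurement model referenced in the abstract is conventionally written (in the standard Linacre-style formulation; the notation below is generic and not reproduced from the paper) as:

\log \frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \alpha_j - \tau_k

where \theta_n is the proficiency of examinee n, \delta_i is the difficulty of domain i, \alpha_j is the severity of rater j (human or LLM), and \tau_k is the threshold between score categories k-1 and k. The "severity" and "leniency" discussed in the abstract correspond to estimates of \alpha_j, and the reported domain-by-rater interaction reflects a bias term added to capture how a given rater's severity shifts across domains.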

Authors