Paper Summary

Comparing English Essay Scoring: LLMs vs. Human Raters

Sat, April 11, 11:45am to 1:15pm PDT, JW Marriott Los Angeles L.A. LIVE, Floor: Ground Floor, Gold 4

Abstract

This study investigates the psychometric properties and scoring characteristics of Large Language Models (LLMs) as automated essay raters, benchmarking their performance against human judgements. Using 1,244 non-native English essays from the Cambridge Learner Corpus, we conducted a comparative analysis of holistic scores from human raters, ChatGPT-4o, and Gemma3-12B. Results indicate significant differences in scoring patterns and moderate inter-rater reliability (ICC = 0.54). While LLM scores showed promising correlations with human scores (r = 0.58 for ChatGPT-4o), Bland-Altman analysis revealed systematic discrepancies, particularly at the score extremes. This research offers insights into the reliability, agreement, and potential biases of LLM-based assessment, highlighting both the capabilities and the limitations of LLMs for robust educational applications.
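
For readers unfamiliar with the agreement statistics named above, the following is a minimal sketch of how ICC, Pearson correlation, and Bland-Altman limits of agreement might be computed for paired human and LLM scores. It is not the paper's actual analysis pipeline; the long-format layout and the column names (essay_id, rater, score) are illustrative assumptions.

```python
# Hedged sketch: agreement statistics for human vs. LLM essay scores.
# Assumes a long-format DataFrame with columns essay_id, rater, score
# and exactly two rater labels (e.g. 'human' and 'llm'); these names
# are assumptions for illustration, not the study's data schema.
import pandas as pd
import pingouin as pg
from scipy import stats


def agreement_stats(long_df: pd.DataFrame) -> dict:
    # Intraclass correlation table (pingouin reports ICC1..ICC3k);
    # here we pick ICC2, a two-way random-effects, absolute-agreement form.
    icc_table = pg.intraclass_corr(
        data=long_df, targets="essay_id", raters="rater", ratings="score"
    )
    icc2 = icc_table.loc[icc_table["Type"] == "ICC2", "ICC"].iloc[0]

    # Reshape to wide: one row per essay, one column per rater.
    wide = long_df.pivot(index="essay_id", columns="rater", values="score")
    rater_a, rater_b = wide.iloc[:, 0], wide.iloc[:, 1]

    # Pearson correlation between the two sets of holistic scores.
    r, _ = stats.pearsonr(rater_a, rater_b)

    # Bland-Altman: mean difference (bias) and 95% limits of agreement.
    diff = rater_b - rater_a
    bias = diff.mean()
    half_width = 1.96 * diff.std(ddof=1)

    return {
        "icc2": icc2,
        "pearson_r": r,
        "bias": bias,
        "loa_lower": bias - half_width,
        "loa_upper": bias + half_width,
    }
```

Under these assumptions, a call such as agreement_stats(scores_df) would return the kinds of quantities summarised in the abstract: an ICC value, a Pearson r between human and LLM scores, and the Bland-Altman bias with its limits of agreement.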