Paper Summary
Divergent Preferences Between Human Raters and Large Language Models in EFL Essay Scoring

Sat, April 11, 1:45 to 3:15pm PDT, InterContinental Los Angeles Downtown, Floor: 7th Floor, Hollywood Ballroom I

Abstract

This study investigated the latent scoring preferences of a large language model (Qwen) versus human raters in EFL essay scoring, analyzing how they differentially weighted 16 textual features across 505 English essays written by Chinese high school students. While overall scores showed high consistency, the LLM exhibited a marked preference for language features (grammatical accuracy, lexical sophistication, syntactic complexity), especially in high-scoring essays. Human raters, conversely, tolerated minor errors and placed greater emphasis on visual presentation and essay content. In LLM ratings, lexical complexity compensated for grammar and spelling errors, and the LLM demonstrated higher cross-group rating stability. The study advises caution when applying LLMs to essay scoring and feedback, stressing the need for enhanced transparency to improve test validity and the quality of AI-generated feedback.

Authors