Paper Summary

LLM Raters of Writing and Reasoning: Towards Valid and Efficient Automated Assessment

Sat, April 11, 9:45 to 11:15am PDT, JW Marriott Los Angeles L.A. LIVE, Floor: Ground Floor, Gold 4

Abstract

This study examined the reliability and efficiency of GPT-4o as a scorer of middle and high school students’ written evaluations of scientific evidence. With a rubric-anchored prompt, exemplar calibration, and a self-consistency protocol, the model achieved near-human interrater agreement (κ = .731–.754) across three interactive scoring batches. Performance collapsed (κ = –.051) when we decontextualized the prompts and removed human oversight. Compared to human scoring, the AI workflow reduced total labor hours, though transcription time offset some of these gains. We conducted all scoring through a no-code interface, demonstrating that non-technical educators and researchers can access valid and efficient LLM-based assessment. Findings highlight both the promise and the limits of deploying LLMs for automated writing assessment in authentic classroom contexts.
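
For readers who prefer a programmatic route rather than the no-code interface used in the study, the sketch below shows one plausible way to combine rubric-anchored prompting with a self-consistency protocol (sampling several scores per essay and keeping the modal value), followed by an agreement check against human ratings. The rubric text, the 0–4 scale, the sample count, and the helper names (score_response, RUBRIC) are illustrative assumptions rather than the authors' materials; the OpenAI chat-completions call and scikit-learn's cohen_kappa_score are standard library usage, and the paper's exact κ variant is not specified in the abstract.

    # Minimal sketch (assumed 0-4 integer rubric scale, 5 self-consistency
    # samples per essay, majority vote; none of this is taken from the paper).
    from collections import Counter

    from openai import OpenAI
    from sklearn.metrics import cohen_kappa_score

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    RUBRIC = """Score the student's evaluation of scientific evidence from 0 to 4.
    0 = no evaluation of the evidence ... 4 = evidence quality judged with explicit criteria.
    Reply with a single integer only."""  # placeholder rubric text, not the study's rubric

    def score_response(student_text: str, n_samples: int = 5) -> int:
        """Ask GPT-4o for several independent scores and return the modal one."""
        scores = []
        for _ in range(n_samples):
            reply = client.chat.completions.create(
                model="gpt-4o",
                temperature=1.0,  # some diversity is needed for self-consistency voting
                messages=[
                    {"role": "system", "content": RUBRIC},
                    {"role": "user", "content": student_text},
                ],
            )
            # Assumes the model follows the "single integer only" instruction.
            scores.append(int(reply.choices[0].message.content.strip()))
        return Counter(scores).most_common(1)[0][0]

    # Agreement check against a human-scored batch (hypothetical data).
    essays = ["essay 1 ...", "essay 2 ...", "essay 3 ...", "essay 4 ..."]
    human_scores = [3, 2, 4, 1]
    ai_scores = [score_response(text) for text in essays]
    print("Cohen's kappa:", cohen_kappa_score(human_scores, ai_scores))

In practice, the human-in-the-loop batches described in the abstract would sit between these steps: a human reviews a calibration batch, the prompt or exemplars are adjusted, and only then is the protocol applied to the next batch.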

Authors