This study examined the reliability and efficiency of GPT-4o as a scorer of middle and high school students’ written evaluations of scientific evidence. Using a rubric-anchored prompt, exemplar calibration, and a self-consistency protocol, the AI model achieved near-human interrater agreement (κ = .731–.754) across three interactive scoring batches. Performance collapsed (κ = –.051) when we decontextualized the prompts and removed human oversight. Compared to human scoring, the AI workflow reduced total labor hours, though transcription time offset some of these gains. We conducted all scoring using a no-code interface, demonstrating that non-technical educators and researchers can access valid and efficient LLM-based assessment. Findings highlight the promise—and limits—of deploying LLMs for automated writing assessment in authentic classroom contexts.
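The study itself used a no-code interface, but for readers who want a programmatic picture of the techniques named above, the sketch below shows one plausible way to combine a rubric-anchored prompt, calibration exemplars, and a self-consistency (majority-vote) protocol, with agreement checked via Cohen's kappa. The rubric text, exemplars, score range, and sample count are placeholders, not the authors' materials.

```python
"""
Illustrative sketch only -- not the authors' actual workflow, which was
run through a no-code interface. Assumes the OpenAI Python SDK and
scikit-learn; the rubric, exemplars, and 0-3 score range are invented
placeholders.
"""
from collections import Counter

from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC_PROMPT = """You are scoring a student's written evaluation of scientific
evidence on a 0-3 scale.
Rubric anchors: ... (level descriptions would go here).
Calibration exemplars: ... (scored example responses would go here).
Reply with the integer score only."""


def score_once(response_text: str) -> int:
    """One rubric-anchored scoring call to GPT-4o."""
    reply = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.7,  # sampling variation is what self-consistency aggregates over
        messages=[
            {"role": "system", "content": RUBRIC_PROMPT},
            {"role": "user", "content": response_text},
        ],
    )
    return int(reply.choices[0].message.content.strip())


def score_self_consistent(response_text: str, n_samples: int = 5) -> int:
    """Self-consistency protocol: sample several scores, keep the majority vote."""
    votes = [score_once(response_text) for _ in range(n_samples)]
    return Counter(votes).most_common(1)[0][0]


def interrater_agreement(human_scores, ai_scores) -> float:
    """Cohen's kappa between human and AI scores for a scoring batch."""
    return cohen_kappa_score(human_scores, ai_scores)
```

In this kind of setup, the majority vote over repeated samples is what damps the run-to-run variability of a single LLM call, and kappa against a human-scored batch is the check that keeps a person in the loop, consistent with the finding that agreement collapsed once human oversight and context were removed.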