Paper Summary

Human-AI Co-Grading of Open-Ended Responses to STEM Transfer Questions (Stage 1, 2:17 PM)

Wed, April 8, 1:45 to 3:15pm PDT, Los Angeles Convention Center, Floor: Level One, Exhibit Hall A - Stage 1

Abstract

Assessing learning transfer (students’ ability to apply knowledge in new contexts) is vital but difficult to measure, especially through open-ended questions that require nuanced reasoning. This study investigates how large language models (LLMs), specifically GPT-4o, can support scoring of open-ended science transfer responses. Using 408 student responses to high school biology questions, we implemented a human-AI co-grading pipeline, identified strengths and limitations of LLM-based scoring, and proposed techniques for calibrating models to improve overall accuracy. While LLMs scored surface-level idea breadth reliably, assessing reasoning depth required multiple iterations to achieve sufficient interrater reliability with a human grader. Our findings suggest a practical workflow for calibrating LLMs to improve scoring accuracy and inform future research on automated assessment at scale.
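
The sketch below illustrates one possible pass of such a co-grading loop: an LLM assigns a rubric score to each response, and agreement with the human grader is checked before deciding whether another calibration iteration (e.g., a rubric or prompt revision) is needed. It assumes the OpenAI Python SDK and scikit-learn; the rubric wording, 0-3 score range, and agreement metric (quadratic-weighted Cohen's kappa) are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of one LLM scoring pass plus an interrater-reliability check.
# Assumptions (not from the paper): OpenAI Python SDK, a 0-3 integer rubric,
# and quadratic-weighted Cohen's kappa as the agreement metric.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Score the response from 0 to 3 for reasoning depth:
0 = no relevant reasoning, 1 = restates facts, 2 = partial causal chain,
3 = complete causal chain applied to the new context. Reply with the number only."""

def llm_score(question: str, response: str) -> int:
    """Ask GPT-4o for a single rubric score for one student response."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep scoring as deterministic as possible
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\nStudent response: {response}"},
        ],
    )
    return int(completion.choices[0].message.content.strip())

def agreement(human_scores: list[int], llm_scores: list[int]) -> float:
    """Quadratic-weighted kappa between human and LLM scores; a low value
    signals that another calibration iteration is needed."""
    return cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
```

In a workflow of this kind, the loop terminates once the kappa on a held-out set of human-scored responses clears a preset threshold; the threshold and the parsing of the model's reply would need to be set for the specific rubric in use.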

Authors