Assessing learning transfer (students' ability to apply knowledge in new contexts) is vital but difficult to measure, especially through open-ended questions that require nuanced reasoning. This study investigates how large language models (LLMs), specifically GPT-4o, can support scoring of open-ended science transfer responses. Using 408 student responses to high school biology questions, we implemented a human-AI co-grading pipeline, identified strengths and limitations of LLM-based scoring, and proposed techniques for calibrating models to improve overall accuracy. While LLMs scored surface-level idea breadth reliably, assessing reasoning depth required multiple iterations to achieve sufficient interrater reliability with a human grader. Our findings suggest a practical workflow for calibrating LLMs to improve scoring accuracy and inform future research on automated assessment at scale.
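The abstract does not specify the calibration procedure in detail; the following is a minimal Python sketch of what such a human-AI co-grading calibration loop could look like, assuming a hypothetical score_with_llm() helper that sends a rubric-based prompt to GPT-4o and a hypothetical revise_rubric() step done by the researcher. Quadratic weighted kappa is used here as one common interrater-reliability metric for ordinal rubric scores; the paper's actual metric and thresholds may differ.

```python
# Illustrative sketch only: score_with_llm() and revise_rubric() are
# hypothetical helpers, and the target agreement threshold is an assumption.
from sklearn.metrics import cohen_kappa_score

def calibrate(responses, human_scores, rubric, max_rounds=3, target_kappa=0.7):
    """Iteratively refine the rubric/prompt until LLM scores reach
    acceptable agreement with the human grader."""
    for round_num in range(max_rounds):
        # Score every student response with the LLM under the current rubric.
        llm_scores = [score_with_llm(r, rubric) for r in responses]
        # Quadratic weighted kappa penalizes larger ordinal disagreements more.
        kappa = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
        print(f"round {round_num}: QWK = {kappa:.2f}")
        if kappa >= target_kappa:
            break
        # Inspect disagreements and revise the rubric/prompt by hand,
        # e.g., adding clarifying examples for reasoning-depth criteria.
        rubric = revise_rubric(rubric, responses, human_scores, llm_scores)
    return rubric, kappa
```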