Paper Summary

LLM (Large Language Models)-Based Automated Grading for Constructed Responses in Science Inquiry

Fri, April 25, 8:00 to 9:30am MDT, The Colorado Convention Center, Floor: Terrace Level, Bluebird Ballroom Room 3C

Abstract

The present study assesses the effectiveness of GPT-4.0 in grading short-answer responses from the PISA 2018 science assessment using comprehensive rubrics and prompts, compared with human raters. Intraclass Correlation Coefficients (ICCs) were used to measure agreement between GPT-4.0 and human scores for 390 responses across 13 items; GPT-4.0 achieved a high ICC of 0.91, reflecting strong alignment with human grading. Although some misclassification occurred, particularly between partial- and full-credit responses, GPT-4.0 demonstrates promise for reliable automated grading. This research highlights the potential of LLMs to improve grading efficiency, reduce subjective bias, and provide real-time feedback, advancing the use of AI in educational assessments and encouraging further exploration of AI tools.
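For readers unfamiliar with the agreement metric, below is a minimal sketch of how an ICC can be computed for a human-vs-model rating comparison. The abstract does not specify which ICC form the authors used; this sketch assumes the common two-way random-effects, single-rater, absolute-agreement variant, ICC(2,1), and the `ratings` data are purely illustrative, not from the study.

```python
import numpy as np

def icc2_1(scores):
    """Two-way random-effects, single-rater, absolute-agreement ICC(2,1).

    scores: (n_subjects, n_raters) array of ratings.
    """
    X = np.asarray(scores, dtype=float)
    n, k = X.shape
    grand = X.mean()
    row_means = X.mean(axis=1)   # per-response means
    col_means = X.mean(axis=0)   # per-rater means

    # Two-way ANOVA sums of squares
    ss_total = ((X - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between responses
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between raters
    ss_error = ss_total - ss_rows - ss_cols          # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical example: one human rater vs. one model rater on a 0/1/2 rubric
ratings = [[0, 0], [1, 1], [2, 2], [0, 1]]
print(round(icc2_1(ratings), 2))  # 0.84
```

An ICC near 1 indicates that the model's scores track human scores closely in absolute terms, not merely in rank order, which is why ICC is a stricter agreement check than a plain correlation for grading tasks.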

Authors