The present study assesses the effectiveness of GPT-4.0, guided by comprehensive rubrics and prompts, in grading short-answer responses from the PISA 2018 science assessment relative to human raters. Intraclass Correlation Coefficients (ICCs) were used to measure agreement between GPT-4.0 and human scores for 390 responses to 13 items, with GPT-4.0 achieving a high ICC of 0.91, reflecting strong alignment with human grading. Although some misclassification occurred, particularly between partial- and full-credit responses, GPT-4.0 demonstrates promise for reliable automated grading. This research highlights the potential of LLMs to improve grading efficiency, reduce subjective bias, and provide real-time feedback, advancing the use of AI in educational assessments and encouraging further exploration of AI tools.
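As a minimal sketch of how the rater-agreement analysis described above might be computed, the snippet below calculates ICCs between a human score and an AI score in long format. The toy data, column names, and the choice of the pingouin library are illustrative assumptions, not details drawn from the study.

```python
import pandas as pd
import pingouin as pg  # assumption: pingouin used for ICC; the study does not name a tool

# Hypothetical long-format data: one row per (response, rater) pair,
# where each response is scored once by a human and once by the model.
df = pd.DataFrame({
    "response_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "rater":       ["human", "gpt4", "human", "gpt4",
                    "human", "gpt4", "human", "gpt4"],
    "score":       [2, 2, 1, 0, 2, 2, 0, 0],  # e.g., 0 = no, 1 = partial, 2 = full credit
})

# Compute all ICC variants; ICC2 (two-way random, absolute agreement) is a
# common choice when comparing an automated rater against a human rater.
icc = pg.intraclass_corr(data=df, targets="response_id",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```

In practice the data frame would hold all 390 responses, and one would report the ICC variant appropriate to the rating design along with its confidence interval.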