Paper Summary

Can We Trust AI as a Score Rater? A Preliminary Study on the Reliability and Validity of a GenAI Model’s Assessment of Learning Performance Data

Sun, April 12, 7:45 to 9:15am PDT, JW Marriott Los Angeles L.A. LIVE, Floor: Ground Floor, Gold 4

Abstract

This preliminary study explored the reliability and validity of Google Gemini 2.5 Pro, a Generative AI (GenAI) model, in assessing learning performance. Using a 15-point rubric, we compared AI-generated scores with the automated ratings produced by the VR system for transcripts of 77 nursing students' virtual reality patient-encounter training sessions. Two rounds of AI scoring were conducted with different prompts. While average scores were comparable across raters (ANOVA, p > 0.05), inter-rater reliability between the VR system and the AI was low (Cohen's kappa: 0.20 and 0.37). Criterion-related validity was moderate (Pearson's r: 0.65 and 0.77). Notably, prompting the AI to provide a detailed rationale improved both reliability and validity, suggesting that prompt engineering is crucial for enhancing GenAI's assessment accuracy.
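To illustrate the three reported statistics, the following Python sketch computes them on synthetic score arrays. This is a hypothetical reconstruction: the abstract does not specify the analysis software, and the data, variable names, and random-noise model here are assumptions, not the study's actual pipeline.

```python
# Hypothetical sketch of the reported analyses; the score arrays below are
# synthetic stand-ins, not the study's data.
import numpy as np
from scipy.stats import f_oneway, pearsonr
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(42)
vr_scores = rng.integers(0, 16, size=77)            # VR system's automated ratings (0-15 rubric)
ai_scores = np.clip(vr_scores + rng.integers(-3, 4, 77), 0, 15)  # one round of AI scoring

# Mean-level comparison across raters (the study found p > 0.05)
_, anova_p = f_oneway(vr_scores, ai_scores)

# Inter-rater reliability: exact agreement on the discrete rubric scores
# (weights="quadratic" would yield a weighted kappa for ordinal data)
kappa = cohen_kappa_score(vr_scores, ai_scores)

# Criterion-related validity: correlation with the VR system's ratings
r, _ = pearsonr(vr_scores, ai_scores)

print(f"ANOVA p = {anova_p:.3f}, Cohen's kappa = {kappa:.2f}, Pearson's r = {r:.2f}")
```

The pattern the study reports, low kappa alongside moderate r, is consistent with the AI ranking students similarly to the VR system while disagreeing on exact rubric scores, since kappa rewards only exact agreement whereas Pearson's r captures linear association.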

Authors