This preliminary study explored the reliability and validity of Google Gemini 2.5 Pro, a Generative AI (GenAI) model, in assessing learning performance. Using a 15-point rubric, we compared AI-generated scores with the automated ratings produced by a virtual reality (VR) system for transcripts of 77 nursing students' VR patient encounter training sessions. Two rounds of AI scoring were conducted with different prompts. Although average scores were comparable across raters (ANOVA, p > 0.05), inter-rater reliability between the VR system and the AI was low (Cohen's Kappa: 0.20 and 0.37), while criterion-related validity was moderate (Pearson's r: 0.65 and 0.77). Notably, prompting the AI to provide a detailed rationale improved both reliability and validity, suggesting that prompt engineering is crucial for enhancing GenAI's assessment accuracy.
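As a minimal sketch of the reported analyses (not the authors' actual code), the following Python snippet shows how the three statistics in the abstract could be computed with scipy and scikit-learn. The score arrays here are simulated placeholders for the VR system's ratings and the two AI scoring rounds; all variable names and data are illustrative assumptions.

# Sketch of the abstract's reliability/validity checks, assuming each rater
# produces an integer total on the 15-point rubric for all 77 transcripts.
# The data below are simulated stand-ins, not the study's actual scores.
import numpy as np
from scipy.stats import f_oneway, pearsonr
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical rubric totals (0-15) for 77 transcripts from three raters:
# the VR system and two AI scoring rounds with different prompts.
vr_scores = rng.integers(6, 16, size=77)
ai_round1 = np.clip(vr_scores + rng.integers(-3, 4, size=77), 0, 15)
ai_round2 = np.clip(vr_scores + rng.integers(-2, 3, size=77), 0, 15)

# ANOVA: are mean scores comparable across the three raters?
f_stat, p_value = f_oneway(vr_scores, ai_round1, ai_round2)

# Inter-rater reliability: Cohen's Kappa between the VR system and each AI round.
kappa1 = cohen_kappa_score(vr_scores, ai_round1)
kappa2 = cohen_kappa_score(vr_scores, ai_round2)

# Criterion-related validity: Pearson's r against the VR system's ratings.
r1, _ = pearsonr(vr_scores, ai_round1)
r2, _ = pearsonr(vr_scores, ai_round2)

print(f"ANOVA p = {p_value:.3f}")
print(f"Cohen's Kappa (round 1, round 2) = {kappa1:.2f}, {kappa2:.2f}")
print(f"Pearson's r (round 1, round 2) = {r1:.2f}, {r2:.2f}")

Note that Cohen's Kappa treats each rubric total as a categorical label and penalizes any disagreement equally, whereas Pearson's r rewards scores that rise and fall together; this difference is one plausible reason the abstract can report low Kappa alongside moderate correlation.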
Hao He, Emporia State University
Xinhao Xu, University of Missouri
Jhon Bueno-Vesga, Pennsylvania State University
Yupei Duan, University of Missouri
Shangman Li, University of Missouri
Yuanyuan Gu, University of Missouri
Sue Yun Fowler, University of Missouri
Hillary L. Claunch, University of Missouri
Jason Snyder, University of Missouri