The use of artificial intelligence (AI) to score responses is growing in popularity and is likely to keep increasing. Evidence for the validity of AI scores relies on quadratic weighted kappa (QWK) to demonstrate agreement with human ratings, but QWK has several known shortcomings. The proportional reduction in mean squared error (PRMSE) instead estimates how accurately the automated scoring model predicts the human true scores. Analysis of operational test data demonstrates that QWK and PRMSE can lead to different conclusions about AI scores. Results of an extensive simulation study show that PRMSE is robust to many of the factors to which QWK is sensitive, and that it should replace or complement QWK for evaluating AI scores. We also investigate sample size requirements for accurate estimation of PRMSE.
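To make the two statistics concrete, here is a minimal sketch of how each might be computed. It is an illustration under stated assumptions, not necessarily the estimator used in the study: the function names are hypothetical, every response is assumed to have exactly two human ratings (`h1`, `h2`), and rater errors are assumed independent with equal variance, so that the single-rating error variance can be estimated from rater disagreement.

```python
from statistics import fmean


def quadratic_weighted_kappa(a, b, min_score, max_score):
    """QWK between two lists of integer scores on [min_score, max_score]."""
    k = max_score - min_score + 1
    n = len(a)
    # Observed joint counts of (a, b) score pairs.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected from the marginals
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2  # the usual 1/(k-1)^2 factor cancels in the ratio
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n
    # Undefined (division by zero) if both raters use a single category.
    return 1.0 - num / den


def prmse_two_raters(machine, h1, h2):
    """Simplified PRMSE sketch; ASSUMES two human ratings per response."""
    n = len(machine)
    h_bar = [(x + y) / 2 for x, y in zip(h1, h2)]
    # Error variance of a single human rating, from rater disagreement.
    err_var = fmean((x - y) ** 2 for x, y in zip(h1, h2)) / 2
    # The mean of two ratings carries error variance err_var / 2.
    mean_h = fmean(h_bar)
    var_h_bar = sum((h - mean_h) ** 2 for h in h_bar) / (n - 1)
    true_var = var_h_bar - err_var / 2
    # MSE against the unobserved true score, corrected for rater error.
    mse_vs_h_bar = fmean((m - h) ** 2 for m, h in zip(machine, h_bar))
    mse_vs_true = mse_vs_h_bar - err_var / 2
    return 1.0 - mse_vs_true / true_var
```

In this sketch QWK compares the machine score with a single observed human rating, whereas PRMSE first removes the estimated human rating error from both the numerator and the denominator. That difference is one reason the two statistics can disagree. In small samples the corrected estimate can fall outside the interval from 0 to 1, which is relevant to the sample size question raised above.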