Researchers have sought for decades to automate holistic essay scoring in order to reduce the burden on teachers and provide timely feedback to students. Over the years, these programs have improved significantly and can reach substantial agreement with human raters. However, such accuracy requires extensive training on human-scored texts, which reduces the expediency and usefulness of these programs for routine use by teachers across the nation on non-standardized prompts. This study analyzes the output of multiple versions of ChatGPT scoring secondary student essays from two extant corpora and compares it to high-quality human ratings. We find that the current iteration of ChatGPT fails to reach substantial agreement with human ratings.
Tamara Powell Tate, University of California - Irvine
Jacob Steiss, University of Missouri - St. Louis
Mark Warschauer, University of California - Irvine
Drew Bailey, University of California - Irvine
Daniel Ritchie, University of California - Irvine
Waverly Tseng, University of California - Irvine
Youngsun Moon, Stanford University
Steve Graham, Arizona State University