Using ChatGPT for Large-Scale Writing Project Scoring

Sun, April 14, 9:35 to 11:05am, Pennsylvania Convention Center, Floor: Level 100, Room 119B

Abstract

Purpose
Writing is an important skill that requires high-quality instruction and opportunities to practice with scaffolding and feedback informed by reliable writing assessment data. Typically, such data are generated by trained human raters, but this introduces subjectivity and heavy time and labor costs. Thus, we developed and tested a novel automated scoring model that leveraged ChatGPT to evaluate upper-elementary grade writing. We did so in the context of an efficacy evaluation of the We Write intervention. The purpose of this study was to compare ChatGPT and human scores and to identify areas for refining the automated scoring so that it aligns more closely with human scores.

Theoretical Framework
We Write integrates the self-regulated strategy development (SRSD) instructional framework (Harris et al., 2008) with multimedia learning theory (Mayer, 2014) to justify the instructional interactions in both teacher-led and web-based writing instruction. SRSD draws from multiple writing, motivation, and self-regulation theories to promote learning within six recursive stages of instruction (i.e., develop background knowledge, discuss it, model it, memorize it, support it, independent practice). Within the We Write intervention, the web-based activities are designed to keep students focused on the learning activities by minimizing distractions and to increase engagement with meaningful activities. Finally, this study draws from research indicating the importance of reliable feedback delivered efficiently (Shute, 2008) to accelerate the practice-feedback loop (Kellogg et al., 2010), hence our decision to explore automated scoring.

Methods
We developed an automated scoring model using ChatGPT, training the model to assign holistic scores that would mirror those of trained human raters. We then used correlational analyses, with graphical representations of the scores, to compare ChatGPT scores with human scores.
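A minimal sketch of what such a ChatGPT scoring call might look like is given below, assuming the OpenAI Python client, a GPT-4-class chat model, and a holistic 0-6 rubric delivered in the prompt; the exact model version, prompt wording, and any use of examples or fine-tuning are not specified in this abstract, so the names in the sketch are placeholders.

```python
# Sketch: prompting ChatGPT to assign a holistic score (0-6) to one essay.
# Assumptions: OpenAI Python client >= 1.0; model name and rubric text are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Score the student essay holistically from 0 (lowest) to 6 (highest), "
    "considering organization, development of ideas, and conventions. "
    "Respond with a single integer only."
)

def score_essay(essay_text: str, model: str = "gpt-4") -> int:
    """Return a holistic score for one upper-elementary essay."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep scoring as deterministic as possible
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": essay_text},
        ],
    )
    return int(response.choices[0].message.content.strip())
```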

Data
We used data from 2,200 students in Grades 4 and 5, collected as part of a larger efficacy evaluation of the We Write intervention in a diverse set of schools (e.g., over 90% of students eligible for free or reduced-price lunch and 60% minority students).

Results
Observed human scores ranged from 0 to 6, with high frequencies tending toward the middle score points (i.e., a centrality effect), while observed ChatGPT scores ranged from 0 to 5, with high frequencies tending toward the higher score points (see Figure 3). The scores were correlated at r = .66 and Spearman's ρ = .70, indicating high consistency in the rank ordering of essay quality. The single-score intraclass correlation was .441 [.396, .483]; the average-score intraclass correlation was .612 [.567, .652]. Human and ChatGPT scores demonstrated 26.9% exact agreement, 66.6% agreement within one point, and 94.8% agreement within two points. The unweighted kappa was .12 [.10, .15]; the weighted kappa was .51 [.36, .65].
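As an illustration, the reported correlations, agreement percentages, and kappas could be computed from two aligned score vectors along the lines sketched below; scipy and scikit-learn are assumed here, the weighting scheme for the weighted kappa is not stated in the abstract, and the intraclass correlations would come from a dedicated routine (e.g., pingouin's intraclass_corr) not shown.

```python
# Sketch: human-vs-ChatGPT agreement indices from two aligned score arrays.
# Library choices (scipy, scikit-learn) and the quadratic kappa weights are assumptions.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

def agreement_report(human, gpt) -> dict:
    """Summarize agreement between human and ChatGPT holistic scores."""
    human, gpt = np.asarray(human), np.asarray(gpt)
    diff = np.abs(human - gpt)
    return {
        "pearson_r": pearsonr(human, gpt)[0],
        "spearman_rho": spearmanr(human, gpt)[0],
        "exact_agreement": float(np.mean(diff == 0)),
        "within_one_point": float(np.mean(diff <= 1)),
        "within_two_points": float(np.mean(diff <= 2)),
        "kappa_unweighted": cohen_kappa_score(human, gpt),
        "kappa_weighted": cohen_kappa_score(human, gpt, weights="quadratic"),
    }
```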

Significance
This study illustrates how ChatGPT may be used to evaluate elementary student writing reliably, thus presenting a potential alternative to the time-consuming and subjective human scoring of essays. Moreover, by leveraging a large language model, the use of ChatGPT for evaluating writing provides an avenue to closely connect scoring and feedback. Refinement of the ChatGPT scoring is necessary to improve alignment with human scores, particularly for striving writers.
