Objectives
Large language models (LLMs) have immense potential for understanding textual data, which allows researchers to use them for automated short-answer grading (ASAG). Recent research has shown promising results when using LLMs to grade student work in domains such as STEM (e.g., Xie et al., 2024). Existing work has primarily focused on evaluating the scoring accuracy of LLMs, using techniques that include prompting (Cohn et al., 2024), self-reflection (Xie et al., 2024), and fine-tuning (Latif et al., 2023). However, grading responses also requires recognizing the respondent’s underlying thought process in order to capture the nuances in the respondent’s knowledge.
Perspectives
In this paper, we propose an automated grading framework for capturing teachers’ thinking and knowledge in a nuanced way. Inspired by recent work on prompting with language feedback (Levi et al., 2024; Yang et al., 2023), we propose a model that iteratively tunes a set of grading guidelines through a multi-agent Explain-and-Refine module to learn the divergences in teachers’ explanations. Starting from expert-designed grading guidelines and a labeled training dataset of teacher responses, the Explainer collaborates with the Refiner as follows: the Explainer annotates responses without knowledge of the labels; when the Explainer is incorrect, the Refiner reflects on its mistakes and, based on the labels, devises rules that guide the Explainer toward the correct classification. This cycle repeats until a threshold accuracy or the maximum number of rounds is reached. The Explain-and-Refine module has two modes: the human-in-the-loop mode, in which humans review the rules generated after each iteration, and the full-automation mode (Pryzant et al., 2023), which relies solely on the LLM’s self-learning and reflection abilities. Both modes are explored in our framework.
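A minimal sketch of the Explain-and-Refine loop described above is given below. The prompt wording, the call_explainer/call_refiner wrappers around an LLM API, the stopping parameters, and the review hook are hypothetical illustrations of one way the cycle could be organized, not the authors’ implementation.

```python
# Illustrative sketch of the Explain-and-Refine loop (both modes).
# call_explainer / call_refiner, the prompt wording, and the stopping
# parameters are hypothetical placeholders, not the authors' implementation.

def explain(guidelines, response, call_explainer):
    """Explainer: score one response with the current guidelines (label hidden)."""
    prompt = (
        "You are grading a teacher's open-text response on a 3-point scale.\n"
        f"Grading guidelines:\n{guidelines}\n\n"
        f"Response:\n{response}\n\n"
        "Return the score (0, 1, or 2) and a brief explanation."
    )
    return call_explainer(prompt)  # assumed to return (predicted_score, explanation)


def refine(guidelines, mistakes, call_refiner):
    """Refiner: reflect on the Explainer's errors and propose revised rules."""
    error_report = "\n\n".join(
        f"Response: {m['response']}\n"
        f"Predicted: {m['pred']} -- {m['explanation']}\n"
        f"Correct label: {m['label']}"
        for m in mistakes
    )
    prompt = (
        "The grading guidelines below produced the errors shown. "
        "Revise the guidelines (add or edit rules) so the grader would reach "
        f"the correct labels.\n\nGuidelines:\n{guidelines}\n\nErrors:\n{error_report}"
    )
    return call_refiner(prompt)  # assumed to return the revised guidelines text


def explain_and_refine(guidelines, train_set, call_explainer, call_refiner,
                       target_acc=0.9, max_rounds=10, review_fn=None):
    """Iterate until the Explainer hits the accuracy threshold or rounds run out.

    train_set: list of {"response": str, "label": int}.
    review_fn: optional human-in-the-loop hook that inspects/edits the refined
    guidelines after each round; passing None gives the full-automation mode.
    """
    for _ in range(max_rounds):
        mistakes, correct = [], 0
        for item in train_set:
            pred, explanation = explain(guidelines, item["response"], call_explainer)
            if pred == item["label"]:
                correct += 1
            else:
                mistakes.append({"response": item["response"], "pred": pred,
                                 "explanation": explanation, "label": item["label"]})
        if correct / len(train_set) >= target_acc:
            break
        guidelines = refine(guidelines, mistakes, call_refiner)
        if review_fn is not None:  # human-in-the-loop mode
            guidelines = review_fn(guidelines)
    return guidelines
```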
Methods and Data Sources
Our experimental dataset was collected from more than 200 middle school mathematics teachers who participated in a study piloting a measure to assess teachers’ content and pedagogical content knowledge of ratios and proportional relationships. The measure included open-text responses, which were used in this study. Each response was coded by two human raters on a 3-point scale. We compared the performance of our framework with baselines (Liu et al., 2019; Reimers et al., 2019; Pryzant et al., 2023). The model used for experimentation was GPT-4 (OpenAI, 2024).
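Since each response carries human codes on a 3-point scale, any grader in the comparison can be scored against those codes. The helper below is a minimal sketch of exact-agreement accuracy under that setup; the field names and the grade_fn interface are illustrative assumptions, and the paper's own evaluation metric may differ.

```python
# Hypothetical evaluation helper: exact agreement between a grader's scores
# and the human codes on the 3-point scale (field names are illustrative).

def exact_agreement(items, grade_fn):
    """items: list of {"response": str, "human_score": int in {0, 1, 2}};
    grade_fn maps a response string to a predicted score."""
    hits = sum(grade_fn(item["response"]) == item["human_score"] for item in items)
    return hits / len(items)
```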
Results
As shown in Table 1, our proposed framework outperformed the baselines on three questions selected from different knowledge domains. Specifically, the human-in-the-loop guidelines led to the highest performance when used for grading with 1 or 3 Graders (Li et al., 2024). The guidelines generated under the full-automation mode had marginally lower performance but still surpassed the baseline ProTeGi model, with higher stability and accuracy (Figure 1), both of which are essential for actual grading tasks, where labeling data may not be feasible.
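We read "grading with 1 or 3 Graders" as scoring each response with either a single Grader call or three independent calls aggregated by majority vote; the sketch below, which reuses the explain function from the earlier sketch, illustrates that aggregation under this assumption rather than the authors' exact procedure.

```python
# Sketch of the assumed 3-Grader setting: three independent Grader calls
# aggregated by majority vote (ties fall back to the first vote).
from collections import Counter

def grade_by_vote(response, guidelines, call_explainer, n_graders=3):
    votes = [explain(guidelines, response, call_explainer)[0]
             for _ in range(n_graders)]
    score, count = Counter(votes).most_common(1)[0]
    return score if count > 1 else votes[0]
```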
Significance
In this paper, we proposed an ASAG framework that aligns grading guidelines with the assessment objectives and respondents’ knowledge. The automated aspect of this model substantially reduces the human burden of modifying the guidelines and annotating the data samples used for grading, while demonstrating reliable accuracy and knowledge capture.