Paper Summary

Measurement of Critical Thinking: The quality of LLM-based essay coding of Performance Assessment Tasks

Sun, April 12, 9:45 to 11:15am PDT, InterContinental Los Angeles Downtown, Floor: 7th Floor, Hollywood Ballroom I

Abstract

1. Objective
The current work documents progress in the performance-based assessment of critical thinking (CT) toward an automated essay coding (AEC) system based on a multi-shot LLM approach.
2. Theoretical Framework
The reconceptualization of CT as a “21st century skill” (Braun et al., 2020) led to an effort to develop valid measures (Shavelson et al., 2019). Performance Assessment (PA) tasks that use essays as the data source have prevailed as the most accepted format within the iPAL Consortium (Shavelson et al., 2018).
Reliable coding of PA task essays has been a challenge because coder training is time-consuming compared with conventional CT measures. Ebright-Jones and Cortina (2025) presented a PA that reduces the complexity of the task and simplifies the coding strategy: each task consists of 7 to 8 authentic documents that contain as many arguments in favor of a proposal as in opposition to it. Trained coders identify the arguments an essay engages with and, among other details, whether the trustworthiness of the sources is reflected. Interrater reliability averages Cohen’s kappa = .75; while this is considered high, it still introduces a substantial amount of error variance. Based on observations during coder training, the main coder weakness is missing arguments. We therefore developed an LLM-based approach to address this weakness.
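To illustrate the reliability computation, the following sketch computes Cohen’s kappa from two coders’ binary presence codes for the same essay-argument pairs; the variable names and values are hypothetical placeholders, not the study’s data.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical binary codes (1 = argument identified, 0 = not identified)
    # for the same essay-argument pairs from two independent coders.
    coder_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    coder_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

    print(f"Cohen's kappa = {cohen_kappa_score(coder_a, coder_b):.2f}")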
3. Methods
Unlike other LLMs, Claude.ai (Sonnet 4) can utilize user-supplied supplementary data files without integrating them into its information corpus. For our PA task, we compiled a spreadsheet that included, for each of the 30 arguments across the documents, sample phrases identified by 5 independent coders from 213 previously collected essays.
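As a rough sketch of how such a reference table could be prepared programmatically (in the study it was supplied as a data file in the Claude.ai interface), the file name and column names below are assumptions:

    import pandas as pd

    # Hypothetical reference spreadsheet: one row per argument label
    # (-15 to +15, plus R and T), with sample phrases from past essays.
    reference = pd.read_csv("argument_reference.csv")  # assumed columns: label, sample_phrases

    # Serialize the table so it can accompany the coding prompt as text.
    reference_text = reference.to_csv(index=False)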
4. Data Sources
We entered 25 new essays from the latest data collection (fall 2024), which had first been coded traditionally by one or two trained coders. The following prompt was given to Claude:
“In the word document, arguments are labeled as R, T, -15 to +15. Which arguments does the following essay use? If a statement fits more than one label, use the one that fits best.” Note that R and T refer to explicit mention of a document source (R) and a reflection of the trustworthiness of a source (T).
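A minimal sketch of how this coding step could be reproduced through the Anthropic API rather than the Claude.ai interface is given below; the model identifier, the reference_text variable from the previous sketch, and the label-extraction regex are assumptions, not the study’s procedure.

    import re
    import anthropic

    PROMPT = ("In the word document, arguments are labeled as R, T, -15 to +15. "
              "Which arguments does the following essay use? "
              "If a statement fits more than one label, use the one that fits best.")

    def code_essay(essay_text, reference_text, model="claude-sonnet-4-20250514"):
        """Ask the model which argument labels an essay uses; return the set of labels."""
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        response = client.messages.create(
            model=model,  # assumed identifier for Claude Sonnet 4
            max_tokens=1024,
            messages=[{"role": "user",
                       "content": f"{reference_text}\n\n{PROMPT}\n\nEssay:\n{essay_text}"}],
        )
        answer = response.content[0].text
        # Pull out labels such as R, T, +7, or -12 from the model's answer.
        return set(re.findall(r"(?<![\w+-])([+-]\d{1,2}|R|T)(?!\w)", answer))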
5. Results
We analyzed the number of positive and negative arguments identified by the human coders and Claude. As predicted, the LLM consistently identified more arguments than the human coders. Broken down by individual arguments, the human coders and the LLM are very consistent in which arguments they identified, with human coders occasionally missing a code. Consistent with this observation, the correlation of the LLM with the higher score of the two human coders (for positive and negative arguments separately) exceeds each correlation of the LLM with an individual coder, which, in turn, exceed the correlation between the two coders (traditional interrater reliability).
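The comparison described above could be computed along the following lines, assuming per-essay argument counts for each coder and the LLM are stored in a CSV with the hypothetical column names shown (negative arguments analogous):

    import pandas as pd

    # Hypothetical file of per-essay counts of identified positive arguments.
    df = pd.read_csv("essay_argument_counts.csv")  # assumed columns: coder1_pos, coder2_pos, llm_pos

    # Higher of the two human coders' counts, per essay.
    df["max_human_pos"] = df[["coder1_pos", "coder2_pos"]].max(axis=1)

    correlations = {
        "LLM vs coder 1":       df["llm_pos"].corr(df["coder1_pos"]),
        "LLM vs coder 2":       df["llm_pos"].corr(df["coder2_pos"]),
        "LLM vs max of coders": df["llm_pos"].corr(df["max_human_pos"]),
        "coder 1 vs coder 2":   df["coder1_pos"].corr(df["coder2_pos"]),
    }
    for name, r in correlations.items():
        print(f"{name}: r = {r:.2f}")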
6. Significance
We conclude that LLM-based coding is a promising alternative to human coding of PA essays when the universe of arguments for the task is known.