Paper Summary

The Three-Body Solution: Evaluating Reliability of Large Language Models for Coding Open-Ended Responses

Thu, April 24, 9:50 to 11:20am MDT (9:50 to 11:20am MDT), The Colorado Convention Center, Floor: Meeting Room Level, Room 703

Abstract

This study introduces a "three-body solution" for evaluating the reliability of Large Language Models (LLMs) in coding open-ended survey responses. The inter-rater reliability (IRR) between two human coders served as the benchmark, against which the IRR between the LLM and each human coder was compared. Applied to 300 randomly sampled responses from a post-school outcomes survey, this approach yielded human-LLM IRR alphas (0.75 and 0.79) comparable to the human-human IRR (0.74). Results also suggest that assigning the LLM a researcher role in the prompt improved reliability. LLMs promise reduced labor costs in content coding, but like any measurement tool they must be evaluated for reliability, validity, and bias.
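The abstract reports IRR "alphas" for each coder pair; assuming these refer to Krippendorff's alpha on nominal (categorical) codes with two coders per comparison and no missing data, the pairwise computation can be sketched as follows. The function name and the toy code values are illustrative, not from the paper.

```python
from collections import Counter

def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for two coders, nominal codes, no missing data.

    A sketch of one pairwise comparison in the three-body design:
    human vs. human, then LLM vs. each human, using the same statistic.
    """
    assert len(coder_a) == len(coder_b), "coders must rate the same units"
    # Coincidence matrix: each unit contributes the ordered value pairs
    # (a, b) and (b, a), each weighted by 1 / (m - 1) = 1 for m = 2 coders.
    o = Counter()
    for a, b in zip(coder_a, coder_b):
        o[(a, b)] += 1
        o[(b, a)] += 1
    values = {v for pair in o for v in pair}
    # Marginal totals n_c (row sums of the coincidence matrix)
    n_c = {c: sum(o[(c, k)] for k in values) for c in values}
    n = sum(n_c.values())  # equals 2 * number of coded units
    # Observed disagreement: off-diagonal mass of the coincidence matrix
    d_o = sum(cnt for (c, k), cnt in o.items() if c != k) / n
    # Expected disagreement under chance pairing of all recorded values
    d_e = sum(n_c[c] * n_c[k]
              for c in values for k in values if c != k) / (n * (n - 1))
    return 1.0 - d_o / d_e
```

With three sets of codes (two humans and the LLM), calling this function on each of the three pairs reproduces the paper's comparison: if the LLM-human alphas are at or above the human-human alpha, the LLM performs on par with a human coder by this criterion.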

Authors