This study introduces a "three-body solution" for evaluating the reliability of Large Language Models (LLMs) in coding open-ended responses. The inter-rater reliability (IRR) between two human coders was used as a standard and then compared to the IRR between the LLM and each human coder. When this approach was applied to 300 randomly sampled responses from a post-school outcomes survey, the human-LLM IRR alphas (0.75 and 0.79) were comparable to the IRR between the two human coders (0.74). In addition, results suggest that assigning the LLM a researcher role in the prompt improved reliability. LLMs promise reduced labor costs in content coding, but, like any measurement tool, they need to be evaluated for reliability, validity, and bias.
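
The pairwise comparison at the heart of the "three-body" approach can be sketched in code. The snippet below is a minimal illustration, not the study's actual analysis: it assumes nominal codes, the third-party `krippendorff` Python package for computing alpha, and invented category labels and data solely for demonstration.

```python
# Hypothetical sketch of the "three-body" reliability check:
# one human-human alpha as the benchmark, two LLM-human alphas for comparison.
import numpy as np
import krippendorff


def pairwise_alpha(codes_a, codes_b):
    """Krippendorff's alpha for two coders rating the same units (nominal codes)."""
    categories = sorted(set(codes_a) | set(codes_b))
    to_int = {c: i for i, c in enumerate(categories)}
    data = np.array([[to_int[c] for c in codes_a],
                     [to_int[c] for c in codes_b]], dtype=float)
    return krippendorff.alpha(reliability_data=data,
                              level_of_measurement="nominal")


# Illustrative codes for a handful of open-ended survey responses (made up).
human_1 = ["employment", "education", "education", "other", "employment"]
human_2 = ["employment", "education", "other",     "other", "employment"]
llm     = ["employment", "education", "education", "other", "other"]

benchmark = pairwise_alpha(human_1, human_2)   # human-human standard
llm_vs_h1 = pairwise_alpha(llm, human_1)       # LLM vs. human coder 1
llm_vs_h2 = pairwise_alpha(llm, human_2)       # LLM vs. human coder 2
print(f"human-human alpha: {benchmark:.2f}")
print(f"LLM-human alphas:  {llm_vs_h1:.2f}, {llm_vs_h2:.2f}")
```

Under this framing, the LLM's coding is considered acceptably reliable when its alphas against each human coder are at or above the human-human benchmark.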