This study introduces a "three-body solution" for evaluating the reliability of Large Language Models (LLMs) in coding open-ended responses. The inter-rater reliability (IRR) between two human coders was used as a standard and then compared to the IRR between the LLM and each human coder. When this approach was applied to 300 randomly sampled responses from a post-school outcomes survey, the human-LLM IRR alphas (0.75 and 0.79) were comparable to the IRR between the two human coders (0.74). In addition, results suggest that assigning the LLM a researcher role in the prompt improved reliability. LLMs promise reduced labor costs in content coding, but, like any measurement tool, they need to be evaluated for reliability, validity, and bias.
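
The pairwise comparison at the heart of the "three-body" approach can be sketched in code. The snippet below is a minimal illustration, not the study's actual analysis: it assumes nominal codes, the third-party `krippendorff` Python package for computing alpha, and invented category labels and data solely for demonstration.

```python
# Hypothetical sketch of the "three-body" reliability check:
# one human-human alpha as the benchmark, two LLM-human alphas for comparison.
import numpy as np
import krippendorff


def pairwise_alpha(codes_a, codes_b):
    """Krippendorff's alpha for two coders rating the same units (nominal codes)."""
    categories = sorted(set(codes_a) | set(codes_b))
    to_int = {c: i for i, c in enumerate(categories)}
    data = np.array([[to_int[c] for c in codes_a],
                     [to_int[c] for c in codes_b]], dtype=float)
    return krippendorff.alpha(reliability_data=data,
                              level_of_measurement="nominal")


# Illustrative codes for a handful of open-ended survey responses (made up).
human_1 = ["employment", "education", "education", "other", "employment"]
human_2 = ["employment", "education", "other",     "other", "employment"]
llm     = ["employment", "education", "education", "other", "other"]

benchmark = pairwise_alpha(human_1, human_2)   # human-human standard
llm_vs_h1 = pairwise_alpha(llm, human_1)       # LLM vs. human coder 1
llm_vs_h2 = pairwise_alpha(llm, human_2)       # LLM vs. human coder 2
print(f"human-human alpha: {benchmark:.2f}")
print(f"LLM-human alphas:  {llm_vs_h1:.2f}, {llm_vs_h2:.2f}")
```

Under this framing, the LLM's coding is considered acceptably reliable when its alphas against each human coder are at or above the human-human benchmark.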