Paper Summary

Auditing Item Fairness With AI-Simulated Personas: A DIF Analysis of the CES-D Scale

Thu, April 9, 7:45 to 9:15am PDT, JW Marriott Los Angeles L.A. LIVE, Floor: Ground Floor, Gold 4

Abstract

This study explores how large language models (LLMs) can simulate survey responses to detect measurement bias in psychometric instruments. Using the CES-D depression scale as a case study, we prompted four proprietary LLMs (GPT-4o, GPT-3.5, Gemini, and Claude) to respond as demographically distinct personas. We evaluated differential item functioning (DIF) using the generalized partial credit model and visualized item response curves. While LLM responses revealed gender-based DIF, the patterns diverged from those found in human data. Further content and textual analyses suggested that item-level bias may stem more from sampling variance than from item wording. Our work demonstrates how generative AI can inform scale development and bias detection, offering a new methodological tool for educational and psychological research.
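As background for the model named above, a minimal sketch of the generalized partial credit model in standard notation (not drawn from the paper itself): the probability that respondent i with latent trait θ_i selects category k of item j is

\[
P(X_{ij} = k \mid \theta_i) \;=\;
\frac{\exp\!\left( \sum_{v=0}^{k} a_j \,(\theta_i - b_{jv}) \right)}
     {\sum_{c=0}^{m_j} \exp\!\left( \sum_{v=0}^{c} a_j \,(\theta_i - b_{jv}) \right)},
\qquad a_j(\theta_i - b_{j0}) \equiv 0,
\]

where a_j is the item slope (discrimination), b_{jv} are the step parameters for the m_j + 1 response categories, and the v = 0 term is fixed at zero by convention. DIF of the kind examined here corresponds to a_j or b_{jv} differing across demographic groups after respondents are matched on the latent trait θ.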

Author