This study explores how large language models (LLMs) can simulate survey responses to detect measurement bias in psychometric instruments. Using the CES-D depression scale as a case study, we prompted four proprietary LLMs (GPT-4o, GPT-3.5, Gemini, and Claude) to respond as demographically distinct personas. We evaluated differential item functioning (DIF) using the generalized partial credit model and visualized item response curves. While LLM responses revealed gender-based DIF, the patterns diverged from those found in human data. Further content and textual analyses suggested that item-level bias may stem more from sampling variance than from item wording. Our work demonstrates how generative AI can inform scale development and bias detection, offering a new methodological tool for educational and psychological research.
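The DIF-screening idea behind the study can be illustrated with a small simulation. The sketch below is hypothetical and not the authors' actual pipeline: it dichotomizes responses and uses a simple rest-score-stratified mean-difference statistic as a stand-in for a full generalized partial credit model analysis; the group sizes, item counts, and the injected 2-logit shift on item 0 are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two persona groups answer 10 dichotomized
# CES-D-style items under a Rasch-like response model.
n_per_group, n_items = 500, 10
theta_a = rng.normal(0.0, 1.0, n_per_group)   # latent severity, group A
theta_b = rng.normal(0.0, 1.0, n_per_group)   # latent severity, group B
difficulty = rng.normal(0.0, 1.0, n_items)    # item difficulties

# Inject DIF: item 0 is 2 logits harder for group B only.
dif_shift = np.zeros(n_items)
dif_shift[0] = 2.0

def simulate(theta, b):
    """Bernoulli responses from a 1PL (Rasch) model."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.random(p.shape) < p).astype(int)

resp_a = simulate(theta_a, difficulty)
resp_b = simulate(theta_b, difficulty + dif_shift)

def dif_statistic(ra, rb, item):
    """Mean response difference on `item`, matched on rest-score."""
    rest_a = ra.sum(axis=1) - ra[:, item]
    rest_b = rb.sum(axis=1) - rb[:, item]
    diffs, weights = [], []
    for s in range(n_items):  # possible rest-scores: 0 .. n_items - 1
        ma, mb = rest_a == s, rest_b == s
        if ma.sum() >= 5 and mb.sum() >= 5:  # skip sparse strata
            diffs.append(ra[ma, item].mean() - rb[mb, item].mean())
            weights.append(min(ma.sum(), mb.sum()))
    return np.average(diffs, weights=weights)

stats = [dif_statistic(resp_a, resp_b, i) for i in range(n_items)]
```

Because item 0 is harder only for group B, its statistic should be large and positive (group A endorses it more at matched rest-scores), while the unbiased items hover near zero; a GPCM-based analysis generalizes this matching to polytomous response categories.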