Objectives
This study explores the effectiveness of large language models (LLMs) in generating quantitative survey responses from qualitative interview data, addressing a persistent challenge in mixed methods designs: structurally aligning quantitative data (such as standardized Likert-scale responses) with qualitative data (such as open-ended interviews).
Framework
Mixed methods research leverages both qualitative and quantitative data but often faces challenges in integrating the two due to structural differences. Emerging LLMs offer new potential by translating qualitative insights into analyzable quantitative formats. One approach involves simulating synthetic personas: generalized user profiles built from real data (Li et al., 2025). Prompt engineering (Marvin et al., 2023) can guide LLMs by embedding qualitative cues (e.g., interview snippets) alongside quantitative elements (e.g., scale definitions, numeric anchors). This approach helps generate responses that reflect human nuance while aligning with measurement metrics, supporting comparability with human survey data. When provided with appropriate contextual inputs, LLMs can approximate population-level opinions (Yu et al., 2024), allowing researchers to model and predict human behavior at scale.
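As a rough illustration of this prompt-engineering idea, the sketch below assembles a persona-style prompt that pairs an interview excerpt with Likert-scale anchors. The helper name, scale anchors, and all strings are illustrative assumptions, not materials from the study.

```python
# Illustrative sketch only: pairing qualitative cues (an interview excerpt)
# with quantitative anchors (a Likert scale definition) in a single prompt.
# Every name and string here is hypothetical.

LIKERT_ANCHORS = "0 = not true for me, 2 = sometimes true for me, 4 = very true for me"

def build_persona_prompt(interview_excerpt: str, demographics: str, item: str) -> str:
    """Combine qualitative and quantitative elements into one survey prompt."""
    return (
        "You are simulating a survey respondent.\n"
        f"Background (from an interview): {interview_excerpt}\n"
        f"Demographics: {demographics}\n"
        f"Rate the statement below on a 0-4 scale ({LIKERT_ANCHORS}).\n"
        f"Statement: {item}\n"
        "Respond with a single integer."
    )

print(build_persona_prompt(
    interview_excerpt="I enjoy leading the activity blocks, but the paperwork drains me.",
    demographics="After-school program staff member, three years of experience",
    item="I exercise because it is fun.",
))
```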
Method
Using data from 19 after-school program staff who completed both the Behavioral Regulation in Exercise Questionnaire (BREQ) and individual interviews, a simulation was conducted to examine three leading LLMs (OpenAI's GPT-4.1, Google's Gemini 2.0 Flash, and Anthropic's Claude 3.7 Sonnet) under varying settings: two temperature levels (0 and 0.5), which modulate response randomness, and four prompt configurations differing in the amount of contextual and personal information provided (combinations of research context, personal interviews, and demographic details).
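A minimal sketch of this 3 x 2 x 4 simulation grid is shown below. The `query_llm` stub and configuration labels are placeholders standing in for whatever API clients and prompt variants the study actually used.

```python
# Hypothetical sketch of the simulation grid: 3 models x 2 temperatures
# x 4 prompt configurations. query_llm() is a stub, not a real API call.
import random
from itertools import product

MODELS = ["gpt-4.1", "gemini-2.0-flash", "claude-3.7-sonnet"]
TEMPERATURES = [0.0, 0.5]
PROMPT_CONFIGS = ["context_only", "context_demographics",
                  "context_interview", "context_interview_demographics"]

def query_llm(model: str, prompt: str, temperature: float) -> int:
    # Stand-in for an actual API call; returns a random 0-4 Likert rating.
    return random.randint(0, 4)

def run_simulation(prompts: dict) -> list:
    """Collect one simulated BREQ item rating per model/temperature/config cell."""
    results = []
    for model, temperature, config in product(MODELS, TEMPERATURES, PROMPT_CONFIGS):
        rating = query_llm(model, prompts[config], temperature)
        results.append({"model": model, "temperature": temperature,
                        "config": config, "rating": rating})
    return results

print(run_simulation({c: f"<prompt for {c}>" for c in PROMPT_CONFIGS}))
```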
Results
As shown in Figure 2.1 and Figure 2.2, the results indicated that LLMs generally replicated the average patterns of human responses well but showed significantly less variability, suggesting a tendency toward homogeneity compared to the diverse ways humans respond. Of the models, Claude demonstrated the closest alignment with human data, followed by GPT, with Gemini lagging. Prompts enriched with personal interview data, whether alone or with demographic information, yielded the strongest correlation with actual human responses (Pearson’s r between 0.5 and 0.73), outperforming prompts that included only the research background or demographic details. However, increasing the amount of demographic information did not substantially improve alignment beyond what was gained from interview data alone.
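For reference, the kind of alignment metric reported here (Pearson's r between human and LLM-generated responses) can be computed as in the sketch below; the values are invented placeholders, not the study's data.

```python
# Illustrative only: Pearson's r between human and LLM item-level means.
# The numbers below are made-up placeholders, not data from the study.
import numpy as np

human_item_means = np.array([3.1, 2.4, 1.8, 3.6, 2.9])
llm_item_means = np.array([3.0, 2.6, 2.0, 3.4, 3.0])

r = np.corrcoef(human_item_means, llm_item_means)[0, 1]
print(f"Pearson's r = {r:.2f}")
```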
Furthermore, when assessing aggregate test scores (the Relative Autonomy Index), none of the LLMs except Claude under information-rich prompt conditions reliably reproduced this composite measure, illustrating ongoing limitations in LLMs' ability to model complex psychometric constructs from qualitative context.
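For readers unfamiliar with the Relative Autonomy Index, it is a weighted composite of BREQ subscale scores; one commonly cited weighting (for BREQ-2-style scoring) is sketched below. The exact subscales and weights depend on the BREQ version used, so treat this as an illustrative assumption rather than the study's scoring rule.

```python
# Illustrative RAI computation using one commonly cited BREQ-2 weighting
# (amotivation -3, external -2, introjected -1, identified +2, intrinsic +3).
# The study's exact scoring may differ; the subscale means are placeholders.

RAI_WEIGHTS = {
    "amotivation": -3,
    "external": -2,
    "introjected": -1,
    "identified": 2,
    "intrinsic": 3,
}

def relative_autonomy_index(subscale_means: dict) -> float:
    """Weighted sum of subscale means; higher values indicate more autonomous motivation."""
    return sum(RAI_WEIGHTS[name] * score for name, score in subscale_means.items())

print(relative_autonomy_index(
    {"amotivation": 0.5, "external": 1.2, "introjected": 2.0,
     "identified": 3.1, "intrinsic": 3.4}
))
```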
Significance
The findings underscore the potential of LLMs, especially when guided with high-quality, personally relevant qualitative data, to enhance the integration of qualitative and quantitative insights in social science research. At the same time, they caution that current LLMs may overlook individual differences and struggle with nuanced emotional or aggregate-level interpretation. To maximize the benefit of LLMs in mixed methods research, careful attention must be paid to prompt design and the type of qualitative data provided, while recognizing that LLMs are not yet a full substitute for human interpretation where complexity and subtlety are essential.