Paper Summary

Variability in AI-Generated Personas: Method Comparison and Psychometric Assessment

Fri, April 10, 1:45 to 3:15pm PDT, InterContinental Los Angeles Downtown, 5th Floor, Wilshire Grand Ballroom I

Abstract

Objectives
Large language models (LLMs) have shown promise in generating synthetic survey responses informed by qualitative data. This study investigates response variability across persona generation methods, examining effects of temperature, persona count, and prompt specificity. It aims to enhance data reliability and inform best practices for AI-based data augmentation in quantitative research.
Framework
Personas are demographic and behavioral templates for generating synthetic survey responses with LLMs (Li et al., 2025). Repeated sampling per persona (Schuller et al., 2024) enhances statistical power while preserving latent construct integrity, making it valuable for psychometric augmentation. Persona-level repeated sampling approaches include: (1) multiple-conversation, repeating the same prompt multiple times per respondent to produce diverse conversational outputs; (2) single-conversation, generating responses using a single prompt administered to a group of synthetic personas simultaneously; and (3) probability sampling, drawing cases from the model’s token probability distribution derived from single-prompt conversation outputs.
LLMs generate text by sampling from a probability distribution, with a temperature parameter controlling output variability. Tuning temperature in persona generation balances individual consistency against group variation, revealing how sampling randomness drives synthetic response variability.
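
For illustration, the sketch below outlines the three persona-level sampling approaches and the role of temperature, assuming the OpenAI Python SDK; the model name, prompt wording, and helper functions are illustrative assumptions, not the authors' implementation.

    # Sketch of the three persona-level repeated-sampling approaches.
    # Assumes the OpenAI Python SDK and an API key in OPENAI_API_KEY;
    # prompt text and parameter values are illustrative only.
    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4.1"     # assumed model identifier
    TEMPERATURE = 0.8     # higher values increase sampling randomness

    def ask(prompt, **kwargs):
        """One chat-completion call; temperature controls output variability."""
        resp = client.chat.completions.create(
            model=MODEL,
            temperature=TEMPERATURE,
            messages=[{"role": "user", "content": prompt}],
            **kwargs,
        )
        return resp.choices[0]

    # (1) multiple-conversation: repeat the same persona prompt n times,
    #     collecting one synthetic response per conversation.
    def multi_conv(persona_prompt, n=10):
        return [ask(persona_prompt).message.content for _ in range(n)]

    # (2) single-conversation: one prompt asks for responses from a group
    #     of n synthetic personas at once.
    def single_conv(persona_prompt, n=10):
        group = f"Answer the survey as {n} distinct synthetic personas.\n{persona_prompt}"
        return ask(group).message.content

    # (3) probability sampling: request token log-probabilities from a single
    #     prompt and draw item responses from that distribution downstream.
    def prob_sampling(persona_prompt, top=5):
        return ask(persona_prompt, logprobs=True, top_logprobs=top).logprobs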
Method
This study examined health-based behavioral regulation among 19 after-school staff members using interviews and the 15-item Behavioral Regulation in Exercise Questionnaire (BREQ), a 6-point scale assessing four subscales: external, introjected, identified, and intrinsic regulation. Synthetic data were generated in a factorial design varying prompt content, randomness, persona count, and LLM. Three prompt conditions (survey only; survey + interview; survey + interview + demographics) were crossed with two temperature settings (0.3 for low randomness, 0.8 for high), three persona sample sizes per staff member (10, 50, 100), and three LLMs (GPT‑4.1, Gemini 2.5 Flash, and Claude 4 Sonnet).
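
A minimal sketch of the resulting generation grid, with illustrative condition labels (3 prompt conditions x 2 temperatures x 3 persona counts x 3 LLMs = 54 cells per staff member):

    # Sketch of the factorial design described above; labels are illustrative,
    # not the authors' exact condition names.
    from itertools import product

    prompt_conditions = ["survey_only", "survey_interview", "survey_interview_demographics"]
    temperatures = [0.3, 0.8]            # low vs. high randomness
    persona_counts = [10, 50, 100]       # synthetic personas per staff member
    models = ["gpt-4.1", "gemini-2.5-flash", "claude-4-sonnet"]

    conditions = list(product(prompt_conditions, temperatures, persona_counts, models))
    print(len(conditions))  # 3 x 2 x 3 x 3 = 54 generation cells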
We produced synthetic responses via three persona‑level approaches: multiple‑conversation (multiConv), single‑conversation (singleConv), and probability (prob) sampling. Generated and human responses were then compared using descriptive statistics, RMSE, scale reliability coefficients, and CFA model fit indices.
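
Two of the comparison metrics can be computed directly; the sketch below, using numpy with placeholder data, shows RMSE between synthetic and human item means and Cronbach's alpha for scale reliability (CFA fit indices would typically come from a dedicated SEM package).

    # Sketch of two comparison metrics: RMSE against human item means and
    # Cronbach's alpha for scale reliability. Data arrays are placeholders.
    import numpy as np

    def rmse(synthetic_means, human_means):
        """Root mean squared error between item-level means."""
        s = np.asarray(synthetic_means, dtype=float)
        h = np.asarray(human_means, dtype=float)
        return float(np.sqrt(np.mean((s - h) ** 2)))

    def cronbach_alpha(scores):
        """Cronbach's alpha for a respondents-by-items score matrix."""
        X = np.asarray(scores, dtype=float)
        k = X.shape[1]
        item_var_sum = X.var(axis=0, ddof=1).sum()
        total_var = X.sum(axis=1).var(ddof=1)
        return float((k / (k - 1)) * (1 - item_var_sum / total_var))

    # Toy example: 4 respondents x 3 items on a 6-point scale.
    toy = np.array([[5, 4, 5], [2, 2, 3], [4, 4, 4], [6, 5, 6]])
    print(rmse(toy.mean(axis=0), [4.0, 4.0, 4.5]), cronbach_alpha(toy))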
Results
Preliminary results using GPT‑4.1 with 10 synthetic personas per staff member showed that multiConv and singleConv had means closer to the human data than prob, but with lower SDs (Figures 2.3-2.4). MultiConv showed minimal variability at temperature = 0.3 and the lowest RMSEs (Table 2.1), while prob matched human SDs better but had the largest mean deviation and highest RMSEs. The diffuse response probabilities from prob made outputs less controlled, weakening their correspondence with the human data. SingleConv achieved the highest reliability (Table 2.2) and the best model fit. The four-factor model fit the multiConv and singleConv data well but failed for the human and prob data (Table 2.3). These results highlight the value of LLM-generated data in settings where real samples are too small to support model estimation.
Significance
LLM-based data augmentation can accelerate psychometric research by reducing the need for prolonged data collection and enabling rapid item and model refinement. It may overcome small-sample and convergence issues through large synthetic cohorts, allowing psychometric models to be tested under controlled scenarios when real data are limited, thereby improving the robustness of results.

Authors