Job performance assessments that simulate real-world tasks and elicit respondents' reasoning are increasingly used to measure leaders' capacity to engage in effective practices. This study examines the viability of Large Language Models (LLMs) for evaluating the substantial text data these assessments often generate. Using responses from 189 aspiring principals in Tennessee to a teacher hiring scenario, we evaluate six LLMs across three prompting strategies with a structured codebook. The LLMs produced valid responses for 96% of item scores and showed reduced variability with more detailed prompting. Higher-reasoning models (e.g., GPT-4o, Claude 3.7-Sonnet) demonstrated strong inter-rater reliability with trained human annotators. Findings suggest LLMs can scale assessment scoring while preserving fidelity to human norms, advancing the methodological toolkit for leadership researchers.
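The abstract does not include code, but the central reliability check it describes, comparing LLM-assigned item scores with those of trained human annotators, can be sketched in a few lines. The snippet below is a minimal illustration only: the example scores, the 0-3 rubric scale, and the use of quadratic-weighted Cohen's kappa are assumptions for demonstration, not the authors' actual pipeline or data.

```python
# Minimal sketch (not the authors' code): agreement between human and
# LLM-assigned rubric scores on a single assessment item.
# Assumes ordinal item scores (e.g., 0-3) already extracted from both raters.
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores for ten responses to one rubric item.
human_scores = [2, 3, 1, 0, 2, 3, 2, 1, 3, 2]
llm_scores   = [2, 3, 1, 1, 2, 3, 2, 1, 3, 3]

# Quadratic weighting penalizes larger disagreements more heavily,
# a common choice for ordinal rubric scales.
kappa = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"Quadratic-weighted kappa: {kappa:.2f}")
```

In practice, a study like this would compute such agreement statistics per item and per model-prompt combination to compare the six LLMs and three prompting strategies against the human benchmark.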