A Validation Framework for Automated Transcription in Social Science

Mon, August 10, 2:00 to 3:30pm, TBA

Abstract

Automated speech recognition (ASR) tools such as Whisper, Otter.ai, and Descript are increasingly routine infrastructure in social science research and in the organizations social scientists study. The appeal is obvious: ASR promises to transform hours of audio into searchable, codable text in minutes. The question this paper asks is simple: how well do these tools actually work, and for whom? We audit ASR performance on publicly available recordings of parole and pardon hearings in Connecticut and Louisiana, a setting that combines high institutional consequence with extreme linguistic diversity, and propose a validation framework for ASR use in social science research.

Spoken language is socially structured in ways that matter for ASR performance. Accent, dialect, register, and acoustic conditions are systematically patterned by race, class, and region. When ASR tools learn to hear from training corpora skewed toward standard-variety English, their errors are not randomly distributed: they concentrate among speakers whose speech departs most from that implicit norm.

Our preliminary findings from Connecticut confirm this pattern and reveal an additional problem. Applying OpenAI's Whisper-Small to 87 segments of a single pardon hearing, we find a mean segment-level word error rate (WER) of 40.4% and a median of 18.9%, with 30 of 87 segments exceeding 25% WER. More troublingly, four segments exhibit hallucination: the model generates entirely fabricated content, including a loop of a single English sentence repeated dozens of times while a Spanish-speaking applicant addresses the board. A hallucinated transcript is unflagged. It looks, typographically, like any other transcript.

We propose a tiered validation framework analogous to interrater reliability reporting in qualitative coding: segments below 10% WER are reliable for most purposes; segments at 10–25% warrant selective review; segments above 25% require full re-transcription; hallucinated segments must be excluded entirely. We extend this analysis to Louisiana and additional states to test whether WER scales with sociolinguistic distance from ASR training norms, and decompose error rates by speaker role to assess whether failures concentrate among applicants relative to institutional authorities.
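The tiered framework maps directly onto a simple decision rule. The sketch below, in Python, shows one way it might be implemented, assuming segment-level human reference transcripts are available and using the jiwer library for WER (our choice of tool; the abstract does not name one). The thresholds are the abstract's; the hallucination check is a hypothetical heuristic keyed to the repeated-sentence failure mode described above, not the paper's actual detection method.

```python
# A minimal sketch of the tiered validation rule, assuming segment-level
# human reference transcripts. Thresholds (10% and 25% WER) follow the
# abstract; looks_hallucinated() is an illustrative assumption.
from collections import Counter

import jiwer  # pip install jiwer; standard word error rate computation


def looks_hallucinated(hypothesis: str, max_repeat_ratio: float = 0.5) -> bool:
    """Flag a segment if a single sentence dominates it, mimicking the
    repeated-loop failure mode described in the abstract. (Hypothetical
    heuristic for illustration only.)"""
    sentences = [s.strip().lower() for s in hypothesis.split(".") if s.strip()]
    if len(sentences) < 4:
        return False  # too short to judge repetition
    top_count = Counter(sentences).most_common(1)[0][1]
    return top_count / len(sentences) > max_repeat_ratio


def classify_segment(reference: str, hypothesis: str) -> tuple[float, str]:
    """Return (WER, tier) for one segment under the proposed framework."""
    wer = jiwer.wer(reference, hypothesis)
    if looks_hallucinated(hypothesis):
        return wer, "exclude: hallucination"  # fabricated content is unusable
    if wer < 0.10:
        return wer, "reliable for most purposes"
    if wer <= 0.25:
        return wer, "selective review"
    return wer, "full re-transcription"
```

For example, classify_segment("good morning members of the board", "good morning member of the board") yields a WER of about 0.17 and the tier "selective review", while a segment whose output is one sentence looped dozens of times is routed to "exclude: hallucination" regardless of its WER.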

Authors