Qualitative inquiry has long grappled with the tension between producing in-depth, granular data that answers unique and important questions (Cicourel 1982; Gans 1999) and persistent concerns about reproducibility, transparency, and the challenge of validating claims (Abramson and Dohan 2015; Duneier 2011; Goldthorpe 2000). Though recent works aim to produce public-use qualitative data sets (Edin et al. 2024), tensions among traditional qualitative approaches, FAIR principles for shared data (Wilkinson et al. 2016), and ethnographic scandals (Lubet 2017) remain unresolved. Further, little work accounts for the emergent possibilities and pitfalls for primary qualitative data sources created by the expansion of artificial intelligence, which not only provides a new toolkit for social science but also places all information at risk of becoming data infrastructure for commercial language model training.
Longstanding debates address masking, pseudonyms, and anonymization (Reyes 2018), distinctions between quantitative replication and qualitative analogs (Gong and Abramson 2020; Duneier 2011; Klinenberg 2006), and open-science repositories for qualitative text (Murphy, Jerolmack, and Smith 2021). These concerns become more urgent as qualitative works scale up to larger samples and data processing intersects with AI tools (Abramson et al. 2026).
Alongside works arguing that practical elements of data construction are essential but overlooked (Pardo-Guerra and Pahwa 2022; Brower et al. 2019), this paper introduces a deidentification tool (De-id) for the systematic replacement of direct and indirect identifiers with standardized placeholders. Despite its implementation in large-scale projects, De-id has received little attention in sociology relative to fields like medicine (DuBois et al. 2023).
We implement offline machine learning, usable on consumer hardware, to deidentify and index qualitative data at multiple levels of security and contextual granularity, adapting best practices from clinical NLP and computational ethnography (Li and Abramson 2025). Our tool supports tiered access to studies containing over 2,000 interviews and over 77,000 paragraphs of ethnographic text. We argue that offline deidentification is a proactive, researcher-controlled step for data protection in an era of AI, one that can also help address longstanding concerns about transparent and reproducible qualitative analysis.
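To make the core idea concrete, the sketch below illustrates placeholder-based deidentification with a separable key map, the pattern that underlies tiered access (the masked text can circulate broadly while the key stays offline with the research team). This is a minimal illustration under stated assumptions, not the De-id tool itself: the function name and placeholder format are invented for this example, and the abstract's actual pipeline uses offline machine-learning models to detect identifiers, a step not shown here (identifiers are passed in by hand).

```python
import re

def deidentify(text, identifiers):
    """Replace known identifiers with standardized, indexed placeholders.

    `identifiers` is a list of (value, category) pairs, e.g. ("Maria", "NAME").
    Returns the masked text plus a key map that can be stored separately,
    so that only approved analysts can re-link placeholders to identities.
    NOTE: illustrative only -- a real pipeline would detect identifiers
    with NER models rather than take a hand-built list.
    """
    key_map = {}
    masked = text
    for value, category in identifiers:
        if value not in key_map:
            # Standardized placeholder: category plus a stable index
            key_map[value] = f"[{category}-{len(key_map) + 1}]"
        # Whole-word replacement so substrings of other words are untouched
        masked = re.sub(rf"\b{re.escape(value)}\b", key_map[value], masked)
    return masked, key_map

masked, key = deidentify(
    "Maria met Dr. Chen at Harborview Clinic.",
    [("Maria", "NAME"), ("Chen", "NAME"), ("Harborview Clinic", "ORG")],
)
print(masked)  # [NAME-1] met Dr. [NAME-2] at [ORG-3].
```

Keeping the key map as a distinct artifact is what enables multiple levels of security: lower tiers receive only `masked`, while higher tiers can combine it with `key` under controlled conditions.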