Paper Summary

Hallucination vs Interpretation: Rethinking Accuracy and Precision in AI-Assisted Data Extraction for Knowledge Synthesis

Wed, April 8, 11:45am to 1:15pm PDT, JW Marriott Los Angeles L.A. LIVE, Floor: 2nd Floor, Platinum J

Abstract

Introduction: Knowledge syntheses, or literature reviews, are foundational to educational scholarship in the professions, consolidating findings to develop and refine theories and practices. Data extraction, integral to knowledge syntheses, is labor-intensive, requiring human researchers to systematically gather detailed information across many manuscripts. Recent advances in artificial intelligence (AI), particularly large language models (LLMs), offer potential efficiency gains but raise significant concerns about accuracy. Specifically, distinguishing AI-generated "hallucinations" (fabricated or incorrect content) from legitimate interpretive variability driven by subjective judgment is crucial to assessing AI's suitability for data extraction.
Methods: We developed an extraction platform, MAKMAO (Machine-Assisted Knowledge extraction, Multiple-Agent Oversight), that uses LLMs for automated data extraction. We evaluated extraction accuracy by comparing AI-generated responses with human responses across 187 manuscripts from a published scoping review in medical education. We measured consistency using interrater reliability for categorical responses and thematic similarity ratings for open-ended responses. Human-human consistency, assessed through manual re-extraction of a targeted subset of the data, provided a comparative benchmark. Additionally, AI-AI consistency was evaluated through repeated extractions of identical question/publication pairs to characterize run-to-run variability and flag interpretive ambiguity.
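As a concrete illustration of the interrater-reliability measure for categorical responses, the sketch below computes Cohen's kappa between paired AI and human answers to a single extraction question. This is a minimal example assuming scikit-learn, not MAKMAO's actual pipeline; the question and labels are hypothetical.

```python
# A minimal sketch, assuming scikit-learn, of the AI-human agreement
# check for categorical extraction responses. Labels are illustrative;
# this is not MAKMAO's actual implementation.
from sklearn.metrics import cohen_kappa_score

# Hypothetical categorical answers to one extraction question
# (one answer per manuscript) from a human extractor and from the AI.
human_labels = ["yes", "no", "yes", "unclear", "yes", "no"]
ai_labels    = ["yes", "no", "no",  "unclear", "yes", "no"]

# Cohen's kappa corrects raw percent agreement for chance agreement;
# values near 1 indicate strong AI-human consistency on this question.
kappa = cohen_kappa_score(human_labels, ai_labels)
print(f"AI-human kappa for this question: {kappa:.2f}")
```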
Results: MAKMAO demonstrated high consistency with human responses for straightforward extraction questions explicitly addressed within manuscripts (e.g., title, aims). Consistency decreased, however, for questions requiring subjective interpretation or lacking explicit description in the manuscript (e.g., Kirkpatrick's outcomes, methodological rationale). Notably, human-human comparisons revealed similar patterns of variability, suggesting that interpretive differences among human researchers contributed substantially to the observed discrepancies. AI-AI consistency further reinforced the conclusion that interpretive judgment, rather than hallucination, was the predominant source of variability: repeated AI extractions effectively flagged interpretive complexity without requiring extensive human input.
Discussion: Our findings indicate that variability in AI-assisted data extraction stems predominantly from interpretive complexity rather than hallucination. This interpretive variability mirrors human extraction practice, highlighting the intrinsic subjectivity of knowledge synthesis tasks. Consequently, while AI holds promise as a transparent and reliable partner in knowledge synthesis, caution is warranted regarding over-reliance on AI-generated interpretations, which could inadvertently sideline the critical human insight, contextual knowledge, and expertise essential to nuanced understanding.
Leveraging repeated AI extractions allows researchers to identify and refine questions prone to interpretive ambiguity. By distinguishing legitimate interpretive variability from undesirable ambiguity or hallucination, researchers can integrate AI assistance strategically, preserving methodological rigor while improving efficiency.
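One way this flagging step could be operationalized, offered as a hedged sketch rather than MAKMAO's actual implementation: repeat each extraction several times on the same publication and flag any question whose modal answer falls below an agreement threshold. The question IDs, answers, and threshold below are illustrative assumptions.

```python
# A minimal sketch of flagging interpretively ambiguous questions via
# repeated AI extraction. Data and threshold are hypothetical.
from collections import Counter

def flag_ambiguous_questions(repeated_runs: dict[str, list[str]],
                             threshold: float = 0.8) -> list[str]:
    """Flag questions whose repeated AI extractions disagree.

    repeated_runs maps a question ID to the answers produced by several
    independent extraction runs on the same publication. A question is
    flagged when the modal answer's share of runs falls below
    `threshold`, signalling interpretive ambiguity rather than a single
    stable reading.
    """
    flagged = []
    for question, answers in repeated_runs.items():
        modal_share = Counter(answers).most_common(1)[0][1] / len(answers)
        if modal_share < threshold:
            flagged.append(question)
    return flagged

# Illustrative: five repeated runs per question on one manuscript.
runs = {
    "study_design":        ["RCT", "RCT", "RCT", "RCT", "RCT"],
    "kirkpatrick_outcome": ["Level 2", "Level 3", "Level 2",
                            "Level 1", "Level 3"],
}
print(flag_ambiguous_questions(runs))  # -> ['kirkpatrick_outcome']
```

Flagged questions could then be routed to human reviewers or reworded before the full extraction run, concentrating human effort where interpretive judgment is actually needed.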
Conclusion: This study underscores the necessity of critically evaluating AI-generated data extraction, highlighting interpretive judgment as a pivotal factor influencing consistency across both human and AI extractors. AI platforms like MAKMAO, particularly when used iteratively, can help researchers identify and clarify the interpretive complexities inherent in knowledge synthesis tasks. Thoughtful integration of AI thus offers opportunities to systematize extraction processes, standardize responses, and substantially improve the efficiency and depth of knowledge syntheses in educational research.
