Paper Summary
Can Large Language Models Generate Reliable Test Items? A Hallucination-Centric Evaluation

Sat, April 11, 1:45 to 3:15pm PDT, InterContinental Los Angeles Downtown, Floor: 7th Floor, Hollywood Ballroom I

Abstract

Large language models (LLMs) have accelerated the large-scale creation of test items for educational assessments. However, the reliability of the generated items, especially their vulnerability to hallucinations, remains underexplored. This paper presents a hallucination-centric evaluation based on multiple-choice questions (MCQs) from the RACE dataset. First, our results show that LLM-generated keys exhibit lower levels of hallucination than human-crafted keys, whereas LLM-generated distractors exhibit higher levels. Second, LLMs tend to produce distractors with varying levels of factual accuracy, as illustrated by a greater disparity in hallucination levels between keys and distractors. Third, LLMs struggle to maintain consistent factual accuracy across all components of an item, as indicated by weaker correlations between the hallucination levels of keys and distractors.
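
The abstract does not specify how hallucination levels are scored, so the snippet below is an illustrative sketch only, not the authors' pipeline. It assumes each MCQ item already carries per-component hallucination scores in [0, 1] (the field names and toy values are hypothetical) and shows how the key-distractor disparity and correlation described above could be computed.

```python
# Illustrative sketch only -- NOT the authors' evaluation pipeline.
# Assumes each item already has hallucination scores in [0, 1] for its
# key and distractors (hypothetical field names and toy values below).
import numpy as np

items = [
    {"key_score": 0.05, "distractor_scores": [0.40, 0.35, 0.55]},
    {"key_score": 0.10, "distractor_scores": [0.20, 0.60, 0.45]},
    {"key_score": 0.02, "distractor_scores": [0.50, 0.30, 0.70]},
]

key = np.array([it["key_score"] for it in items])
dis = np.array([np.mean(it["distractor_scores"]) for it in items])

# Disparity: how much more the distractors hallucinate than the key.
disparity = float(np.mean(dis - key))

# Correlation: do items with factually sound keys also have factually
# sound distractors? A weak correlation would suggest inconsistent
# factual accuracy across item components.
correlation = float(np.corrcoef(key, dis)[0, 1])

print(f"mean key-distractor disparity: {disparity:.3f}")
print(f"key-distractor correlation:   {correlation:.3f}")
```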

Authors