Large language models (LLMs) have accelerated the large-scale creation of test items for educational assessments. However, the reliability of the generated items, especially their vulnerability to hallucinations, has not been thoroughly investigated. This paper provides a hallucination-centric evaluation based on multiple-choice questions (MCQs) from the RACE dataset. First, our results show that LLM-generated keys exhibit lower levels of hallucination than human-crafted ones, whereas their distractors exhibit higher levels. Second, LLMs tend to choose distractors with varying levels of factual accuracy, as illustrated by a greater disparity between keys and distractors. Third, LLMs have difficulty ensuring consistent factual accuracy across all components of an item, as shown by weaker correlations between keys and distractors.
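The abstract's three comparisons (mean hallucination levels, key–distractor disparity, and key–distractor correlation) can be illustrated with a minimal sketch. The per-item scores below are placeholder values, not results from the paper, and the scoring scale (higher = more hallucinated content) is an assumption for illustration only.

```python
# Hypothetical sketch: comparing hallucination scores of MCQ keys vs. distractors.
# Each item is assumed to have one score for its key and one (averaged) score
# for its distractors; the numbers here are illustrative placeholders.
from statistics import mean
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Placeholder per-item hallucination scores (higher = more hallucinated).
key_scores = [0.10, 0.05, 0.20, 0.15, 0.08]
distractor_scores = [0.40, 0.35, 0.55, 0.30, 0.45]

print("mean key hallucination:       ", round(mean(key_scores), 3))
print("mean distractor hallucination:", round(mean(distractor_scores), 3))
print("key-distractor disparity:     ",
      round(mean(distractor_scores) - mean(key_scores), 3))
print("key-distractor correlation:   ",
      round(pearson(key_scores, distractor_scores), 3))
```

A larger disparity between the two means corresponds to the second finding, and a weaker correlation between the two score lists corresponds to the third.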