Large language models (LLMs) have shown promising results in automated writing evaluation, including the provision of scores and feedback comments. Their performance is typically evaluated by their alignment with human scoring. This research examines the psychometric properties of rater effects in LLMs using Many-Facet Rasch Measurement. In an analysis of an English Language Arts assessment dataset, the LLM mirrored the overall severity of the human expert and demonstrated good score agreement; however, the recommended feedback diverged. Moreover, an interaction effect between domain and rater emerged: the LLM was relatively lenient on idea development but more severe on language conventions. These findings suggest that even when globally aligned with a human rater, LLMs can exhibit subtle variations in scoring.
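
For reference, a minimal sketch of the many-facet Rasch model in its standard rating-scale form, assuming examinee, domain, and rater facets; the facet structure shown here is illustrative and the study's exact specification may differ:

% Many-facet Rasch model (rating scale form), illustrative facet structure:
% examinee n, scoring domain i, rater j, score category k
\[
  \ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
  = \theta_n - \delta_i - \alpha_j - \tau_k
\]
% \theta_n : writing proficiency of examinee n
% \delta_i : difficulty of scoring domain i (e.g., idea development, conventions)
% \alpha_j : severity of rater j (human expert or LLM)
% \tau_k   : threshold of score category k relative to category k-1

Under this parameterization, a rater-by-domain interaction term (severity of rater j varying across domains i) would capture the pattern reported above, where the LLM is lenient on idea development but severe on language conventions.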