This study investigates the psychometric properties and scoring characteristics of Large Language Models (LLMs) as automated essay raters, benchmarking their performance against human judgements. Using 1,244 non-native English essays from the Cambridge Learning Corpus, we conducted a comparative analysis of holistic scores from human raters, ChatGPT-4o, and Gemma3-12B. Results indicate significant differences in scoring patterns and moderate inter-rater reliability (ICC = 0.54). While LLMs showed promising correlations with human scores (r = 0.58 for ChatGPT-4o), Bland-Altman analysis revealed systematic discrepancies, particularly at score extremes. This research offers insights into the reliability, agreement, and potential biases of LLM-based assessment, highlighting both the capabilities and limitations of LLM raters for robust educational applications.
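
For readers unfamiliar with the agreement statistics named above, the sketch below illustrates how ICC, Pearson correlation, and Bland-Altman limits of agreement can be computed for two sets of essay scores. This is not the authors' analysis code; it assumes synthetic placeholder scores and uses the pingouin and SciPy libraries as one possible toolchain.

```python
# Illustrative sketch only: agreement statistics for paired human/LLM essay scores.
# `human` and `llm` here are placeholder arrays, not the study's data.

import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

rng = np.random.default_rng(0)
human = rng.integers(1, 10, size=100).astype(float)          # placeholder human scores
llm = np.clip(human + rng.normal(0, 1.5, size=100), 1, 9)    # placeholder LLM scores

# Pearson correlation between human and LLM holistic scores
r, p = stats.pearsonr(human, llm)

# Intraclass correlation: pingouin expects long format (essay, rater, score)
long = pd.DataFrame({
    "essay": np.tile(np.arange(len(human)), 2),
    "rater": ["human"] * len(human) + ["llm"] * len(llm),
    "score": np.concatenate([human, llm]),
})
icc = pg.intraclass_corr(data=long, targets="essay", raters="rater", ratings="score")

# Bland-Altman: mean difference (bias) and 95% limits of agreement
diff = llm - human
bias = diff.mean()
loa = (bias - 1.96 * diff.std(ddof=1), bias + 1.96 * diff.std(ddof=1))

print(f"Pearson r = {r:.2f}, bias = {bias:.2f}, LoA = ({loa[0]:.2f}, {loa[1]:.2f})")
print(icc[["Type", "ICC"]])
```

In a Bland-Altman plot, score differences are plotted against score means, which makes systematic discrepancies at the extremes of the scale (as reported in the abstract) visible as a trend in the differences rather than a constant bias.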