Paper Summary

Automatic Evaluation of Conversational AI Chatbots Using Large Language Models

Sat, April 26, 3:20 to 4:50pm MDT, The Colorado Convention Center, Floor: Meeting Room Level, Room 705

Abstract

Automatic evaluation methods have advanced considerably in recent years, yet evaluating learner-created computational artifacts such as project-based code remains challenging. This study investigates the capability of GPT-4, a state-of-the-art Large Language Model (LLM), to assess learner-created computational artifacts. Specifically, we analyze the source code of 75 chatbots built predominantly by middle school learners. We compare four LLM prompting strategies ranging from example-based to rubric-informed approaches. The experimental results indicate that the LLM-based evaluation module achieves substantial agreement (Cohen’s weighted κ = 0.797) with human evaluators in two of the five artifact dimensions, moderate agreement in one, and fair agreement in the remaining two. The findings demonstrate the potential of LLMs for automatically evaluating project-based, open-ended computational artifacts.
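
The abstract reports agreement between the LLM and human evaluators using Cohen's weighted κ. The following is a minimal sketch of how such agreement can be computed; the example scores and the quadratic weighting scheme are illustrative assumptions, not the paper's data or stated methodology.

```python
# Illustrative computation of weighted Cohen's kappa for LLM-human agreement.
# The ratings below are hypothetical ordinal rubric scores (e.g., 1-4) for
# one artifact dimension; the paper's actual data and weighting are not shown here.
from sklearn.metrics import cohen_kappa_score

human_scores = [3, 4, 2, 4, 1, 3, 2, 4, 3, 2]
llm_scores   = [3, 4, 2, 3, 1, 3, 3, 4, 3, 2]

kappa = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"Weighted Cohen's kappa: {kappa:.3f}")
```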

Authors