Paper Summary

Automatic Evaluation of Conversational AI Chatbots Using Large Language Models

Sat, April 26, 3:20 to 4:50pm MDT, The Colorado Convention Center, Floor: Meeting Room Level, Room 705

Abstract

Automatic evaluation methods have advanced considerably in recent years, yet evaluating learner-created computational artifacts such as project-based code remains challenging. This study investigates the capability of GPT-4, a state-of-the-art Large Language Model (LLM), to assess learner-created computational artifacts. Specifically, we analyze the source code of 75 chatbots built predominantly by middle school learners. We compare four LLM prompting strategies ranging from example-based to rubric-informed approaches. The experimental results indicate that the LLM-based evaluation module achieves substantial agreement (Cohen’s weighted κ = 0.797) with human evaluators in two of the five artifact dimensions, moderate agreement in one, and fair agreement in the remaining two. The findings demonstrate the potential of LLMs for automatically evaluating project-based, open-ended computational artifacts.
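
The abstract reports agreement between the LLM and human evaluators using Cohen's weighted κ. The following is a minimal sketch of how such agreement can be computed; the example scores and the quadratic weighting scheme are illustrative assumptions, not the paper's data or stated methodology.

```python
# Illustrative computation of weighted Cohen's kappa for LLM-human agreement.
# The ratings below are hypothetical ordinal rubric scores (e.g., 1-4) for
# one artifact dimension; the paper's actual data and weighting are not shown here.
from sklearn.metrics import cohen_kappa_score

human_scores = [3, 4, 2, 4, 1, 3, 2, 4, 3, 2]
llm_scores   = [3, 4, 2, 3, 1, 3, 3, 4, 3, 2]

kappa = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"Weighted Cohen's kappa: {kappa:.3f}")
```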

Authors