Paper Summary
Exploring Large Language Models' Responses to Moral Reasoning Dilemmas

Fri, April 25, 8:00 to 9:30am MDT (8:00 to 9:30am MDT), The Colorado Convention Center, Floor: Meeting Room Level, Room 606

Abstract

This study compares the moral reasoning capabilities of several large language models (LLMs). On the Defining Issues Test-2 (DIT-2), Claude had the highest post-conventional score (P-score) at 72, followed by Gemini Advanced (64) and Gemini (58). The remaining LLMs scored as follows: Grok (48), ChatGPT-4o (44), ChatGPT-4 (44), and ChatGPT-3.5, which had the lowest P-score (18). These results indicate that LLMs can simulate high levels of moral reasoning. On the Intermediate Concept Measure (ICM), Educational Leaders version, Gemini Advanced had the highest total ICM score (0.90), followed by Gemini (0.86), Claude (0.86), ChatGPT-4o (0.78), ChatGPT-4 (0.78), Grok (0.61), and ChatGPT-3.5 (0.32).

Authors