Paper Summary
Exploring Large Language Models' Responses to Moral Reasoning Dilemmas

Fri, April 25, 8:00 to 9:30am MDT (8:00 to 9:30am MDT), The Colorado Convention Center, Floor: Meeting Room Level, Room 606

Abstract

This study compares the moral reasoning capabilities of several large language models (LLMs). On the Defining Issues Test-2 (DIT-2), Claude had the highest post-conventional score (P-score) at 72, followed by Gemini Advanced (64) and Gemini (58). The remaining LLMs scored as follows: Grok (48), ChatGPT-4o (44), ChatGPT-4 (44), and ChatGPT-3.5, which had the lowest P-score (18). These results indicate that LLMs can simulate high levels of moral reasoning. On the Intermediate Concept Measure (ICM), Educational Leaders version, Gemini Advanced had the highest total ICM score (0.90), followed by Gemini (0.86), Claude (0.86), ChatGPT-4o (0.78), ChatGPT-4 (0.78), Grok (0.61), and ChatGPT-3.5 (0.32).

Authors