Individual Submission Summary
Evaluating Large Language Models as Judicial Decision-Makers

Fri, September 5, 3:30 to 4:45pm, Deree | Classrooms, DC 607

Abstract

Large Language Models (LLMs) are increasingly shaping decision-making across many domains, yet how closely their judgments align with human judgment remains a critical open question. This study examines the extent to which LLMs can serve as judicial decision-makers by comparing a dataset of over 123 sentencing decisions made by retired judges on two fictional cases, one involving rape and one involving violence, with the decisions of LLMs on the same cases. We evaluate three LLMs (GPT, Gemini, and Claude) under three prompting strategies: zero-shot, few-shot, and chain-of-thought. Our findings reveal that the LLMs sentenced more consistently than the human judges in both cases, producing lower sentence disparity. In the violence-related case, all LLMs generated sentencing decisions statistically similar to those of the human judges. In the rape-related case, however, the LLMs consistently imposed harsher sentences in terms of years of imprisonment; notably, only Gemini, under the few-shot strategy, aligned its sentencing with the human judges. These findings have important implications for evaluating whether LLMs can approximate human judgment in legal decision-making. The greater consistency of the LLMs raises questions about the variability of human judicial decisions and whether AI-driven systems could help reduce sentencing disparities.

Authors