Paper Summary

Fairness vs. Overcorrection: Investigating Input Effects in AI-Powered Writing Assessment

Wed, April 8, 11:45am to 1:15pm PDT, Los Angeles Convention Center, Floor: Level Two, Poster Hall - Exhibit Hall A

Abstract

This study examines how demographic inputs and rubric-based criteria affect the fairness of AI-generated writing scores. Using data from 652 adolescents in a randomized trial, we analyzed writing scored by ChatGPT-4o and DeepSeek-V2 under four input conditions: response-only, response+demographics, response+rubric, and response+both. A five-criteria rubric guided scoring, and simulated socioeconomic and religious data allowed broader bias testing. Demographic cues led to higher scores for several groups, including transgender and LGBQ+ youth, suggesting potential leniency. Rubric-based scoring lowered scores, reflecting increased consistency. When both inputs were provided, these effects offset each other. ChatGPT was more sensitive to input conditions overall, while DeepSeek was more reactive to demographic cues. These findings highlight the evolving nature of AI bias and underscore the need to separate scoring tools from identity-sensitive inputs for equitable assessment.

Author