Generative AI is necessarily based on the past: by design, it reproduces patterns in its training data, data produced by (parts of) a biased society (Bender et al., 2021). When using AI in educational contexts, it is critical to consider how this bias may affect students. Perhaps most important is understanding how subtle indicators of student identity may shape how the AI responds to a student.
This study examines whether large language models exhibit differential assessment patterns based on socioeconomic identity cues in student work. While focused on mathematics grading—arguably the most objective assessment domain—this research serves as a proof-of-concept for more pervasive, subtle biases that may emerge as AI mediates educational experiences.
Theoretical Framework
Drawing on critical algorithm studies (Noble, 2018; Benjamin, 2019) and theories of algorithmic discrimination (Barocas & Selbst, 2016), this work examines how AI systems encode and reproduce societal biases. Recent scholarship has documented AI bias across multiple educational contexts, from essay evaluation to feedback patterns (Author, 2024, 2025). The present study extends that research by examining whether identity cues embedded in student work may elicit biased responses.
Methods
I designed matched student work samples with identical mathematical content but different cultural markers. Sample A included upper-SES indicators (Whole Foods, Tesla, violin recitals), while Sample B featured lower-SES markers (Dollar General, pickup truck, quinceañeras; see Table 1). Eight leading LLMs graded each sample 30 times. Because the mathematical content is identical, both samples merit the same score: 67% (4/6 correct). Grade differences were analyzed using Welch's t-tests; a sketch of the grading procedure follows Table 1.
Table 1. Example Mathematics Work
Sample A (upper-SES markers)
My mom bought 3 packages of organic quinoa at Whole Foods for $8.99 each. She paid with a $50 bill. How much change did she get back? Answer: $50 - (3 × $8.99) = $50 - $26.97 = $23.03
Sample B (lower-SES markers)
My mom bought 3 packages of rice at Dollar General for $8.99 each. She paid with a $50 bill. How much change did she get back? Answer: $50 - (3 × $8.99) = $50 - $26.97 = $23.03
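The abstract does not include the data-collection code; the following is a minimal sketch of the repeated-grading design, assuming a hypothetical grade_sample() helper that wraps each provider's API and parses a numeric grade (0-100) from the model's reply.

    # Display names from the results table; actual API model identifiers
    # differ by provider and are an assumption here.
    MODELS = [
        "Claude Haiku 3", "GPT-4 Turbo", "GPT-4.1", "GPT-4.1 Nano",
        "GPT-4o mini", "Gemini 1.5 Flash 002", "Gemini 2.0 Flash 001", "GPT-4o",
    ]
    RUNS_PER_SAMPLE = 30  # each model grades each sample 30 times

    def grade_sample(model: str, sample_text: str) -> float:
        """Hypothetical helper: send the student work to `model` with a
        grading prompt and parse the numeric score from the response."""
        raise NotImplementedError  # provider-specific API call goes here

    def collect_grades(samples: dict[str, str]) -> dict[tuple[str, str], list[float]]:
        """Grade every sample repeatedly with every model (8 x 2 x 30 = 480 runs)."""
        grades: dict[tuple[str, str], list[float]] = {}
        for model in MODELS:
            for label, text in samples.items():  # e.g., {"A": ..., "B": ...}
                grades[(model, label)] = [
                    grade_sample(model, text) for _ in range(RUNS_PER_SAMPLE)
                ]
        return grades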
The study analyzed 480 grading instances across eight LLMs, examining both aggregate patterns and model-specific behaviors to understand the landscape of AI bias in educational assessment.
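The reported statistics can be computed from the collected grades with a short script. The sketch below uses SciPy's ttest_ind with equal_var=False for Welch's t-test and a standard pooled-SD Cohen's d; the exact analysis code is an assumption, not given in the abstract.

    import numpy as np
    from scipy import stats

    def compare_samples(grades_a: list[float], grades_b: list[float]):
        """Return (delta, t, p, d) for Sample B vs. Sample A, matching the
        columns of the results table (Table 2)."""
        a = np.asarray(grades_a, dtype=float)
        b = np.asarray(grades_b, dtype=float)
        delta = b.mean() - a.mean()  # Δ = mean(B) − mean(A)
        # Welch's t-test does not assume equal variances; with zero variance
        # in both groups (ceiling effects) the statistic is undefined (nan),
        # matching the "—" cells in the table.
        t, p = stats.ttest_ind(b, a, equal_var=False)
        # Cohen's d using the pooled standard deviation.
        pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
            len(a) + len(b) - 2
        )
        d = delta / np.sqrt(pooled_var)
        return delta, t, p, d

Per-model comparisons use n = 30 per group; the aggregate row pools all 240 gradings of each sample across the eight models.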
Results
Table 2. Grading Differences by Model

Model                   Δ (B – A)   t (df)          p       d
All samples             –1.41       –2.42 (464.3)   .016    –0.22
Claude Haiku 3           0.00       —               —       —
GPT-4 Turbo             –2.13       3.81 (≈48)      .001    –0.88
GPT-4.1                  0.00       —               —       —
GPT-4.1 Nano             0.00       —               —       —
GPT-4o mini             +1.00       –0.97 (≈58)     .338    +0.26
Gemini 1.5 Flash 002     0.00       —               —       —
Gemini 2.0 Flash 001    –10.17      9.46 (≈29)      <.001   –3.45
GPT-4o                   0.00       —               —       —

Note. Δ = mean grade difference (Sample B – Sample A). Welch's t-tests were used due to unequal variances; Cohen's d is reported for effect size. "—" indicates the test could not be computed due to zero variance (perfect scores in both groups).
Several models exhibited ceiling effects (awarding perfect scores regardless of errors), which prevented bias detection; the models that did differentiate, however, revealed troubling patterns. The overall analysis showed statistically significant bias (t(464.3) = –2.42, p = .016), driven primarily by two models: GPT-4 Turbo systematically favored upper-SES samples (d = –0.88, p = .001), while Gemini 2.0 Flash showed extreme bias, a 10-point gap favoring upper-SES work (d = –3.45, p < .001).
Significance
These findings have far-reaching implications for AI and educational equity. First, if bias emerges in objective math assessment, what happens in subjective domains? Consider AI evaluating college essays, providing personalized learning recommendations, or assessing "soft skills": areas where cultural biases have even more room to operate. Second, the variation across models (from ceiling effects to severe bias) suggests inconsistent and unpredictable equity impacts as schools rapidly adopt different AI systems. Finally, educational technologies carry implicit values. When AI exhibits bias, it may subtly shape student self-perception, aspirations, and opportunities through countless daily interactions.
This work provides empirical evidence that AI can discriminate based on student background even in supposedly objective tasks. As Williamson & Eynon (2020) note, the datafication of education demands critical examination of how AI may amplify existing inequities.