Paper Summary

Evaluating Instructional Alignment in LLM-Generated Math Tasks Using Expert Teacher Annotations

Thu, April 9, 2:15 to 3:45pm PDT, Westin Bonaventure, Floor: Lobby Level, Palos Verdes

Abstract

As large language models (LLMs) enter classroom applications, their instructional alignment with educational standards remains a critical but undermeasured dimension of trustworthiness. This study evaluates the pedagogical validity of GPT-4, Mixtral, and LLaMA-2 by analyzing 300 elementary math problems against U.S. Common Core standards, using expert teacher annotations from the MathFish benchmark. GPT-4 demonstrates substantially higher alignment (53.7%) than Mixtral (34.3%) and LLaMA-2 (18.4%), along with higher F1 scores and more consistent classification performance. Confusion matrix analysis reveals systematic issues in the open-weight models, including off-grade content and conceptual drift. We propose instructional alignment as a stakeholder-centered trust metric for educational LLMs, with implications for responsible model deployment in high-stakes learning environments.
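To make the evaluation concrete, the sketch below shows one plausible way to compute the metrics the abstract names (alignment rate, macro F1, and a confusion matrix) from expert-annotated versus model-assigned standard labels. This is not the authors' code; the standard codes and data are illustrative placeholders, and the assumption that alignment means exact-match between the predicted and expert-annotated standard is ours.

```python
# Minimal sketch of standards-alignment scoring; labels and data are
# illustrative, not drawn from the paper or the MathFish benchmark.
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Hypothetical expert (gold) standard labels and one model's predictions.
expert = ["3.OA.A.1", "3.OA.A.1", "4.NBT.B.5", "3.NF.A.1", "4.NBT.B.5"]
model  = ["3.OA.A.1", "4.NBT.B.5", "4.NBT.B.5", "3.NF.A.1", "3.OA.A.1"]

labels = sorted(set(expert) | set(model))

# Alignment rate: fraction of problems whose assigned standard exactly
# matches the expert annotation (our assumed definition).
print("alignment:", accuracy_score(expert, model))

# Macro-averaged F1 across standards, mirroring the per-model comparison.
print("macro F1:", f1_score(expert, model, average="macro", labels=labels))

# Confusion matrix: rows are expert standards, columns are predictions;
# off-diagonal mass would surface off-grade or conceptually drifted items.
print(confusion_matrix(expert, model, labels=labels))
```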

Authors