As large language models (LLMs) enter classroom applications, their instructional alignment with educational standards remains a critical but undermeasured dimension of trustworthiness. This study evaluates the pedagogical validity of GPT-4, Mixtral, and LLaMA-2 by analyzing 300 elementary math problems against U.S. Common Core standards, using expert annotations from the MathFish benchmark. GPT-4 demonstrates substantially higher alignment (53.7%) compared to Mixtral (34.3%) and LLaMA-2 (18.4%), with superior F1 scores and consistent classification performance. Confusion matrix analysis reveals systematic issues in open models, including off-grade content and conceptual drift. We propose instructional alignment as a stakeholder-centered trust metric for educational LLMs, with implications for responsible model deployment in high-stakes learning environments.