Name: CIES 2026 Annual Conference
Start: 2026-03-28T00:00:00-07:00
End: 2026-04-02T00:00:00-07:00

Developing Education-Focused AI Benchmarks

In Event: Navigating Education in the Age of AI: Insights from AI Observatory’s approach to evidence generation and guidance to decision makers

Wed, April 1, 1:15 to 2:30pm, Hilton, Floor: Fourth Floor - Tower 3, Union Square 21

Proposal

Ministry staff, donors and other decision-makers are increasingly inundated with opportunities to introduce new AI-EdTech tools into schools, but how can they be confident that the AI outputs are good? AI benchmarks are one component of a wider quality assurance framework, focusing specifically on the quality of AI outputs.

Education-focused AI benchmarks can help policymakers better understand whether AI systems may be a good fit for education systems and whether AI outputs can meet the needs of their users. The AI Observatory, in its role of supporting decision-makers in low- and middle-income countries with distilled practical insights, has partnered with AI-for-Education.org to develop educational AI benchmarks.

AI benchmarks are like an exam for AI systems, designed to assess a specific skill in a standardized way, resulting in a score that allows for comparison between systems. A benchmark consists of a problem specification, a dataset, and a defined score. Correct answers are often referred to as the ground truth.
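As a rough illustration of the three parts named above, the following minimal Python sketch scores two hypothetical models on a tiny multiple-choice dataset. The item IDs, answer labels, model names and the use of plain accuracy as the score are all assumptions made for illustration; they are not taken from the benchmarks described in this proposal.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Item:
    """One multiple-choice question in a hypothetical benchmark dataset."""
    question_id: str
    choices: Tuple[str, ...]
    ground_truth: str  # label of the correct choice


def accuracy(items: List[Item], predictions: Dict[str, str]) -> float:
    """Defined score: fraction of items answered correctly.

    A missing prediction counts as a wrong answer.
    """
    if not items:
        raise ValueError("dataset is empty")
    correct = sum(
        1 for item in items
        if predictions.get(item.question_id) == item.ground_truth
    )
    return correct / len(items)


# Illustrative dataset (invented): four questions with known correct answers.
LABELS = ("A", "B", "C", "D")
DATASET = [
    Item("q1", LABELS, "B"),
    Item("q2", LABELS, "D"),
    Item("q3", LABELS, "A"),
    Item("q4", LABELS, "C"),
]

# Invented model outputs, keyed by question ID.
MODEL_X = {"q1": "B", "q2": "D", "q3": "C", "q4": "C"}
MODEL_Y = {"q1": "B", "q2": "A", "q3": "C"}  # q4 left unanswered

# The same dataset and score applied to each system makes them comparable.
scores = {
    "model_x": accuracy(DATASET, MODEL_X),
    "model_y": accuracy(DATASET, MODEL_Y),
}
print(scores)
```

Because every system is scored against the same ground truth with the same rule, the resulting numbers can be compared directly, which is the property that makes a benchmark useful for comparison.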

Most LLM benchmarks focus on verbal reasoning to test models’ language and logic skills. Their visual reasoning abilities are much less explored, despite growing evidence of weaknesses. For example, our earlier Visual Maths Benchmark found that even advanced models struggle with image-based early grade maths problems.

This is a big problem for foundational learning, which often depends on visuals. Teaching foundational numeracy follows a “Concrete, Pictorial, Abstract” methodology. Students first use physical objects known as manipulatives, such as counters, blocks, beads, or cubes, to represent mathematical ideas. Once children are comfortable with concrete materials, they transition to drawings, diagrams, and visual models. The final stage involves using mathematical symbols and numbers without the need for physical or visual aids. For AI to support this work effectively, it must handle a wide range of visual tasks. Accuracy matters: students need good feedback, misconceptions must not be reinforced, and trust in tools has to be earned.

Our Visual Reasoning Benchmark incorporates multiple-choice questions from end-of-primary, non-verbal cognitive aptitude assessments used in Zambia and India. They involve visual tasks such as pattern recognition, matching and spatial reasoning, similar to the well-known Raven’s Progressive Matrices.

We find that AI models struggle significantly with these visual reasoning questions, and that performance varies widely between models. These results highlight the challenge, show that further effort is needed from big tech in this area, and identify exactly which types of questions models are failing at. The results are shown in a public leaderboard, enabling developers to compare models when choosing which to use in their products. We also include cost and size filters to help developers identify the most suitable choice for their context, which is particularly relevant for LMICs. Decision-makers can take this evidence as a reason for caution when using AI tools for tasks involving visual reasoning, where teachers and students may not be able to rely on the outputs as confidently as in other educational contexts.
