Ministry staff, donors and other decision-makers are increasingly inundated with opportunities to introduce new AI-EdTech tools into schools, but how can they be confident that the AI outputs are good? AI benchmarks are one element of the wider quality assurance framework, and they focus specifically on the quality of AI outputs.
Education-focused AI benchmarks can help policymakers better understand whether AI systems may be a good fit for education systems and whether AI outputs can meet the needs of their users. The AI Observatory, in its role of supporting decision-makers in low- and middle-income countries (LMICs) with distilled practical insights, has partnered with AI-for-Education.org to develop educational AI benchmarks.
AI benchmarks are like an exam for AI systems, designed to assess a specific skill in a standardized way, resulting in a score that allows for comparison between systems. A benchmark consists of a problem specification, a dataset, and a defined score. Correct answers are often referred to as the ground truth.
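To make the three parts concrete, here is a minimal sketch of a multiple-choice benchmark, assuming accuracy (share of answers matching the ground truth) as the defined score. The `Item` class, the `accuracy` function, and the sample questions and model answers are all hypothetical illustrations, not the benchmark's actual data or scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One benchmark question: an id, the answer options, and the ground truth."""
    question_id: str
    options: tuple
    ground_truth: str


def accuracy(items, predictions):
    """Defined score (assumed here): fraction of items answered correctly.

    `predictions` maps question_id -> chosen option. A missing answer
    counts as wrong, so every system is scored on the same full dataset.
    """
    if not items:
        raise ValueError("benchmark dataset is empty")
    correct = sum(
        1 for item in items
        if predictions.get(item.question_id) == item.ground_truth
    )
    return correct / len(items)


# Hypothetical dataset of four multiple-choice items.
OPTIONS = ("A", "B", "C", "D")
items = [
    Item("q1", OPTIONS, "B"),
    Item("q2", OPTIONS, "D"),
    Item("q3", OPTIONS, "A"),
    Item("q4", OPTIONS, "C"),
]

# Hypothetical answers from two systems under comparison.
model_x = {"q1": "B", "q2": "A", "q3": "A", "q4": "C"}
model_y = {"q1": "B", "q2": "D"}  # left q3 and q4 unanswered

score_x = accuracy(items, model_x)
score_y = accuracy(items, model_y)
print(f"model_x: {score_x:.2f}, model_y: {score_y:.2f}")
```

Because both systems are scored against the same items and the same ground truth, the two numbers are directly comparable, which is the point of standardization.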
Most LLM benchmarks focus on verbal reasoning, testing models’ language and logic skills. Their visual reasoning abilities are much less explored, despite growing evidence of weaknesses. For example, our earlier Visual Maths Benchmark found that even advanced models struggle with image-based early-grade maths problems.
This is a big problem for foundational learning, which often depends on visuals. Teaching foundational numeracy follows a “Concrete, Pictorial, Abstract” methodology. Students first use physical objects known as manipulatives, such as counters, blocks, beads, or cubes, to represent mathematical ideas. Once children are comfortable with concrete materials, they transition to drawings, diagrams, and visual models. The final stage involves using mathematical symbols and numbers without the need for physical or visual aids. For AI to support this work effectively, it must handle a wide range of visual tasks. Accuracy matters: it ensures that students get good feedback, that misconceptions are not reinforced, and that trust in the tools is earned.
Our Visual Reasoning Benchmark incorporates multiple-choice questions from end-of-primary, non-verbal cognitive aptitude assessments used in Zambia and India. They involve visual tasks such as pattern recognition, matching and spatial reasoning, similar to the well-known Raven’s Progressive Matrices.
We find that AI models struggle significantly with these visual reasoning questions, and that performance varies widely between models. The benchmark highlights these challenges, shows exactly which types of questions models are failing at, and indicates that further effort is needed from big tech in this area. Results are presented in a public leaderboard, enabling developers to compare models and decide which to use in their products. We also include cost and size filters to help developers identify the most suitable choice for their context, which is particularly relevant for LMICs. Decision-makers can take this evidence as a caution about using AI tools for visual reasoning tasks, where teachers and students may not be able to rely on the outputs as confidently as in other educational contexts.
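As an illustration of how cost and size filters narrow a leaderboard, the sketch below shortlists models under a budget and size ceiling and ranks the remainder by score. The field names, the `shortlist` function, and every model name and number are placeholders chosen for the example; they are not results from the benchmark or the structure of the actual leaderboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One hypothetical leaderboard row."""
    model: str
    accuracy: float            # benchmark score between 0 and 1
    cost_usd_per_1k: float     # assumed unit: USD per 1,000 questions
    params_billions: float     # assumed size measure: parameter count


def shortlist(entries, max_cost_usd_per_1k, max_params_billions):
    """Keep entries within both limits, then rank by accuracy, best first."""
    eligible = [
        e for e in entries
        if e.cost_usd_per_1k <= max_cost_usd_per_1k
        and e.params_billions <= max_params_billions
    ]
    return sorted(eligible, key=lambda e: e.accuracy, reverse=True)


# Placeholder rows, invented purely to exercise the filter.
leaderboard = [
    Entry("model-a", accuracy=0.62, cost_usd_per_1k=12.0, params_billions=400.0),
    Entry("model-b", accuracy=0.55, cost_usd_per_1k=1.5, params_billions=8.0),
    Entry("model-c", accuracy=0.41, cost_usd_per_1k=0.4, params_billions=3.0),
]

# A developer with a tight budget and limited hardware.
candidates = shortlist(leaderboard, max_cost_usd_per_1k=2.0, max_params_billions=10.0)
for entry in candidates:
    print(entry.model, entry.accuracy)
```

In this invented example the highest-scoring model is excluded by both limits, which is the kind of trade-off such filters are meant to surface for resource-constrained contexts.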