Group Submission Type: Formal Panel Session
The recent step change in the abilities of Artificial Intelligence (AI) brings huge opportunities for education globally, as well as new challenges. The cost of generating text, speech and video has fallen dramatically, so although knowledge has never been so accessible, there is a high risk that education systems will be overwhelmed by large volumes of low-quality content and poor digital pedagogy. How can governments, developers and implementers be supported to quality-assure AI-powered solutions that are intended for use by students, teachers and other users in the world of AI-EdTech?
This is the first in a series of two linked panels focused on ensuring quality and standards in AI-EdTech. In this first panel, we showcase examples of testing and measuring quality from across the product development cycle, and in the second panel we build on this to see how such examples can inform and be guided by international quality standards in AI-EdTech.
AI-for-Education.org’s quality assurance framework for AI solutions covers four aspects of an evidence-informed product development cycle: developing the solution; getting it into the hands of users; understanding the user experience; and measuring the impact. Within the first aspect, solution development, we include measuring the quality of the AI components. Here we follow the wider AI developer community and further differentiate between ‘benchmarks’, which are generally use-case-agnostic tests of knowledge or ability, and ‘evaluations’, which tie quality assurance to specific tasks such as generating grade-3 reading passages.
Further on in the quality assurance cycle, measuring impact is complicated by the long duration and high cost of traditional randomised controlled trials (RCTs), which often makes them unsuitable for the fast-changing environment of AI solutions, where new models are released regularly. One key area of work is designing methods of rapid-efficacy evaluation that are affordable for EdTech companies or donors to support and that report in a time frame more closely aligned with AI release cycles.
Drawing on the CIES 2026 theme of "Re-examining Education and Peace in a Divided World", this panel showcases the experience of organisations ranging from developers of locally tailored AI solutions to international organisations that are developing benchmarks, supporting evaluations and guiding methods of evaluating efficacy more rapidly.
In Afghanistan, Lapis developed a tutor chatbot delivered via WhatsApp to complement its existing work on curriculum-based TV shows, radio broadcasts and digital learning platforms, which together offer a range of remote learning modalities. To develop an appropriate, usable and safe chatbot for Afghanistan, Lapis needed to meet a number of quality criteria: safety for girls and for the implementing organisation; cultural and linguistic fit; curriculum alignment and relevance to learners’ study journeys; pedagogical soundness; and student access despite bandwidth constraints. Safety, privacy and sensitivity all brought unique challenges in the Afghanistan context. Lapis therefore relied on several metrics to measure quality and effectiveness. It applied benchmarks to ensure sufficient quality when selecting the AI model, including sufficient proficiency in the local Dari and Pashto languages. In addition, task-specific evaluations, digital analytics and user feedback were all used to ensure quality throughout the product development cycle.
AI-for-Education.org is working to support quality assurance of AI-powered education solutions across the product development cycle. We have developed benchmarks with public leaderboards on key educational aspects of pedagogy and visual reasoning. One recent area of research has looked at gender bias in AI-generated children’s stories. It examines bias in the overall frequency of representation, in professions (are certain professions more frequently represented by one gender?) and in adjectives (are certain types of adjectives more frequently used for one gender?). The resulting tool is public, so developers can also apply it to evaluate outputs within their own solutions. We also present emerging findings on the application of rapid-efficacy studies to measure impact more appropriately in the fast-changing world of AI.
The Language, Reasoning and Education Lab at ETH Zurich has similarly been applying LLMs to the challenge of providing high-quality pedagogical responses through Intelligent Tutoring Systems. Tutoring is an open-ended task requiring a tutor to a) detect learner mistakes and provide targeted hints, b) infer the learner’s evolving knowledge, and c) generate new problems of suitable difficulty for the learner. LLMs bring huge potential to this sphere but do not meet some of the key challenges: they particularly struggle to provide long-term, evolving support that adapts to the learner’s precise needs and sensibly guides what comes next. ETH Zurich has therefore developed automated benchmarks such as MathTutorBench, as well as both automatic and human evaluations, to measure the performance of AI-powered tutors and to iterate improvements in their development and use in university and middle-school contexts.
IDinsight brings a cross-sectoral perspective on evaluating AI systems deployed in real-world development contexts. The organization has built AI-powered solutions at scale: a maternal health chatbot in South Africa answering 40,000 monthly queries; an AI assistant for Community Health Workers in Ethiopia; an agent helping Indian youth access apprenticeships and government benefits; and most recently, tools for generating multilingual textbooks and supporting teachers with structured pedagogy in Senegal. From these experiences, three key lessons emerge. First, representative “golden” datasets are essential yet costly. Second, while synthetic data can accelerate early testing, it cannot substitute for authentic user data. Third, evaluation must be continuous, with systems in place to monitor real-world performance, biases, and drift. To reduce the barriers to continuous learning and improvement, IDinsight and The Agency Fund launched Evidential, a free, open-source tool that enables NGOs and public sector partners to run rapid, real-world A/B tests.
Chairing and discussant inputs from Central Square Foundation and the International Centre for EdTech Impact will show how these practical examples of testing and measuring quality from across the product development cycle can tie into and inform international quality standards in AI-EdTech.
Safe and Effective AI for Learning: A Chatbot for Afghanistan - Alex Thier, Lapis
Quality Assurance in AI including Benchmarking of Gender Bias - Alasdair Mackintosh, AI-for-Education.org
Evaluating and Improving LLMs on Pedagogy - Mrinmaya Sachan
AI evaluations lessons and tools: a cross-sectoral perspective - Marc Shotland, IDinsight