Individual Submission Summary

AI evaluations lessons and tools: a cross-sectoral perspective.

Sun, March 29, 2:45 to 4:00pm, Hilton, Floor: Fourth Floor - Tower 3, Union Square 18

Proposal

IDinsight brings a unique cross-sectoral perspective to the field of AI evaluation. Its work spans citizen-facing services, frontline health worker support, education technology, and access to government benefits, offering a vantage point on both the promise and the pitfalls of AI in real-world development contexts.

Over the last three years, IDinsight has launched several large-scale AI-powered solutions. In South Africa, a conversational agent responds to over 40,000 maternal and child health queries each month, providing timely and trustworthy information directly to mothers. In Ethiopia, IDinsight designed and deployed an AI assistant for Community Health Workers, supporting them in diagnosis and adherence to clinical protocols. In India, IDinsight built an AI agent to help young people navigate apprenticeships and government benefits in their native tongue. Currently, IDinsight is developing AI-enabled tools to support structured pedagogy in Senegal.

Across these efforts, evaluation has been central not just to measuring impact ex post, but also to guiding design choices, flagging risks, and informing scaling strategies. Three lessons have emerged as particularly salient:

Building representative “golden” datasets is costly but indispensable.
Evaluations of AI systems depend on benchmark datasets that faithfully capture the linguistic, cultural, and contextual realities of end-users. Constructing these datasets, whether code-switched maternal health queries in South Africa, Amharic clinical vignettes, or apprenticeship application questions in Hindi and Marathi, requires sustained investment. The costs are not only financial but also organizational: domain experts must be engaged and diverse samples curated.

Synthetic data has real limits.
While synthetic data generation can accelerate dataset creation, IDinsight’s work shows it cannot fully substitute for authentic user data. Synthetic examples often lack the nuance, ambiguity, and contextual cues embedded in real interactions, which are precisely the features that test an AI system’s robustness. Synthetic data is useful for prototyping and stress-testing, but evaluations must ultimately rely on high-quality, real-world data to be meaningful.

Evaluation is not a one-off event but a continuous process.
AI systems evolve in response to both user behavior and external conditions. A model that performs well at launch may degrade as usage scales, as languages shift, or as the surrounding information ecosystem changes. IDinsight has therefore emphasized building pipelines for ongoing monitoring: feedback loops that capture user satisfaction, analysis of engagement data to surface emerging topics, and dashboards that track performance drift over time.
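As an illustration of what tracking performance drift can mean in practice, the following is a minimal sketch, not IDinsight's pipeline. It assumes a hypothetical list of per-period quality scores (for example, weekly accuracy on a golden dataset) and flags periods whose trailing average falls below the launch baseline by more than a chosen tolerance. The function name, window sizes, and tolerance are all illustrative assumptions.

```python
from statistics import mean


def detect_drift(scores, baseline_size=4, window=4, tolerance=0.05):
    """Return (baseline, flagged_indices) for a series of quality scores.

    Hypothetical helper: `scores` is assumed to be one evaluation score
    per period (e.g. weekly accuracy), ordered from launch onward.
    A period is flagged when the mean of the trailing `window` scores
    is more than `tolerance` below the mean of the first
    `baseline_size` scores.
    """
    if len(scores) < baseline_size + window:
        raise ValueError("not enough periods to compare against the baseline")

    baseline = mean(scores[:baseline_size])
    flagged = []
    # Start once a full trailing window exists after the baseline periods.
    for end in range(baseline_size + window, len(scores) + 1):
        recent = mean(scores[end - window:end])
        if baseline - recent > tolerance:
            flagged.append(end - 1)  # index of the last period in the window
    return baseline, flagged


# Made-up scores for illustration only.
weekly_scores = [0.90, 0.91, 0.89, 0.90, 0.90, 0.89,
                 0.88, 0.86, 0.82, 0.80, 0.78, 0.77]
baseline, flagged = detect_drift(weekly_scores)
print(f"baseline={baseline:.2f}, flagged periods={flagged}")
```

A trailing average is used rather than single-period comparisons so that one noisy week does not trigger an alert; a production dashboard would likely also account for sample size in each period.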

Learning, iterating, and continuously improving should be core to how every non-profit operates. Unfortunately, experimentation is often too onerous and expensive. In collaboration with The Agency Fund, IDinsight launched Evidential, a free, open-source tool that enables NGOs and public sector partners to run rigorous A/B tests in real-world settings. Evidential helps implementers quickly learn what works, for whom, and under what conditions.
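For readers less familiar with A/B testing, the comparison at its core can be sketched as a two-proportion z-test. This is a generic textbook calculation using only the Python standard library; it does not represent Evidential's interface or its statistical methods, and the function name and the counts below are assumptions made up for illustration.

```python
from math import erf, sqrt


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Compare success rates of arm A (control) and arm B (variant).

    Returns (difference, z, p_value), where difference = rate_b - rate_a
    and p_value is two-sided under a normal approximation.
    """
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    # Pooled rate under the null hypothesis that both arms are equal.
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed through the error function.
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return rate_b - rate_a, z, p_value


# Hypothetical counts: users who completed a task in each arm.
diff, z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print(f"difference={diff:.3f}, z={z:.2f}, p={p_value:.3f}")
```

Answering "for whom" and "under what conditions" would mean repeating such a comparison within pre-specified subgroups, with appropriate care for multiple comparisons.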

IDinsight’s experience underscores that robust evaluation is not ancillary to AI development but foundational. By investing in representative datasets, recognizing the boundaries of synthetic data, and institutionalizing continuous monitoring, the sector can build AI systems that are not only technically impressive but also trustworthy, accountable, and impactful.

Author