Individual Submission Summary

Quality Assurance in AI including Benchmarking of Gender Bias

Sun, March 29, 2:45 to 4:00pm, Hilton, Floor: Fourth Floor - Tower 3, Union Square 18

Proposal

AI-for-Education.org’s quality assurance framework for AI solutions covers four aspects of an evidence-informed product development cycle: developing the solution; getting it into the hands of users; understanding the user experience; and measuring the impact. In this presentation we showcase examples from across this cycle.

The first aspect, solution development, includes measuring the quality of the AI components. We have developed benchmarks with public leaderboards covering key educational aspects of pedagogy and visual reasoning. One recent area of research examines gender bias in AI-generated children’s stories.

We set out to understand the gender bias and stereotypes exhibited in educational materials generated by LLMs. We evaluate a specific educational use case: generating short fictional narratives for assessing reading comprehension, loosely aligned to the grade-level standards of the Global Proficiency Framework for Reading (GPF). We ask models to generate texts appropriate for specific grades, from grade two (short, simple stories) through grade nine (longer, more complex stories). We use a large number of prompts to enable statistically robust analyses of the gender balance of the characters that appear in these texts. This approach surfaces both general patterns of bias and specific instances of bias in different parts of the text. We build on our benchmarking systems to test a wide variety of models, allowing us to compare models at different price points and from different companies.
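As a rough illustration of what a statistically robust gender-balance estimate could look like, the sketch below pools gendered word counts across many generated stories and reports the female share with a Wilson score interval. It is a minimal sketch under stated assumptions, not the benchmark's actual pipeline: the word lists, the function names, and the use of gendered word mentions as a proxy for character gender are all hypothetical choices made for illustration.

```python
import math
import re

# Hypothetical word lists: a crude proxy for character gender,
# assumed here for illustration only.
FEMALE_WORDS = {"she", "her", "hers", "girl", "woman", "mother", "sister"}
MALE_WORDS = {"he", "him", "his", "boy", "man", "father", "brother"}


def count_gendered(text):
    """Return (female_mentions, male_mentions) for one story."""
    tokens = re.findall(r"[a-z']+", text.lower())
    female = sum(token in FEMALE_WORDS for token in tokens)
    male = sum(token in MALE_WORDS for token in tokens)
    return female, male


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


def female_share(stories):
    """Pool counts over all stories; return (share, lower, upper)."""
    female_total = male_total = 0
    for story in stories:
        female, male = count_gendered(story)
        female_total += female
        male_total += male
    n = female_total + male_total
    lower, upper = wilson_interval(female_total, n)
    return female_total / n, lower, upper
```

With only a handful of stories the interval is wide, which is one reason to generate texts from a large number of prompts before comparing models.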

The analysis covers bias in overall frequency of representation, in professions (are certain professions more frequently represented by one gender?), and in adjectives (are certain types of adjectives more frequently used for one gender?). Our results show notable differences across AI models, as well as across professions. These results can inform developers’ choice of AI model and highlight deficiencies for big tech to resolve. Moreover, because results can be sensitive to the prompting used, we have developed this as a public tool that developers can also use to evaluate outputs within their own AI-powered solutions.
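The profession question can be framed as a test of association between profession and gender. The sketch below is one hedged way to do that, assuming characters have already been extracted as (profession, gender) pairs by some earlier step; the record format, the function names, and the choice of a Pearson chi-square statistic are assumptions for illustration, not a description of the public tool.

```python
from collections import Counter, defaultdict


def profession_gender_table(records):
    """Tally (profession, gender) pairs into a nested count table."""
    table = defaultdict(Counter)
    for profession, gender in records:
        table[profession][gender] += 1
    return table


def chi_square_statistic(table, genders=("female", "male")):
    """Pearson chi-square statistic for a profession-by-gender table.

    Larger values indicate a stronger departure from the counts expected
    if profession and gender were independent.
    """
    row_totals = {p: sum(c[g] for g in genders) for p, c in table.items()}
    col_totals = {g: sum(c[g] for c in table.values()) for g in genders}
    n = sum(row_totals.values())
    if n == 0:
        raise ValueError("table is empty")
    statistic = 0.0
    for profession, counts in table.items():
        for gender in genders:
            expected = row_totals[profession] * col_totals[gender] / n
            if expected > 0:
                statistic += (counts[gender] - expected) ** 2 / expected
    return statistic
```

For a table with two professions and two genders there is one degree of freedom, so a statistic above roughly 3.84 would suggest an association at the 5% level. The same tallying approach could be applied to adjective categories in place of professions.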

Another key challenge is that many EdTech tools lack robust evidence of their impact on learning outcomes, which is often at odds with policy makers’ wishes for evidence-based education policy and tech procurement. The long duration and high costs of traditional randomised controlled trials (RCTs) make them incompatible with the fast-changing environment of AI solutions, where new models are released regularly. At AI-for-Education.org we have been designing studies that are affordable for EdTech companies or donors to support and that report on a time frame better aligned with AI development cycles.

AI-for-Education.org is a community working to ensure equitable access to, and benefits from, AI in Education. This work on benchmarks is done in partnership with EdTech Hub’s AI Observatory. The AI Observatory exists to help ensure equitable learning for all in the age of AI, and is made possible by support from UK International Development.

Author