Session Submission Summary

Curating Educational Content

Mon, March 24, 2:45 to 4:00pm, Palmer House, Floor: 7th Floor, LaSalle 3

Group Submission Type: Formal Panel Session

Proposal

In the digital age, there is a huge and increasing amount of educational content openly available. However, this content varies in quality, which can stifle adoption of open resources and limit their usefulness. As the capabilities of Generative AI (Gen AI) continue to expand, ensuring high-quality, relevant, and culturally sensitive content has become paramount, particularly in educational applications. The effectiveness of Large Language Models (LLMs), such as ChatGPT, depends heavily on the quality of the data they are trained on. As the computational requirements to train and run current state-of-the-art LLMs increase, there is growing attention on the quality, rather than just the quantity, of training data. For example, Hugging Face recently released FineWeb-Edu, a large dataset of web-scraped data that has been filtered for educational relevance. Initial results show that models trained on this educational data perform better than those trained on similar amounts of unfiltered internet data.

The quality of open educational materials matters both for human users and for AI applications, but how to evaluate and, even more importantly, define quality remains an open question. In this panel, speakers will talk about their work and experiences curating educational content, as well as building on that curated data. Our speakers will cover their experience of sourcing educational material in different ways and with different focuses. Speaker 1 focuses on teaching materials for the UK school curriculum, Speaker 2 sources open educational materials from a broad range of sources for an international audience, and Speaker 3 will talk about work to filter existing web-scale datasets for high-quality educational content. Issues around sourcing content include not only organizing and hosting the data but also the licensing of the materials.

In addition to quality, it is also critical to consider the alignment of content to a pedagogical taxonomy covering, for example, subject, material type (lesson plan, reading, assessment), grade level, and educational standard. How to define such a taxonomy, particularly in a global context, is an open question. Another challenge is how to evaluate the alignment of content to a taxonomy at scale. LLMs provide a powerful tool to address this challenge. A common theme among the speakers will be how quality and alignment can be evaluated with LLMs. Speaker 1 has used LLM-as-a-judge techniques to evaluate AI-generated lesson plans on multiple aspects of quality and alignment. Speaker 2 uses AI and machine learning methods to digitize curriculum standards and to match relevant content items from a large library to particular learning objectives from a curriculum. Speaker 3 will talk about developing a global taxonomy for educational metadata, accompanied by labelling models and a large-scale annotated dataset.

Finally, our panelists will cover their experience of use cases for high-quality curated educational data. Speaker 1 used a curated set of high-quality UK-focused educational material to develop a generative AI tool that helps teachers generate high-quality lesson plans carefully aligned to UK curricula. Speaker 2 uses a dataset of open educational resources to provide offline access to high-quality educational materials to learners around the world. Speaker 3 will describe a large-scale annotated educational dataset that can be used for fine-tuning LLMs for educational applications, as well as for training data when developing new edtech tools in different contexts.

In the age of AI, data is more important than ever. This panel will discuss issues around curating educational materials at scale, as well as some of the exciting applications this enables now and the applications it could enable in the future.
