The quality of AI models, such as Large Language Models (LLMs), is intrinsically linked to the quality of the data they are trained on. We will present our efforts to develop a large-scale dataset of high-quality educational materials, focusing on foundational literacy and numeracy (FLN) for early grades (1-3) in a global context. This dataset is enriched with metadata detailing the alignment of each piece of content with specific subjects, educational levels, and contextual relevance, according to a carefully considered global educational taxonomy. Our content curation strategy involves two stages: first, obtaining relevant open educational materials; and second, evaluating the quality of these materials and assigning the relevant metadata.
We pursue two complementary strategies to source educational materials for the dataset. First, we work with global education experts to identify and obtain targeted high-quality educational resources, such as textbooks, lesson plans, and teacher guides, from established sources of recognized quality. Second, we filter existing web-scale LLM training datasets to extract only high-quality educational resources. These complementary approaches ensure we address both quality (through a base of manually selected, high-quality benchmark materials from the bottom-up process) and quantity (by automatically selecting high-quality material from web-scale datasets in a top-down selection process).
In parallel to the above, as part of the top-down filtering process, we develop models and systems to evaluate educational quality and classify individual pieces of content according to our metadata taxonomy. These models will be made openly available alongside the processed and annotated dataset for ongoing use by the community in other applications.
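As a minimal sketch of how such top-down filtering could operate, the pipeline below scores each document and keeps only those above a quality threshold. The keyword-based scorer is a deliberately simple stand-in for a trained quality classifier; the keyword list, threshold, and function names are illustrative assumptions, not the actual models described here.

```python
# Hypothetical sketch of a top-down quality filter over a web-scale corpus.
# A real pipeline would replace the heuristic scorer with a trained classifier.

# Keywords suggestive of early-grade FLN content (illustrative only).
FLN_KEYWORDS = {"phonics", "counting", "addition", "letters", "reading",
                "numeracy", "literacy", "grade 1", "grade 2", "grade 3"}

def score_educational_quality(text: str) -> float:
    """Toy quality score in [0, 1]: fraction of FLN keywords present."""
    lowered = text.lower()
    hits = sum(1 for kw in FLN_KEYWORDS if kw in lowered)
    return hits / len(FLN_KEYWORDS)

def filter_corpus(documents, threshold=0.2):
    """Keep only documents whose score meets the quality threshold."""
    return [doc for doc in documents
            if score_educational_quality(doc) >= threshold]

docs = [
    "A phonics lesson: letters and reading practice for grade 1 literacy.",
    "Quarterly earnings call transcript for a logistics company.",
]
kept = filter_corpus(docs)  # only the first document passes the filter
```

In practice the scorer would be a model trained on the expert-curated benchmark materials from the bottom-up process, so that the two sourcing strategies reinforce each other.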
Curating high-quality, diverse educational content at scale will have a number of downstream benefits. This dataset can be used for training smaller (and hence cheaper) models, or for fine-tuning larger existing models for specific educational applications (e.g., fine-tuning models to produce assessments in an appropriate format and at an appropriate level). This will result in generative-AI outputs that are more accurate and better aligned to the educational application and context. Most existing LLM training datasets are large collections of text with little additional information beyond the URL where each piece of content was found. Having detailed, validated metadata from a carefully considered, globally relevant educational taxonomy greatly enhances the potential uses of the data; for example, other machine-learning EdTech tools can be developed using this training data.
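To make the metadata concrete, a record in such a dataset might pair the raw text with taxonomy fields. The field names and example values below are hypothetical placeholders for the actual global educational taxonomy, chosen only to illustrate the shape of an annotated record.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class AnnotatedDocument:
    """One curated item with taxonomy metadata (field names are illustrative)."""
    text: str
    source_url: str
    subject: str          # e.g. "literacy" or "numeracy"
    grade_levels: list    # e.g. [1, 2] for grades 1-2
    content_type: str     # e.g. "textbook", "lesson_plan", "teacher_guide"
    quality_score: float  # output of a quality-evaluation model
    regions: list = field(default_factory=list)  # contextual relevance

record = AnnotatedDocument(
    text="Counting objects up to 20 with everyday examples.",
    source_url="https://example.org/lesson",  # placeholder URL
    subject="numeracy",
    grade_levels=[1],
    content_type="lesson_plan",
    quality_score=0.92,
    regions=["global"],
)
```

`asdict(record)` then yields a plain dictionary suitable for export, which is what makes such records usable as structured training data for other EdTech applications rather than as bare text.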
Content curation is essential for ensuring the quality and relevance of educational materials used to train AI models. We combine top-down filtering of large datasets with bottom-up curation of specific resources to create a large-scale dataset of high-quality materials. Our hope is that this dataset will facilitate the development of AI tools that are both effective and transparent, meeting the needs of learners, particularly in low- and middle-income countries (LMICs). While our initial focus is on FLN in early grades, the processes and structures we have developed are adaptable, and the data, code, and models underlying the processing pipelines are open-source.