The field of educational AI products is expanding rapidly, but there remains a lack of benchmarks focusing on education and pedagogy. This leaves developers and users unclear about which models are best suited to educational applications. The flexible, open-ended outputs produced by generative AI such as Large Language Models (LLMs) can be challenging to assess in a systematic way. The AI community has developed many benchmarking and testing approaches to evaluate models, from batteries of multiple-choice questions designed to assess knowledge, to online head-to-head comparisons where users vote on the responses to their query from multiple models. However, in a comprehensive review of existing benchmarks, we did not find any benchmarks of pedagogical knowledge. Pedagogical knowledge includes not just what to teach (i.e., subject knowledge) but how to teach it effectively, which is of crucial importance for a range of educational applications.
To address this, we developed a Cross-Domain Pedagogical Knowledge (CDPK) benchmark. Inspired by influential and widely adopted LLM benchmarks such as MMLU (which assesses model knowledge across a range of subject domains), CDPK uses multiple-choice questions to assess knowledge of pedagogy. An advantage of the multiple-choice approach is that it can easily be applied to any LLM (including smaller open models) to assess pedagogical knowledge with a simple numerical score. However, it is necessarily more abstract, as it is not linked to active performance of a particular educational task; for this reason, we refer to it as a "meta-pedagogical" benchmark. This contrasts with approaches like Google's LearnLM framework, which comprehensively assesses the actual interactive performance and practice of a learner-facing chatbot across multiple dimensions of pedagogy. Nevertheless, we expect the pedagogical knowledge evaluated by CDPK to translate into a range of educational applications, particularly teacher-facing tools. For example, an LLM that scores well on this benchmark would be expected to provide better advice on classroom practice in a teacher assistant chatbot, because deeper knowledge of the relevant pedagogical principles is embedded in the model.
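To illustrate the kind of evaluation loop such a multiple-choice benchmark enables, the following minimal sketch shows how any LLM could be scored for a simple accuracy figure. The item format, file name, and the `query_model` function are illustrative assumptions, not the published CDPK harness.

```python
# Minimal sketch of scoring an LLM on an MCQ-style benchmark such as CDPK.
# The item schema and `query_model` placeholder are assumptions for illustration.
import json

def query_model(prompt: str) -> str:
    """Placeholder for a call to the LLM under evaluation; returns raw text."""
    raise NotImplementedError

def score_benchmark(items_path: str) -> float:
    """Compute simple accuracy over a list of multiple-choice items.

    Each item is assumed to look like:
    {"question": "...", "options": {"A": "...", "B": "..."}, "answer": "B"}
    """
    with open(items_path) as f:
        items = json.load(f)

    correct = 0
    for item in items:
        options = "\n".join(f"{label}. {text}" for label, text in item["options"].items())
        prompt = (
            f"{item['question']}\n{options}\n"
            "Answer with the letter of the single best option."
        )
        response = query_model(prompt).strip().upper()
        # Take the first option letter that appears in the response.
        predicted = next((ch for ch in response if ch in item["options"]), None)
        correct += predicted == item["answer"]

    return correct / len(items)
```

Because the score is a single accuracy number over a fixed item set, it can be computed identically for frontier models and smaller open models, which is what makes the benchmark easy to apply broadly.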
CDPK was developed from multiple-choice questions drawn from a range of teacher training materials and qualifications, as well as pedagogical textbooks. These were augmented using current frontier models to generate a large set of questions in a systematic way. A balanced set of questions was obtained across pedagogical topic areas (informed by validated, high-quality teacher training materials) and question types (relating to different levels of Bloom's Taxonomy). Questions were validated in two steps: first, using LLMs to evaluate the questions against established MCQ best practices; and second, using human experts (holding a graduate-level educational qualification) to validate answer accuracy, clarity of wording, and plausibility of distractor answers. CDPK is openly available and applicable to any existing LLM, and we will provide initial results for a range of current models. With an increasing focus on reducing costs through smaller models, and increasing interest in open models, we hope that CDPK will be a useful tool for the EdTech community to evaluate LLMs for educational applications.
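The two-step validation and the balance check across topics and Bloom's levels could be organised along the lines of the sketch below. The rubric text, judge prompt, field names, and `query_judge` call are illustrative assumptions rather than the actual CDPK pipeline.

```python
# Hedged sketch of the validation idea described above: an LLM judge screens
# items against common MCQ-writing guidelines, then human experts review the
# survivors; a simple tabulation checks the set stays balanced by topic and
# Bloom level. Names and prompts are assumptions, not the CDPK implementation.
from collections import Counter

MCQ_RUBRIC = (
    "Check this multiple-choice question against MCQ best practices: "
    "a single unambiguous correct answer, plausible distractors, and "
    "no clues in option length or wording. Reply PASS or FAIL."
)

def query_judge(prompt: str) -> str:
    """Placeholder for a call to the judging LLM; returns raw text."""
    raise NotImplementedError

def llm_screen(items: list[dict]) -> list[dict]:
    """Keep only items the LLM judge marks as PASS; the rest would be
    revised or regenerated before human expert review."""
    kept = []
    for item in items:
        verdict = query_judge(f"{MCQ_RUBRIC}\n\n{item['question']}")
        if verdict.strip().upper().startswith("PASS"):
            kept.append(item)
    return kept

def balance_report(items: list[dict]) -> Counter:
    """Tabulate items by (topic, Bloom level) to check that screening has not
    skewed coverage of the benchmark."""
    return Counter((item["topic"], item["bloom_level"]) for item in items)
```

The point of the tabulation step is simply that automated and human filtering can silently thin out some topic areas or question types, so the balance is rechecked after each screening pass.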