Paper Summary

Validating Automated Teaching Effectiveness with Multimodal Data

Thu, April 9, 2:15 to 3:45pm PDT, Westin Bonaventure, Floor: Lobby Level, Santa Barbara C

Abstract

Objective. High-quality teaching is central to student learning; however, traditional assessments of teaching effectiveness—such as classroom observations and surveys from both students and teachers—face limitations in terms of scalability, reliability, and validity. This study aims to explore the potential of machine learning (ML) algorithms applied to multimodal classroom data (i.e., text, audio, video) to automate the assessment of 18 subdimensions of teaching effectiveness, covering classroom management, student support, and cognitive activation. We address three research questions: (1) Can ML-generated scores achieve reliability comparable to human observer ratings? (2) Are these scores content-valid? (3) Do they predict student learning outcomes?
Framework. Grounded in the Three Basic Dimensions framework (Praetorius et al., 2018), we conceptualize teaching effectiveness in terms of domain-general, pedagogical interactions rather than subject-specific practices. We integrate theoretical models of teaching as opportunities to learn with emerging perspectives from multimodal learning analytics and automated feedback systems. Our methodological approach also builds on research emphasizing the importance of fine-grained, segment-level observation protocols to capture teaching practices in ecologically valid settings.
Data and Methods. We used data from the German sample of the OECD’s Global Teaching Insights study (N = 46 teachers, N = 1,132 students, comprising 92 video-recorded math lessons), which included video, audio, and transcript segments (Klieme et al., 2023; OECD, 2020). Human observers scored each 16-minute segment on 18 standardized subdimensions of teaching effectiveness. We trained attention-based ML models that combine pretrained encoders for each modality via cross-modal attention and multitask learning to predict these scores (Hou et al., 2025). Reliability was evaluated via absolute accuracy, relative accuracy, and bias (Schraw, 2009). Content validity was assessed through blinded plausibility judgments by trained human raters. Predictive validity was assessed through structural equation modeling, examining associations with students’ tested mathematics achievement.
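To make the modeling and reliability setup concrete, the sketch below (PyTorch) illustrates one way such a pipeline could be structured: per-modality segment embeddings from pretrained encoders are fused via cross-modal attention, scored by 18 multitask regression heads, and compared against human ratings using absolute accuracy, relative accuracy, and bias indices. This is not the authors' implementation; all names, dimensions, and the exact operationalization of the indices (here: mean absolute deviation, Pearson correlation, and mean signed difference per subdimension) are illustrative assumptions.

```python
# Minimal sketch, assuming frozen pretrained encoders already produce
# per-modality embeddings for each 16-minute segment. All identifiers
# (CrossModalScorer, TEXT_DIM, reliability_indices, ...) are hypothetical.
import torch
import torch.nn as nn

TEXT_DIM, AUDIO_DIM, VIDEO_DIM, FUSE_DIM, N_SUBDIMS = 768, 512, 512, 256, 18

class CrossModalScorer(nn.Module):
    def __init__(self):
        super().__init__()
        # Project each modality's encoder output into a shared space.
        self.proj_text = nn.Linear(TEXT_DIM, FUSE_DIM)
        self.proj_audio = nn.Linear(AUDIO_DIM, FUSE_DIM)
        self.proj_video = nn.Linear(VIDEO_DIM, FUSE_DIM)
        # Cross-modal attention: text tokens attend to audio/video tokens.
        self.attn_audio = nn.MultiheadAttention(FUSE_DIM, num_heads=4, batch_first=True)
        self.attn_video = nn.MultiheadAttention(FUSE_DIM, num_heads=4, batch_first=True)
        # One regression head per subdimension (multitask learning).
        self.heads = nn.ModuleList([nn.Linear(3 * FUSE_DIM, 1) for _ in range(N_SUBDIMS)])

    def forward(self, text, audio, video):
        # text/audio/video: (batch, seq_len, modality_dim) segment embeddings
        t = self.proj_text(text)
        a = self.proj_audio(audio)
        v = self.proj_video(video)
        t2a, _ = self.attn_audio(t, a, a)   # text queries attend to audio keys/values
        t2v, _ = self.attn_video(t, v, v)   # text queries attend to video keys/values
        fused = torch.cat([t.mean(1), t2a.mean(1), t2v.mean(1)], dim=-1)
        return torch.cat([head(fused) for head in self.heads], dim=-1)  # (batch, 18)

def reliability_indices(pred, human):
    """One plausible reading of the three indices, computed per subdimension."""
    absolute_accuracy = (pred - human).abs().mean(dim=0)   # mean absolute deviation
    bias = (pred - human).mean(dim=0)                       # signed over-/under-scoring
    pred_c, hum_c = pred - pred.mean(dim=0), human - human.mean(dim=0)
    relative_accuracy = (pred_c * hum_c).sum(dim=0) / (
        pred_c.norm(dim=0) * hum_c.norm(dim=0) + 1e-8)      # Pearson correlation
    return absolute_accuracy, relative_accuracy, bias

# Usage with dummy segment embeddings (4 segments per batch):
model = CrossModalScorer()
scores = model(torch.randn(4, 20, TEXT_DIM), torch.randn(4, 30, AUDIO_DIM), torch.randn(4, 10, VIDEO_DIM))
```

Training such a model would typically minimize a joint loss across all 18 heads (e.g., mean squared error against human segment scores), which is what makes the setup multitask rather than 18 separate models.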
Results. ML-generated scores outperformed human ratings in absolute accuracy for 11 of 18 subdimensions and matched human reliability in discourse-related dimensions such as questioning and explanation. Content validity analyses indicated that ML-generated scores were judged to be at least as plausible as human ratings, with raters unable to consistently distinguish between them. Predictive validity findings were mixed: ML-generated scores based on text and audio predicted student achievement for several subdimensions (e.g., routines, cognitively demanding tasks); however, adding further modalities (e.g., video) did not improve predictions and occasionally introduced noise. Some unexpected negative associations (e.g., warmth, alignment) suggest the need for nuanced model interpretation and point to potential limitations in the operationalization of constructs.
Significance. This study provides strong initial evidence that multimodal ML can approximate or exceed human scoring of teaching effectiveness subdimensions, offering a scalable, cost-effective, and consistent alternative to traditional assessments. Although challenges remain, particularly in accounting for context-sensitive behaviors and ensuring interpretability, these findings advance the conversation on how AI can support teacher feedback, professional development, and educational research. Our work contributes to both theoretical models of teaching quality and the methodological toolkit for classroom analytics, laying a foundation for real-time, automated feedback systems grounded in validated constructs of effective teaching.
