Objective. Recent advances in computer vision, machine learning, and deep learning offer potential solutions to the challenges of using video at scale in research (Demszky et al., 2025; Jacobs et al., 2024; Kelly et al., 2025). This study drew on annotated video data and labeled audio transcript data from 50 hours of video of elementary mathematics and reading instruction to assess the capacity of a multimodal neural network model (i.e., a transformer network) to automatically classify instructional activities in the videos.
Perspective. We introduce a refined multimodal activity recognition pipeline that addresses limitations of existing approaches. First, our method leverages both aligned and non-aligned multimodal learning: while primarily non-aligned, to capture complex and non-synchronized patterns in videos, the pipeline also exploits temporal correspondence between the visual and audio modalities, an advantage of alignment-based approaches, to enhance multimodal feature fusion. Second, we introduce a gated visual-audio fusion mechanism that emphasizes only pertinent cross-modal dependencies and suppresses irrelevant fused features (see the sketch following this overview).
Third, instead of relying on entire video frames, we extract visual semantics that correspond to audio cues. Fourth, our transformer network is designed to recognize causal relationships in multimodal learning. Finally, our method incorporates inter-modal and intra-modal attention mechanisms to capture interdependencies between video and audio while also modeling audio and visual patterns independently.
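To make the fusion and attention components concrete, the sketch below illustrates one way the gated visual-audio fusion and the inter-/intra-modal attention described above could be realized in PyTorch. The module name, dimensions, pooling choice, and gating formulation are illustrative assumptions for exposition, not the exact architecture used in this study.

```python
# Illustrative sketch (PyTorch); names, dimensions, and the gating formulation
# are assumptions for exposition, not the study's exact architecture.
import torch
import torch.nn as nn


class GatedCrossModalBlock(nn.Module):
    """Inter-modal (cross) and intra-modal (self) attention with a gated fusion."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Intra-modal attention: each stream attends to its own temporal patterns.
        self.self_attn_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn_aud = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Inter-modal attention: visual tokens query audio tokens and vice versa,
        # so non-aligned streams can still exchange temporally corresponding cues.
        self.cross_attn_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_aud = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Gate: a sigmoid over the concatenated streams decides how much of each
        # modality's signal to admit, suppressing irrelevant fused features.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis: (batch, T_v, dim) visual tokens; aud: (batch, T_a, dim) audio tokens.
        vis_intra, _ = self.self_attn_vis(vis, vis, vis)
        aud_intra, _ = self.self_attn_aud(aud, aud, aud)
        vis_cross, _ = self.cross_attn_vis(vis_intra, aud_intra, aud_intra)
        aud_cross, _ = self.cross_attn_aud(aud_intra, vis_intra, vis_intra)
        # Pool each stream over time before fusing (mean pooling as a simple choice).
        v, a = vis_cross.mean(dim=1), aud_cross.mean(dim=1)
        g = self.gate(torch.cat([v, a], dim=-1))       # elementwise gate in [0, 1]
        fused = torch.cat([g * v, (1.0 - g) * a], dim=-1)
        return self.out(fused)                          # (batch, dim) fused feature


# Example: 8 clips, 32 visual tokens and 48 audio tokens each, 256-dim embeddings.
fused = GatedCrossModalBlock()(torch.randn(8, 32, 256), torch.randn(8, 48, 256))
print(fused.shape)  # torch.Size([8, 256])
```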
Data and Methods. Our analysis utilized videos of elementary instruction from the Development of Ambitious Instruction study; for that study, 80 elementary teachers were observed teaching mathematics and reading six times each for about 45 minutes per lesson. Thus, we had access to 700 hours of video of instruction. For the analysis reported here, we used 50 hours of video from this dataset.
In this analysis, we developed 24 video annotation labels and 10 audio transcript labels. For example, the video labels included whole group instruction, small group instruction, individual student work, and transitions, while the audio labels included cognitive demand of tasks, teacher questioning, explanation and justification, feedback, and uptake. Our team viewed the 50 hours of video and assigned video and audio labels; 70% of the labeled videos and transcripts were used to train the transformer network to classify activities, and the remaining 30% were used to assess network accuracy.
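As a minimal illustration of the 70/30 evaluation split described above, the snippet below partitions labeled segments with scikit-learn; the segment identifiers and label values are placeholders, not the study's actual annotations.

```python
# Illustrative 70/30 split of labeled segments (scikit-learn); the segment IDs
# and label names below are placeholders, not the study's actual annotations.
from sklearn.model_selection import train_test_split

segments = [f"segment_{i:03d}" for i in range(500)]                    # hypothetical IDs
labels = ["whole_group", "small_group", "individual", "transition"] * 125

train_ids, test_ids, train_y, test_y = train_test_split(
    segments, labels, test_size=0.30, stratify=labels, random_state=42
)
print(len(train_ids), len(test_ids))  # 350 150
```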
Findings. We found that the transformer network was more accurate in classifying video label constructs (e.g., whole group instruction, small group instruction) and audio label constructs (e.g., questioning, feedback, uptake) than neural network models trained with only annotated video data or only labeled audio transcript data. We report the results in terms of F1 scores (see Tables 1 and 2).
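F1 is the harmonic mean of precision and recall, computed per label and then averaged; the toy example below shows a per-class and macro-averaged computation with scikit-learn (the label values are illustrative, not the study's results).

```python
# F1 = 2 * precision * recall / (precision + recall); toy example with scikit-learn.
from sklearn.metrics import f1_score

y_true = ["whole_group", "small_group", "whole_group", "transition", "small_group"]
y_pred = ["whole_group", "whole_group", "whole_group", "transition", "small_group"]

print(f1_score(y_true, y_pred, average=None,
               labels=["whole_group", "small_group", "transition"]))  # per-class F1
print(f1_score(y_true, y_pred, average="macro"))                      # macro-averaged F1
```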
Significance. This study evaluated multiple types of neural networks for automated classification of instructional activities in classroom videos. Results shed light on the accuracy of a multimodal transformer network and the possibility of using these methods at scale. Along with substantial gains in efficiency and reductions in cost, this work has the potential to yield new developments in teaching metrics and autonomous classroom simulations.