Paper Summary

Using Machine Learning to Score Tasks That Assess Three-Dimensional Science Learning

Sat, April 18, 10:35am to 12:05pm, Virtual Room


This study uses machine learning to score assessment tasks previously developed to measure three-dimensional science learning (TDSL; NGSS, 2013; NRC, 2012). The TDSL tasks are open-ended, constructed-response items designed to support teachers’ instruction. However, in working with teachers we have found that the tasks are time-consuming to score, which makes it difficult for teachers to use the results to inform instruction. We therefore employed a machine scoring approach that can generate scores rapidly (Lee et al., 2019). In this paper, we report how automated scoring performed on the multi-part assessment tasks measuring TDSL.
Theoretical Perspectives
Using evidence-centered design, we developed over 100 tasks targeting TDSL (Authors, 2019). Cognitive, instructional, and inferential validity were considered from the very beginning of item development (Author, 2018), but the items were not designed with automated scoring in mind. We therefore do not know whether automated scoring will compromise test validity, as automated scoring of assessments may pose unique challenges (Clauser, Kane, & Swanson, 2002). We adopt Kane’s (2004) position that only the inferences and assumptions in the interpretive arguments that are highly questionable deserve scrutiny and require extensive empirical support. Because validity evidence for the human-scored TDSL assessments has been reported earlier (Authors, 2018), computer scoring is likely the most tenuous factor affecting their validity. We thus focus validation on the automated scores, adopting an existing framework (Williamson, Xi, & Breyer, 2012).
Student responses to the TDSL assessment tasks were collected during a prior study (Authors, 2018). Because automated scoring was not considered when the tasks were developed, we treat this analysis as a proof of concept that the existing tasks can be scored automatically using the existing rubrics. The TDSL tasks were completed by more than 700 middle school students in science class and were dual-coded by trained experts with sufficient inter-rater reliability. We developed and implemented supervised machine learning algorithms, applied on an item-by-item basis, that use a text classification approach to assign a score to each response (Aggarwal & Zhai, 2012).
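To make the item-by-item text classification idea concrete, the following is a minimal sketch and not the study's actual pipeline. The summary does not name the algorithm or features used, so the multinomial naive Bayes model, the bag-of-words tokenizer, the `ItemScorer` class, and the four toy responses with their rubric scores are all assumptions made purely for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a response and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class ItemScorer:
    """Hypothetical per-item scorer: multinomial naive Bayes on word counts."""

    def fit(self, responses, scores):
        # Count how often each rubric score occurs and which words co-occur with it.
        self.class_counts = Counter(scores)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, score in zip(responses, scores):
            tokens = tokenize(text)
            self.word_counts[score].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total = sum(self.class_counts.values())
        best_score, best_logp = None, -math.inf
        for score in sorted(self.class_counts):
            n_words = sum(self.word_counts[score].values())
            logp = math.log(self.class_counts[score] / total)
            for t in tokens:
                # Laplace (add-one) smoothing avoids log(0) for unseen word/score pairs.
                count = self.word_counts[score][t] + 1
                logp += math.log(count / (n_words + len(self.vocab)))
            if logp > best_logp:
                best_score, best_logp = score, logp
        return best_score


# Toy, invented human-scored responses for ONE item (1 = meets rubric, 0 = does not).
train_responses = [
    "the particles move faster when heated so the gas expands",
    "heating adds energy and the particles spread apart",
    "the balloon gets bigger because it is hot",
    "i do not know",
]
train_scores = [1, 1, 0, 0]

# One scorer is trained per item, mirroring the item-by-item approach.
scorer = ItemScorer().fit(train_responses, train_scores)
print(scorer.predict("particles gain energy and move faster"))
```

In practice a separate model of this kind would be fit for each item on the human-scored responses and checked against held-out human scores before use.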
Preliminary Results
Because the assessment tasks are meant to be low-stakes, primarily to drive instructional decisions, we evaluated our results against three relevant criteria from Williamson and colleagues’ (2012) framework. We present results for one item here; the full paper will present results for additional items.
Human scoring process and score quality. Interrater reliability between human raters was weighted Cohen’s kappa = .95, indicating high agreement between raters.
Agreement of automated scores with human scores. Computer-human agreement was kappa = .80, with accuracy = .93. This kappa value exceeds the minimum threshold (.70).
Degradation from human-human score agreement. The “degradation” in kappa when moving from human-human scoring to automated scoring is .15 (.95 minus .80). This exceeds the maximum recommended degradation (.10).
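The arithmetic behind these three checks can be sketched as follows. This is an illustrative sketch only: the function names `cohen_kappa` and `check_criteria` are hypothetical, the kappa shown is the unweighted form (the human-human value above is a weighted kappa, which this sketch does not reproduce), and the thresholds are the .70 and .10 values quoted in the text.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical scores.

    Assumes the raters do not both use a single category for every response,
    since chance agreement would then be 1 and kappa would be undefined.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def check_criteria(kappa_hh, kappa_hm, min_kappa=0.70, max_degradation=0.10):
    """Compare human-machine kappa with the agreement and degradation thresholds."""
    degradation = round(kappa_hh - kappa_hm, 4)
    return {
        "degradation": degradation,
        "meets_agreement": kappa_hm >= min_kappa,
        "within_degradation": degradation <= max_degradation,
    }


# Values reported above for the example item.
result = check_criteria(kappa_hh=0.95, kappa_hm=0.80)
print(result)
```

With the reported values, the agreement criterion is met while the degradation criterion is not, which matches the interpretation given for this item.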
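The findings showcase the potential of applying machine learning to automate scoring of the existing TDSL assessment tasks. Although the degradation from human-human agreement exceeds the recommended maximum for this item, computer-human agreement meets the minimum threshold, providing further validity evidence that automated scoring is a feasible solution that might help teachers formatively assess their students.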