As observational teacher-quality tools increasingly become part of the early childhood high-stakes accountability landscape, it is important to consider the appropriateness of such measures for accountability. A growing body of empirical evidence has identified concerns about the reliability of observational tools such as the CLASS (e.g., Mashburn, 2017). Although such tools are valuable for reflective professional development (Early et al., 2017), evidence suggests that they may not be appropriate for evaluative purposes.
The goal of the current study is to evaluate the consistency of classroom quality observations using the Classroom Assessment Scoring System (CLASS; Pianta, La Paro, & Hamre, 2008). The study draws on two sets of observations conducted at different time points by separate groups of trained, certified-reliable observers in 56 charter preschool classrooms during the 2017-2018 school year. One set was collected for professional development purposes and the other for high-stakes accountability purposes (see Table 1 for information on observer groups). By assessing reliability between these two sets of observations, we are able to inform the larger conversation about how observational quality measures should be used.
Scores were compared to identify overall agreement using the plus-or-minus 1-point reliability metric, as well as classroom-level reliability, applying the 80% rater-agreement threshold to assess acceptable agreement between raters. On average, reliability between observers was 71%, ranging from 40% to 100% across classrooms and from 50% to 88% when aggregated at the school level. Just 52% of classrooms met the 80% reliability threshold, ranging from 0% to 100% within each of the ten participating schools (see Table 2 for additional school-level information).
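The plus-or-minus 1-point metric can be made concrete with a short sketch. The following is an illustrative example only, not the study's actual analysis code; the function name, the dimension count, and the ratings shown are hypothetical, assuming paired CLASS-style ratings on a 1-7 scale from two observer groups:

```python
def within_one_agreement(scores_a, scores_b):
    """Percent of paired ratings that differ by no more than 1 point.

    Implements the plus-or-minus 1-point agreement metric: a pair of
    ratings counts as agreeing when their absolute difference is <= 1.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be the same length")
    agree = sum(abs(a - b) <= 1 for a, b in zip(scores_a, scores_b))
    return 100.0 * agree / len(scores_a)

# Hypothetical ratings (1-7 scale) for one classroom from two observer groups
observer_group_1 = [6, 5, 3, 2, 5, 6, 4]
observer_group_2 = [5, 7, 3, 2, 6, 4, 4]

pct = within_one_agreement(observer_group_1, observer_group_2)  # ~71.4%
meets_threshold = pct >= 80.0  # False: below the 80% agreement threshold
```

A classroom is then judged against the 80% threshold by comparing its agreement percentage to 80, as in the last line above.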
The implications of these results are multifaceted. Because all observers achieved reliability through the same certification process, we cannot assume one group of observers was better than the other. The larger takeaway is the lack of agreement itself, not which set of scores was "correct." Variance within observers likely accounts for some degree of measurement error. The most obvious explanation for this lack of agreement is that the two sets of observations took place on different days at different times of the year, likely giving observers different experiences of the same classrooms. However, if these ratings are used to rank teachers and tier schools, it is imperative that we capture the most accurate representation of teacher quality. When one set of these ratings is used for accountability purposes, we cannot feel confident that the classroom scores are accurate or holistic representations of what quality looks like on a daily basis.
As a field we must better understand how these observations function over time. An important next step, for example, is learning more about the timing of observations and the number of observation cycles needed to obtain an accurate estimate of quality. Possible solutions include conducting multiple observation cycles, collecting multiple ratings over the course of the school year, using multiple raters for a single observation, or considering growth over the year rather than an average score.