Developing Validity Evidence for the Writing Rubric to Inform Teacher Educators

Tue, April 21, 10:35am to 12:05pm, Virtual Room

Abstract

Objectives & Theoretical Framework
Previous studies indicate that raters have a substantial impact on writing assessment outcomes (Janssen, Meier, & Trace, 2014). Researchers are concerned with the degree to which rater errors and systematic biases introduce construct-irrelevant variance into the interpretation of ratings, thereby threatening the evidence of reliability, validity, and fairness of any writing assessment (Attali, 2016; Trenary & Farrar, 2016). Historically, researchers focused on improving rating quality, assuming the problem resided in poorly trained raters (Attali, 2016; Wolfe, Mathews, & Vickers, 2010). Yet focusing on rater training alone may not be enough to ensure psychometrically sound writing assessments; characteristics of the rubric itself can also contribute to limitations in rater-mediated writing assessments.

The purpose of the present study is to present validity evidence for the Writing Rubric to Inform Teacher Educators (WRITE). Our rubric is unconventional in its design, and we hypothesize that the WRITE contributes to psychometrically sound rater-mediated writing assessment systems across raters and samples. We argue that the WRITE has strong construct validity. As teacher educators implemented the WRITE, we also examined its consequential validity for classroom use and for providing feedback to students, in this case teacher candidates.

Methods/Data Sources
Four teacher educators underwent rater training and calibration procedures to learn how to use the WRITE. The raters then independently scored the writing samples, with each sample scored by at least three raters, yielding 138 unique ratings. Using these ratings, we conducted psychometric analyses with the Many-Facet Rasch model (MFRM) to gather evidence of validity, reliability, and fairness.
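
To make the analytic approach concrete, the MFRM can be written in its standard three-facet rating scale form; the notation below is a sketch of that general formulation rather than the exact specification reported in the full paper. The log-odds of student n receiving a rating in category k rather than k−1 from rater j on element i is

\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \lambda_j - \delta_i - \tau_k,

where \theta_n is the student's achievement, \lambda_j is the rater's severity, \delta_i is the element's difficulty, and \tau_k is the threshold for category k. Estimating all facets on a common logit scale is what allows student achievement estimates to be adjusted for differences in rater severity.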

Results & Conclusions
Among the four raters, inter-rater reliability ranged from .833 to .953 across the six elements measured by the WRITE. The data displayed acceptable fit to the MFRM. Reliability of separation statistics were relatively high, indicating meaningful differences in student achievement (Rel = 0.94), rater severity (Rel = 0.86), and item difficulty (Rel = 0.91). Acceptable model-data fit indicates that estimates of student achievement were meaningfully adjusted to account for differences in rater severity, an important component of the fairness of the WRITE. In the full paper, we explore additional aspects of the psychometric quality of the WRITE, including analyses of individual student rating profiles, the quality of ratings associated with individual raters, and the degree to which the WRITE rating scale categories functioned as expected. We also consider the social consequences of use, particularly intended and unintended consequences for teacher candidates, as well as the consequences of engaging in the training and scoring processes for teacher educators.
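
For context on the separation statistics, reliability of separation in Rasch-based analyses is conventionally defined as the proportion of observed variance in a facet's estimates attributable to true differences among its elements; a general form, not the exact computation reported here, is

\mathrm{Rel} = \frac{SD_{\mathrm{observed}}^{2} - \mathrm{MSE}}{SD_{\mathrm{observed}}^{2}},

where MSE is the mean squared standard error of the estimates within the facet. Values approaching 1, such as those above, indicate that the model dependably distinguishes levels of achievement, severity, and difficulty.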

Scholarly Significance
With respect to construct validity, the WRITE produces scores that are psychometrically reliable, valid, and fair. With respect to consequential validity, raters reported that the WRITE presented more elements to consider during scoring than are typically present on writing rubrics and required more time to implement. Raters also found that they could provide more nuanced feedback, use specific portions of the rubric as they deemed appropriate, and produce fairer, more trustworthy scores across students. The WRITE is feedback-oriented rather than score-oriented, and this orientation produced positive effects on teacher candidates’ self-efficacy.
