Paper Summary

Some Notes on a Hierarchical Rater Model for Constructed Responses

Sat, April 14, 8:15 to 9:45am, Marriott Pinnacle, Floor: Fourth Level, Ambleside

Abstract

The use of constructed response (CR) items in large-scale assessments calls for an approach that differs from the one used for multiple choice (MC) items, because CR items must be scored by raters, whereas MC items can simply be machine-scored as right or wrong. Here we note that a hierarchical model, referred to previously as the HRM-SDT (DeCarlo, 2010; DeCarlo, Kim, & Johnson, 2011), addresses several issues associated with the use of CR items. The model consists of a latent class signal detection theory (SDT) model at the first level and an item response theory (IRT) model at the second level.

To start, note that the usual approach to constructed responses, which is to use an IRT model, treats rater scores as direct indicators of examinee ability. It follows that one can obtain more information about examinee ability simply by using more raters, rather than more items (see DeCarlo et al., 2011), which clearly should not be the case. This problem does not arise for the HRM-SDT, however, because the model recognizes that the use of raters introduces an additional layer. In particular, in the first level of the model – the rater model – raters are viewed as providing (fallible) information about the (latent) quality of a constructed response; the Level 1 model provides measures of rater precision and rater effects. An advantage of the HRM-SDT is that the latent class SDT model in the first level correctly handles various “rater effects” that have been documented in the literature. In the second level – the item model – the latent CR qualities are viewed as ordinal indicators of examinee ability; the Level 2 IRT model provides measures of item difficulty and discrimination.
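As a rough illustration of this two-level structure (the notation and the logistic forms below are schematic assumptions for exposition; the exact parameterization is given in DeCarlo, 2010, and DeCarlo et al., 2011), let \eta_j denote the latent ordinal quality of response j, with categories c = 1, ..., C, and let Y_{jr} denote the score assigned by rater r. A logistic version of the Level 1 latent class SDT rater model can be written as

P(Y_{jr} \le k \mid \eta_j = c) = F(b_{rk} - d_r\,c), \quad F(x) = 1/(1 + e^{-x}),

where d_r reflects rater r's precision (discrimination) and the criteria b_{rk} capture rater effects such as severity or leniency. At Level 2, the latent quality is related to examinee ability \theta through an ordinal IRT model, for example an adjacent-category (partial-credit-type) form,

\log\big[ P(\eta_j = c \mid \theta) / P(\eta_j = c-1 \mid \theta) \big] = a_i(\theta - b_{ic}),

with discrimination a_i and step difficulties b_{ic} for CR item i.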

Another problem arises from the structure of the data for CR items. In particular, large-scale assessments typically use more than one rater to score each CR item. As a result, if an examinee's essay is graded by three raters, for example, the three ratings will tend to be correlated because they are all based on the same essay. Ignoring this correlation can bias the standard errors of the examinee ability estimates. Because it recognizes the nested structure of the data, the HRM-SDT provides a statistical procedure that appropriately takes these correlations into account.
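To sketch why this matters (using the schematic notation introduced above), the R ratings of a response are conditionally independent given its latent quality, but not given ability alone; marginalizing over the latent quality yields

P(Y_{j1}, \ldots, Y_{jR} \mid \theta) = \sum_{c=1}^{C} P(\eta_j = c \mid \theta) \prod_{r=1}^{R} P(Y_{jr} \mid \eta_j = c),

which builds the within-response correlation into the likelihood. The usual IRT approach instead treats the R ratings as conditionally independent given \theta, replacing the sum with a simple product over raters and thereby ignoring their shared dependence on the same essay.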

The HRM-SDT can also be extended to handle MC items by including them as direct indicators of ability in the second level of the model. This approach answers questions about how to combine responses to the two types of items (Kim, 2009). The extended HRM-SDT also appears to give improved parameter estimates at both levels. Another advantage of including MC items in Level 2 is that the model is identified even when there is only one CR item, as is the case in some tests (e.g., the SAT). The approach also allows one to examine issues related to dimensionality. Some examples will be presented and discussed.
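As a schematic sketch of this extension (the two-parameter logistic form below is an assumption for illustration; other MC models could be used), an MC response X_{jm} enters Level 2 directly as an indicator of ability,

P(X_{jm} = 1 \mid \theta) = 1/(1 + \exp[-a_m(\theta - b_m)]),

so that an examinee's likelihood combines the CR part (raters nested within the latent response qualities) with the MC part, with both item types informing the same \theta or, when dimensionality is of interest, possibly different abilities.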

Authors