1. Objectives/Perspectives
The strong validity of performance assessments (PA) in measuring critical thinking (CT) in college students is well documented (Braun et al., 2020; Shavelson et al., 2019). Accordingly, most current commercially available tests of CT include a PA component (e.g., Halpern & Dunn, 2021), even though others have argued convincingly that PA tasks as measures of CT are not interchangeable: each PA “story” contains task-specific aspects unrelated to CT on which subjects vary (e.g., familiarity with the issue, strength of conviction related to the issue; see Shavelson et al., 2019). This PA-specific variance component is not problematic for CT measurement at the aggregate level when the different PA tasks are randomly assigned to members of the critical comparison groups, for example, the freshman and senior cohorts of a university. However, it prevents reliable measurement of change at the individual level, because memory effects preclude researchers from administering the same PA task again. Administering a different task, however, violates the assumption of tau-equivalence, a necessary condition for measuring change according to classical test theory. This dilemma is best solved using generalizability theory (GT), which introduces PA variance as a second true-score component, also called an additional “facet” (Shavelson & Webb, 1991).
2. Method/Data
The Bryn Bower Series (BBS) consists of four similarly structured PA tasks that are scored not with rating scales but by counting features identified in the essays the students write. The design of the BBS tasks minimizes sequence effects (Ebright-Jones, 2024), hence providing a unique opportunity to demonstrate the usefulness of GT as a framework for the measurement challenges associated with PA tasks. However, the BBS scoring produces only a single overall CT score, which precludes a standard GT analysis in which a set of items provides the basis for the analysis. In our paper, we demonstrate that the general idea of GT can be adapted to a single-score situation with longitudinal data in the context of structural equation modeling, provided some logically reasonable statistical assumptions hold (e.g., no interaction in the population between specific tasks and change over time).
3. Results/Scientific Significance
Using a three-wave longitudinal data set of 120 juniors and seniors completing three BBS tasks, we show that 10.1% of the score variance in the first wave is attributable to the task facet. Using GT variance algebra, we demonstrate that the dependability coefficient (a specific GT reliability coefficient) reaches an acceptable level for individual diagnostic use if participants complete two BBS tasks for a single-time estimate of their CT score. We will critically discuss the implications of our findings for the common practice of using single PA tasks for diagnostic purposes in personnel selection. Overall, the analysis supports the notion that the best use of PA tasks for the measurement of CT lies in the comparison of aggregate units (e.g., college cohorts, colleges).
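The GT variance algebra behind this result can be sketched as follows. In a persons-by-tasks (p × t) design, the dependability coefficient Φ divides the person variance by the person variance plus absolute error, where the task and residual components shrink as scores are averaged over more tasks. The variance components below are hypothetical placeholders chosen only so that the task facet accounts for roughly 10% of total variance, as in the abstract; they are not the BBS estimates.

```python
def dependability(var_person: float, var_task: float,
                  var_residual: float, n_tasks: int) -> float:
    """GT dependability coefficient (Phi) for a p x t design:
    Phi = sigma^2_p / (sigma^2_p + (sigma^2_t + sigma^2_pt,e) / n_tasks).
    Averaging over more tasks reduces the absolute error term."""
    absolute_error = (var_task + var_residual) / n_tasks
    return var_person / (var_person + absolute_error)

# Hypothetical components summing to 1.0, with the task facet at ~10%
# (illustrative only; the real BBS components are not reported here):
sp, st, se = 0.60, 0.10, 0.30
for n in (1, 2, 3):
    print(f"{n} task(s): Phi = {dependability(sp, st, se, n):.3f}")
```

With these placeholder values, Φ rises from 0.60 with one task to 0.75 with two, which mirrors the qualitative claim that two BBS tasks can lift dependability to an acceptable level for a single-time estimate.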