1. Objectives/Perspectives
The strong validity of performance assessments (PA) in measuring critical thinking (CT) in college students is well documented (Braun et al., 2020; Shavelson et al., 2019). Accordingly, most current commercially available tests of CT include a PA component (e.g., Halpern & Dunn, 2021), even though others have argued convincingly that PA tasks as measures of CT are not interchangeable: each PA “story” contains task-specific aspects unrelated to CT on which subjects vary (e.g., familiarity with the issue, strength of conviction related to the issue; see Shavelson et al., 2019). This PA-specific variance component is not problematic for CT measurement at the aggregate level when the different PA tasks are randomly assigned to members of the critical comparison groups, for example, the freshman and senior cohorts of a university. However, it prevents reliable measurement of change at the individual level, because memory effects preclude researchers from administering the same PA task again. Administering a different task, however, violates the assumption of tau-equivalence, a necessary condition for measuring change according to classical test theory. This dilemma is best solved using generalizability theory (GT), which introduces PA variance as a second true-score component, also called an additional “facet” (Shavelson & Webb, 1991).
2. Method/Data
The Bryn Bower Series (BBS) consists of four similarly structured PA tasks that are scored not with rating scales but by counting features identified in the essays the students write. The design of the BBS tasks minimizes sequence effects (Ebright-Jones, 2024), hence providing a unique opportunity to demonstrate the usefulness of GT as a framework for the measurement challenges associated with PA tasks. However, the BBS scoring produces only a single overall CT score, which precludes a standard GT analysis in which a set of items provides the basis for the analysis. In our paper, we demonstrate that the general idea of GT can be adapted to a single-score situation with longitudinal data in the context of structural equation modeling, provided some logically reasonable statistical assumptions hold (e.g., no interaction in the population between specific tasks and change over time).
3. Results/Scientific Significance
Using a three-wave longitudinal data set of 120 juniors and seniors completing three BBS tasks, we show that 10.1% of the score variance in the first wave is attributable to the task facet. Using GT variance algebra, we demonstrate that the dependability coefficient (a specific GT reliability coefficient) reaches an acceptable level for individual diagnostic use if participants complete two BBS tasks for a single-time estimate of their CT score. We will critically discuss the implications of our findings for the common practice of using single PA tasks for diagnostic purposes in personnel selection. Overall, the analysis supports the notion that the best use of PA tasks for the measurement of CT lies in the comparison of aggregate units (e.g., college cohorts, colleges).
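The GT variance algebra behind this result can be sketched as follows. In a persons-by-tasks (p × t) design, the dependability coefficient Φ divides the person variance by the person variance plus absolute error, where the task and residual components shrink as scores are averaged over more tasks. The variance components below are hypothetical placeholders chosen only so that the task facet accounts for roughly 10% of total variance, as in the abstract; they are not the BBS estimates.

```python
def dependability(var_person: float, var_task: float,
                  var_residual: float, n_tasks: int) -> float:
    """GT dependability coefficient (Phi) for a p x t design:
    Phi = sigma^2_p / (sigma^2_p + (sigma^2_t + sigma^2_pt,e) / n_tasks).
    Averaging over more tasks reduces the absolute error term."""
    absolute_error = (var_task + var_residual) / n_tasks
    return var_person / (var_person + absolute_error)

# Hypothetical components summing to 1.0, with the task facet at ~10%
# (illustrative only; the real BBS components are not reported here):
sp, st, se = 0.60, 0.10, 0.30
for n in (1, 2, 3):
    print(f"{n} task(s): Phi = {dependability(sp, st, se, n):.3f}")
```

With these placeholder values, Φ rises from 0.60 with one task to 0.75 with two, which mirrors the qualitative claim that two BBS tasks can lift dependability to an acceptable level for a single-time estimate.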