Using Multivariate G-Theory to Examine Rater and Source Effects in Cross-Country Critical Thinking Performance Assessment

Sun, April 12, 9:45 to 11:15am PDT, InterContinental Los Angeles Downtown, Floor: 7th Floor, Hollywood Ballroom I

Abstract

1. Objectives
This study investigated how different rater combinations (human-human, human-AI, human-human-AI) and evidence sources (video vs. transcript) affect the reliability and coherence of critical thinking (CT) assessment scores across Colombia and Switzerland. The research addresses a critical challenge in cross-national performance assessment: maintaining measurement quality when integrating human and AI ratings.

2. Theoretical Framework
The study employed the International Performance Assessment of Learning (iPAL) framework, which defines CT as a multifaceted construct encompassing conceptualizing, analyzing, synthesizing, evaluating, and applying information to solve problems (Braun et al., 2020). Participants completed a migration policy task requiring them to analyze documents of varying trustworthiness and write a policy recommendation while thinking aloud. The assessment evaluated six CT facets: Trustworthiness, Relevance, Consequences, Perspectives, Evaluate Content, and Contrast and Connect.

3. Methods
Participants completed a CT performance task involving a migration policy debate. Responses were segmented and rated across six CT facets: Trustworthiness, Relevance, Consequences, Perspectives, Evaluate Claims, and Contrast and Connect. Ratings were assigned on a 0–0.5–1 scale. We tested four rater combinations: human–human (video and transcript), human–AI with a consistent transcript source, human–AI with mixed sources (video-based human paired with transcript-based AI), and human–human–AI.
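
As a purely illustrative sketch of this rating design (not the authors' actual pipeline), the Python snippet below shows how segment-level 0/0.5/1 facet ratings from two raters might be organized and aggregated into person-by-rater facet scores for a crossed person × rater analysis; the data, column names, and mean aggregation rule are assumptions.

import pandas as pd

# Hypothetical segment-level ratings: each row is one segment scored 0, 0.5, or 1
# on one CT facet by one rater (layout and values are invented for illustration).
ratings = pd.DataFrame([
    {"person": "P01", "segment": 1, "facet": "Trustworthiness", "rater": "R1", "score": 1.0},
    {"person": "P01", "segment": 1, "facet": "Trustworthiness", "rater": "R2", "score": 0.5},
    {"person": "P01", "segment": 2, "facet": "Relevance",       "rater": "R1", "score": 0.0},
    {"person": "P01", "segment": 2, "facet": "Relevance",       "rater": "R2", "score": 0.5},
    {"person": "P02", "segment": 1, "facet": "Trustworthiness", "rater": "R1", "score": 0.5},
    {"person": "P02", "segment": 1, "facet": "Trustworthiness", "rater": "R2", "score": 1.0},
])

# Aggregate to one score per person, rater, and facet (here: mean over segments),
# the person x rater layout a crossed G-study typically requires.
facet_scores = (ratings
                .groupby(["facet", "person", "rater"])["score"]
                .mean()
                .unstack("rater"))
print(facet_scores)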

4. Data Sources or Materials
The sample comprised 17 undergraduate teacher education students (10 from Colombia, 7 from Switzerland) who completed think-aloud protocols during task completion. Three rating approaches were employed: (1) R1, a human rater scoring video recordings; (2) R2, a human rater scoring transcripts; and (3) AI, an automated scorer applied to transcripts. The AI system was a DistilBERT multi-label classification model, chosen for its suitability to small datasets and fine-tuned on 1,101 text segments. Multivariate generalizability theory was used to analyze four rater combinations, examining person, rater, and person × rater variance components across the six CT facets.
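
The summary names DistilBERT but gives no implementation details, so the following is only a minimal sketch of how a six-facet multi-label classifier could be set up with the Hugging Face transformers library. The checkpoint (distilbert-base-multilingual-cased), the 0.5 decision threshold, and the scoring function are assumptions, not the authors' configuration, and the classification head remains untrained until fine-tuned on rated segments.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

FACETS = ["Trustworthiness", "Relevance", "Consequences",
          "Perspectives", "Evaluate Claims", "Contrast and Connect"]

# Assumed checkpoint; the paper only states that a fine-tuned DistilBERT was used.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-multilingual-cased",
    num_labels=len(FACETS),
    problem_type="multi_label_classification",  # BCE-with-logits loss, one logit per facet
)

def score_segment(text, threshold=0.5):
    """Return per-facet probabilities and 0/1 presence decisions for one text segment."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits).squeeze(0)
    return {facet: {"prob": float(p), "present": int(p >= threshold)}
            for facet, p in zip(FACETS, probs)}

print(score_segment("The second document cites no sources, so I would not rely on it."))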


5. Results
Four key findings emerged. First, consistency of evidence source was more critical than rater type for measurement quality. Models using transcript sources (R2+AI) achieved the highest dependability coefficients (Φ = 0.84 in Colombia, Φ = 0.90 in Switzerland), while mixed-source combinations produced the lowest reliability (R1+AI: Φ = 0.23 in Colombia, Φ = 0.59 in Switzerland).
Second, transcript-based raters consistently identified more CT evidence than video-based raters across all facets and countries, suggesting that textual analysis provides richer access to student reasoning processes.
Third, substantial cross-country differences emerged, with Swiss students demonstrating higher person variances and more reliable ratings than Colombian students across most models.
Fourth, mixed evidence sources (video and transcripts) distorted the construct structure, producing out-of-range correlations and negative relationships between theoretically related facets, while consistent sources maintained a coherent multidimensional CT structure.
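
To make the reported dependability coefficients concrete, the sketch below estimates person, rater, and person × rater variance components for a single facet in a fully crossed persons × raters design and computes Φ from the standard expected-mean-square formulas. This is a simplified univariate illustration of the quantities discussed above, not the study's multivariate analysis, and the example scores are invented.

import numpy as np

def g_study_phi(scores, n_raters_decision=None):
    """Variance components and dependability (Phi) for a crossed persons x raters
    design with one observation per cell (single facet)."""
    n_p, n_r = scores.shape
    n_r_d = n_r if n_raters_decision is None else n_raters_decision

    grand = scores.mean()
    person_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    # Two-way ANOVA mean squares
    ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
    resid = scores - person_means[:, None] - rater_means[None, :] + grand
    ms_pr = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))

    # Expected-mean-square estimators of the variance components
    var_pr = ms_pr                          # person x rater interaction (plus error)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)  # rater main effect
    var_p = max((ms_p - ms_pr) / n_r, 0.0)  # person (universe-score) variance

    # Absolute-decision error variance and dependability coefficient
    abs_error = (var_r + var_pr) / n_r_d
    phi = var_p / (var_p + abs_error)
    return {"var_p": var_p, "var_r": var_r, "var_pr": var_pr, "phi": phi}

# Invented example: 5 persons x 2 raters, facet scores summed over segments
scores = np.array([[2.0, 2.5], [4.0, 3.5], [1.0, 1.5], [3.0, 3.0], [2.5, 3.5]])
print(g_study_phi(scores))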

6. Significance
The findings challenge assumptions about human-versus-machine scoring by demonstrating that human-AI combinations with aligned transcript evidence sources can outperform human-human combinations with mixed sources. The study offers practical guidance for designing scalable, culturally responsive assessments: prioritize consistent evidence sources and validate scoring models within each cultural context. These findings have important implications for international assessments seeking to balance measurement rigor with implementation efficiency while maintaining validity and reliability across diverse uses.
