Paper Summary

Measuring Students’ Use of AI in Short Answer Questions in a Course on Intellectual Virtues

Sat, April 11, 3:45 to 5:15pm PDT, Westin Bonaventure, Floor: Lobby Level, San Gabriel C

Abstract

While the popular press and instructors' anecdotal evidence suggest that using generative artificial intelligence (AI) to cheat is rampant, there is little empirical work on the extent to which students use AI to offload their cognition and violate academic integrity. One challenge in studying generative AI usage is identification: AI detectors are known to be unreliable and inaccurate (Weber-Wulff et al., 2023). Reliably and accurately measuring the extent of students' generative AI use requires a multifaceted approach that accounts for the large variance in potential responses produced by different models, prompts, and so on.

To gauge the degree to which students use AI to do their work for them, and to test detection methods, we recruited undergraduate students from three general education Philosophy courses to complete an online Canvas course on intellectual virtues; students were not explicitly informed that we were studying their potential AI usage until after completing the study. For this analysis, we focus on students' responses to seven short answer quizzes, comprising 507 responses from 79 students. Our mixed-methods analysis combines content red flags, formatting red flags, pasted text, completion time, and qualitative coding of content to identify problematic AI usage in the short answer responses. Although the analysis is ongoing, here we describe three methods used to identify problematic AI usage (i.e., having generative AI tools directly answer the question).
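Completion time and pasted text are listed as signals above but not detailed in this summary. As a rough sketch of what a completion-time screen might look like, the snippet below flags responses completed unusually quickly relative to other responses to the same question; the column names, input file, and z-score cutoff are hypothetical assumptions for illustration, not values from the study.

```python
import csv
import statistics

# Minimal sketch of completion-time screening. The column names
# ("student_id", "question_id", "seconds_on_page") and the z-score cutoff are
# hypothetical assumptions, not the study's actual fields or thresholds.
def flag_fast_completions(rows, z_cutoff=-1.5):
    """Flag responses completed unusually quickly relative to that question's mean time."""
    by_question = {}
    for row in rows:
        by_question.setdefault(row["question_id"], []).append(row)

    flagged = []
    for question_id, group in by_question.items():
        times = [float(r["seconds_on_page"]) for r in group]
        mean = statistics.mean(times)
        stdev = statistics.pstdev(times) or 1.0  # avoid division by zero
        for r in group:
            z = (float(r["seconds_on_page"]) - mean) / stdev
            if z < z_cutoff:
                flagged.append((r["student_id"], question_id, round(z, 2)))
    return flagged

if __name__ == "__main__":
    with open("short_answer_responses.csv", newline="") as f:  # hypothetical export
        print(flag_fast_completions(list(csv.DictReader(f))))
```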

Content red flags provide high-certainty identification of AI usage through clear structural indicators. In responses to four of the questions, we looked for flags indicating that the student had copied the question text and pasted it into a generative AI chatbot, finding red flags in 18 responses. We also looked for formatting red flags in the raw HTML of responses, such as HTML tags that appear when text is copied from generative AI platforms; 9 of the 507 responses carried this flag. Altogether, content and formatting red flags were identified in 27 responses submitted by 18 participants (23% of students). Additionally, two coders manually coded 15% of responses (randomly sampling 11 per question). The coders independently coded this subset as either having sufficient evidence of AI usage or not, then met to discuss discrepancies and reached consensus on every response. Of the 77 responses reviewed, 27 (23%) were identified as having a high likelihood of generative AI usage.
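As an illustration of how a formatting red flag of this kind could be checked mechanically, the sketch below scans a response's raw HTML for markup that rich-text paste often leaves behind. The specific patterns are assumptions made for illustration, not the study's actual flag list.

```python
import re

# Markup sometimes left behind when rich text is pasted from a web chat
# interface into the Canvas editor. This pattern list is an illustrative
# assumption, not the study's actual set of formatting red flags.
PASTE_ARTIFACT_PATTERNS = [
    r'data-[a-z-]+="[^"]*"',        # framework-specific data-* attributes
    r'<span style="[^"]*"',         # inline-styled spans from rich-text copy
    r'<(?:ol|ul)>\s*<li>\s*<p>',    # nested list/paragraph structure from rendered markdown
    r'class="[^"]*markdown[^"]*"',
]

def has_formatting_red_flag(raw_html: str) -> bool:
    """Return True if a response's raw HTML matches any paste-artifact pattern."""
    return any(re.search(p, raw_html, flags=re.IGNORECASE) for p in PASTE_ARTIFACT_PATTERNS)

# Hypothetical usage on one response body:
sample = '<p><span style="color: rgb(13, 13, 13)">Intellectual humility means...</span></p>'
print(has_formatting_red_flag(sample))  # True
```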

Combining both analyses, we find that 28 out of 77 (36%) students had an AI chatbot answer at least one question; we note, however, that this is an underestimate given that only 15% of the data were manually coded. This is especially concerning given that the course is focused on intellectual virtues.
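The combination step amounts to a per-student union of the two flag sets: a student counts once if any of their responses was flagged by either method. A toy sketch with made-up identifiers:

```python
# Toy illustration of combining the two analyses; the identifiers are made up,
# not the study's data.
red_flagged = {("s01", "q1"), ("s02", "q3"), ("s02", "q5")}  # content/formatting red flags
manually_coded_ai = {("s02", "q3"), ("s07", "q2")}           # manual coding

flagged_students = {student for student, _ in red_flagged | manually_coded_ai}
print(f"{len(flagged_students)} students had at least one flagged response")  # 3 students
```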
