Since the introduction of ChatGPT, large language models have drawn significant public attention. A recent survey showed that students have been using these tools not only for learning but also to generate essays and to game assessments, posing serious threats to test security and academic integrity. While automated detectors have shown great potential for detecting AI-generated text, their fairness and robustness remain critical but underexplored topics. Findings from a small-scale study (Liang et al., 2023) indicated that more than half of the essays written by non-native English speakers (N = 91) were misclassified by existing GPT detectors as AI generated, while those written by native English speakers (N = 88) were mostly identified accurately. This raised concerns about bias against non-native English speakers when automated detectors are used to identify AI-generated essays, especially in high-stakes assessments.
Our study aims to replicate and expand on the previous study by systematically investigating the fairness and robustness of different approaches to detecting AI-generated essays using a large set of data from educational assessments. We developed and evaluated automated detectors of ChatGPT-generated essays in a large-scale writing assessment, and explored potential bias in the detectors by comparing their performance on essays written by native and non-native English speakers. Specifically, we used OpenAI's gpt-3.5-turbo large language model to produce 10,000 AI-generated essays in response to 50 prompts from a large-scale writing assessment. In addition, we randomly sampled human-written essays from 111,375 operational responses to the same prompts that were submitted by test takers between August 2022 and the release of ChatGPT on November 30, 2022. We then generated NLP features and fed the features into standard classifiers to train machine-learned detectors of AI-generated essays.
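The sketch below illustrates the general shape of such a pipeline: prompting gpt-3.5-turbo for essays, extracting text features, and fitting a standard classifier. The prompt wording, the surface features, and the logistic regression model shown here are illustrative assumptions only; the study's actual feature set and classifiers are not specified in this abstract.

```python
# Minimal sketch of an AI-essay detection pipeline (assumed details, not the
# authors' actual prompts, features, or model).
from openai import OpenAI                      # assumes the openai>=1.0 Python client
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_essay(prompt_text: str) -> str:
    """Ask gpt-3.5-turbo to write an essay in response to a writing prompt."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Write an essay responding to this prompt:\n{prompt_text}"}],
    )
    return response.choices[0].message.content


def surface_features(essay: str) -> list[float]:
    """Toy surface features (length, word length, lexical diversity)."""
    words = essay.split()
    sentences = [s for s in essay.split(".") if s.strip()]
    return [
        len(words),                                                # essay length in words
        sum(len(w) for w in words) / max(len(words), 1),           # mean word length
        len({w.lower() for w in words}) / max(len(words), 1),      # type-token ratio
        len(words) / max(len(sentences), 1),                       # mean sentence length
    ]


def train_detector(ai_essays: list[str], human_essays: list[str]) -> LogisticRegression:
    """Fit a binary classifier: 1 = AI-generated, 0 = human-written."""
    X = [surface_features(e) for e in ai_essays + human_essays]
    y = [1] * len(ai_essays) + [0] * len(human_essays)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
    return clf
```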
Our findings showed that NLP features could predict AI-generated essays with accuracies at or above approximately 99%. Comparison of detector performance showed slightly higher false positive rates for native samples than for non-native samples. In other words, the detectors were consistently more likely to misclassify essays written by native speakers as AI generated than those written by non-native speakers, although the difference was small. For example, when applying the detector to human-written essays that were set aside from model training and evaluation, 0.12% of the essays written by native English speakers were misclassified as AI generated, whereas only 0.006% of the non-native samples were.
These results countered the claims made by Liang et al. (2023) that most ChatGPT detectors are biased against non-native English speakers. Our findings showed that our detectors were more likely to flag essays written by native English speakers as AI generated, possibly because native writers' essays had characteristics that resembled AI-generated essays (e.g., fewer grammatical errors). To our knowledge, this is the first study to systematically examine the bias of ChatGPT detectors using large corpora of essays from a large-scale assessment. The findings call for keeping human experts in the loop to review and evaluate flagged cases in order to address potential bias.
Liang, W., et al. (2023). GPT detectors are biased against non-native English writers. Patterns, 4(7), 100779. https://doi.org/10.1016/j.patter.2023.100779