Paper Summary

How to Fool an AI That Detects AI-Generated Content: Evaluating the Robustness of Nonauthentic Response Detection With Adversarial Examples

Sat, April 13, 11:25am to 12:55pm, Philadelphia Marriott Downtown, Floor: Level 5, Salon J

Abstract

Large language models (LLMs) such as GPT-4 and Llama 2 can produce lengthy texts that are syntactically accurate and semantically coherent. This capability can have a substantial impact on the design and administration of learning and assessment tasks that elicit open-ended responses: learners and test takers can use an LLM to generate a text and submit it with little effort of their own, which is particularly concerning in high-stakes assessment contexts. One way to address this concern is to develop and deploy an automated detector that takes a written response as input and yields a classification decision indicating whether the response was generated by an LLM. Multiple detectors of this nature, which themselves often rely on LLMs in one way or another, are available with varying levels of performance (e.g., Liu et al., 2023; Tian et al., 2023). However, the inner workings of such detectors are seldom examined. In fact, LLM-based text classification may perform well simply by picking up spurious signals (Niven & Kao, 2019) and may be vulnerable to small changes in the input text (see, e.g., Wang et al., 2021). It is thus reasonable to question the basis on which such detectors make classification decisions.
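
To make the setup concrete, the following is a minimal sketch of such a detector's interface, assuming a binary text-classification model fine-tuned to distinguish human-written from LLM-generated responses; the checkpoint path and the "LLM" label name are hypothetical, not those of any specific detector discussed in the talk:

# Minimal sketch of an LLM-based detector interface (illustrative only).
# Assumes a fine-tuned binary classifier; the checkpoint path is hypothetical.
from transformers import pipeline

classifier = pipeline("text-classification", model="path/to/finetuned-detector")

def detector(text: str) -> float:
    """Return the detector's estimated probability that `text` is LLM-generated."""
    result = classifier(text)[0]
    # Assumes the fine-tuned model labels machine-generated text as "LLM".
    return result["score"] if result["label"] == "LLM" else 1.0 - result["score"]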

In this talk, we will present a case study that addresses this question by probing an LLM-based automated detector in depth, using data from a high-stakes writing assessment. Specifically, we will examine the distribution of tokens in the training and test samples to explore potentially spurious signals in the data used to train and evaluate the detector, and we will quantify how much the detector's classification decision depends on individual words in the input text. We will then introduce a set of adversarial examples generated by replacing highly influential words with appropriate synonyms that preserve local and global semantic coherence, and we will evaluate the detector's performance on these adversarial examples (a simplified sketch of this kind of procedure follows this paragraph). Preliminary findings indicate that the detector classifies the adversarial examples at roughly chance level, far below its performance on the original test set. We will conclude with substantive and methodological implications of these findings for researchers interested in developing and/or deploying automated detectors of AI-generated responses: automated detectors may not be as effective as their test-set performance suggests and should be evaluated thoroughly before deployment to ensure robust and meaningful classification decisions that benefit users.
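
The sketch below illustrates one simple version of this procedure, not the exact method used in the study: word-level importance is estimated by occlusion (removing one word at a time and measuring the change in the detector's score), and the most influential words are then swapped for WordNet synonyms. It assumes the `detector` function sketched above and requires the NLTK WordNet data (nltk.download("wordnet")).

# Illustrative sketch (not the study's exact procedure): occlusion-based word
# importance followed by WordNet synonym substitution.
from nltk.corpus import wordnet as wn

def word_importance(detector, words):
    """Score each word by how much removing it lowers P(LLM-generated)."""
    base = detector(" ".join(words))
    scores = []
    for i in range(len(words)):
        ablated = " ".join(words[:i] + words[i + 1:])
        scores.append(base - detector(ablated))
    return scores

def synonym(word):
    """Return a WordNet synonym that differs from the original word, if any."""
    for syn in wn.synsets(word):
        for lemma in syn.lemmas():
            candidate = lemma.name().replace("_", " ")
            if candidate.lower() != word.lower():
                return candidate
    return None

def adversarial_rewrite(detector, text, k=5):
    """Replace the k most influential words with synonyms to probe robustness."""
    words = text.split()
    scores = word_importance(detector, words)
    top = sorted(range(len(words)), key=lambda i: scores[i], reverse=True)[:k]
    for i in top:
        swap = synonym(words[i])
        if swap is not None:
            words[i] = swap
    return " ".join(words)

In practice, candidate replacements would also be screened for semantic coherence (for example, by checking part of speech and sentence-level fluency) so that the rewritten response still reads naturally to a human rater.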

Liu, Z., Yao, Z., Li, F., & Luo, B. (2023). Check me if you can: Detecting ChatGPT-generated academic writing using CheckGPT. arXiv:2306.05524.

Niven, T., & Kao, H.-Y. (2019). Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Tian, Y., Chen, H., Wang, X., Bai, Z., Zhang, Q., Li, R., Xu, C., & Wang, Y. (2023). Multiscale positive-unlabeled detection of AI-generated texts. arXiv:2305.18149.

Wang, W., Wang, R., Wang, L., Wang, Z., & Ye, A. (2021). Towards a robust deep neural network in texts: A survey. arXiv:1902.07285.
