Text generators based on artificial intelligence (AI), such as ChatGPT, pose new challenges for scientific journals. Many journals require disclosure of the use of ChatGPT in writing a manuscript, and they often prohibit naming ChatGPT as an author. Such AI text generators frequently make up information that is not true, which can compromise the integrity of the scientific record if they are used without proper care. Methods for detecting whether AI has been used in preparing a scientific paper are, thus, important. Existing tools for this can have drawbacks, such as a bias against writers who are not native English speakers.
Heather Desaire, University of Kansas, Lawrence, USA, and colleagues have developed an AI detector that was specifically tested on articles from chemistry journals. The team aimed to identify text generated by both the GPT-3.5 and GPT-4 versions of ChatGPT, as well as text created using prompts designed to hide the use of AI.
Creating a Classification Model
The researchers built their training set from ten chemistry journals, taking the introduction sections of ten articles per journal for a total of 100 samples of human writing. For each of these samples, they generated two different “AI versions” using different prompts: one based on only the title and one based on the abstract of the corresponding paper.
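The exact prompts are described in the paper; as a rough illustration of how such “AI versions” could be produced, a minimal sketch using the OpenAI Python client is shown below. The prompt wording, model name, and function name are assumptions for illustration, not the settings used by the authors.

```python
# Illustrative sketch only: the prompt wording and model name are assumptions,
# not the exact settings used in the study.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def generate_ai_introduction(title: str, abstract: str | None = None) -> str:
    """Ask ChatGPT to write an introduction from a title alone or from an abstract."""
    if abstract is None:
        prompt = f"Write the introduction section of a chemistry paper titled: {title}"
    else:
        prompt = f"Write the introduction section of a chemistry paper with this abstract: {abstract}"
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```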
For each paragraph in the resulting writing samples, the team determined 20 features, such as the complexity of the text, the variability of sentence lengths, the use of punctuation marks, and the frequencies of specific words favored by human writers or by ChatGPT, respectively. This data was then used to train an XGBoost model (XGBoost is an open-source gradient-boosting machine-learning library), which can classify writing samples outside of the training set.
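The paper’s exact feature set is not reproduced here; the following sketch only illustrates the general approach of computing a few per-paragraph stylistic features and training an XGBoost classifier on them. The feature definitions, word lists, label encoding, and hyperparameters are illustrative assumptions.

```python
# Sketch of the general approach: compute per-paragraph stylistic features and
# train a gradient-boosted classifier. The specific features shown here are
# illustrative; the paper uses 20 hand-selected features.
import re
import numpy as np
from xgboost import XGBClassifier

def paragraph_features(paragraph: str) -> list[float]:
    sentences = [s for s in re.split(r"[.!?]+", paragraph) if s.strip()]
    words = paragraph.split()
    sentence_lengths = [len(s.split()) for s in sentences] or [0]
    return [
        len(words) / max(len(sentences), 1),          # mean sentence length
        float(np.std(sentence_lengths)),              # sentence-length variability
        paragraph.count(";") + paragraph.count(":"),  # punctuation usage
        paragraph.count("("),
        sum(w.lower() in {"however", "although", "but"} for w in words),  # example "human-favored" words
        sum(w.lower() in {"others", "overall"} for w in words),           # example "ChatGPT-favored" words
    ]

# paragraphs: list of strings; labels: 1 = human-written, 0 = AI-generated (assumed encoding)
def train_detector(paragraphs: list[str], labels: list[int]) -> XGBClassifier:
    X = np.array([paragraph_features(p) for p in paragraphs])
    y = np.array(labels)
    model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
    model.fit(X, y)
    return model
```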
Testing the Model’s Accuracy
The team then tested the model, first on articles from a different issue of the same journals used in the training set. It correctly classified 94 % of the human-written text, as well as 98 % of the AI-generated text based on abstracts and 100 % of the AI-generated text based on titles. A comparison with other leading AI detectors showed that the new model performed much better at identifying the AI-generated texts in the team’s dataset.
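Such a held-out evaluation could be carried out along the following lines, classifying paragraphs from unseen articles and reporting accuracy separately for each text source. The function and variable names are hypothetical, and paragraph_features refers to the sketch above.

```python
# Hypothetical evaluation helper: reports the fraction of correctly classified
# paragraphs for each text source (e.g., human, AI-from-abstract, AI-from-title)
# on articles that were not part of the training set.
import numpy as np

def per_source_accuracy(model, paragraphs, labels, sources):
    X = np.array([paragraph_features(p) for p in paragraphs])
    predictions = model.predict(X)
    for source in sorted(set(sources)):
        idx = [i for i, s in enumerate(sources) if s == source]
        accuracy = np.mean([predictions[i] == labels[i] for i in idx])
        print(f"{source}: {accuracy:.0%} classified correctly")
```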
The researchers then expanded the types of articles to chemistry papers from other journals and publishers and included newspaper articles as a comparison. They also used new types of prompts designed to hide the use of AI in the generated texts, e.g., by instructing ChatGPT to write like a chemist, to use technical language, or to write in a way that would fool an AI detector. The chemistry articles were correctly classified between 92 % and 98 % of the time, even with the new prompts, while the human-written newspaper articles were misclassified in most cases. According to the team, this shows that the detector is effective at what it was developed for, i.e., academic scientific writing. Applying it to other types of text would require re-engineering the feature set and training a new model.
- Accurately detecting AI text when ChatGPT is told to write like a chemist,
Heather Desaire, Aleesa E. Chua, Min-Gyu Kim, David Hua,
Cell Rep. Phys. Sci. 2023.
https://doi.org/10.1016/j.xcrp.2023.101672
Also of Interest
- Collection: Opinions and Tips on AI & Chemistry—Chemistry Advent Calendar,
ChemistryViews 2023.