Academic publishers are selling access to research papers to technology companies to train artificial intelligence (AI) models. Some researchers have reacted with dismay to such deals, which are struck without consulting the authors. The trend raises questions about the use of published, and sometimes copyrighted, work to train the growing number of AI chatbots in development.

Experts say that any research paper that hasn't yet been used to train a large language model (LLM) is likely to be used soon. Researchers are exploring technical options that would let authors determine whether their content is being used.

Last month it was announced that the UK academic publisher Taylor & Francis, based in Milton Park, had signed a $10-million deal with Microsoft, allowing the US technology company to access the publisher's data to improve its AI systems. And in June, an investor update showed that US publisher Wiley had earned $23 million by allowing an unnamed company to train generative AI models on its content.

Anything available online — whether in an open-access repository or not — has “quite likely” already been fed into a large language model, says Lucy Lu Wang, an AI researcher at the University of Washington in Seattle. “And if a paper has already been used as training data in a model, there is no way to remove that paper after training the model,” she adds.

Massive data sets

LLMs are trained on huge volumes of data, often scraped from the Internet. From the billions of snippets of language in the training data, known as tokens, they identify patterns that enable them to generate text with remarkable fluency.

Generative AI models rely on ingesting patterns from these masses of data to output text, images or computer code. Scientific papers are valuable to LLM developers because of their length and “high information density,” says Stefan Baack, who analyzes AI training data sets at the Mozilla Foundation in San Francisco, California.

The appetite for purchasing high-quality data sets is growing. This year, the Financial Times offered its content to ChatGPT developer OpenAI in a lucrative deal, as did the online forum Reddit to Google. And because academic publishers are likely to view the alternative as the unauthorized scraping of their work, “I think there will be more deals like this to come,” Wang says.

Secret data sets

Some AI developers, such as the Large-scale Artificial Intelligence Open Network (LAION), intentionally keep their data sets open, but many companies developing generative AI models have kept much of their training data secret, Baack says. “We have no idea what’s in it,” he says. Open-access repositories such as arXiv and the scholarly database PubMed are thought to be “very popular” sources, and although paywalled journal articles sit behind subscriptions, their free-to-read abstracts are likely to be scraped by major technology companies. “They are always on the hunt for this kind of information,” he adds.

It's difficult to prove that an LLM has used a particular paper, says Yves-Alexandre de Montjoye, a computer scientist at Imperial College London. One option is to prompt the model with an unusual sentence from the text and see whether its output matches the next words in the original. If it does, that is a good sign that the paper is in the training set. If it doesn't, that doesn't mean the paper wasn't used, not least because developers can program the LLM to filter its responses so that they don't match the training data too closely. “It takes a lot to make this work,” he says.
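To make that kind of probing concrete, here is a minimal sketch of a prefix-continuation check. It is an illustration only, not the researchers' own tooling: the model name and the example sentence are invented placeholders, and it assumes an open-weight model loaded through the Hugging Face transformers library standing in for the system under test.

```python
# Minimal sketch of the prefix-continuation probe described above.
# Assumptions (not from the article): "gpt2" is a stand-in model and the
# sentence is invented; any open-weight causal language model loaded via
# Hugging Face's `transformers` library would work the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model under test
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# An unusual sentence from the paper, split into a prefix and its true ending.
prefix = "The corvids in our trial cached the anodized tokens under"
true_ending = "the third perch within ninety seconds"

inputs = tokenizer(prefix, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=12,
    do_sample=False,  # greedy decoding: the model's single most likely continuation
)
continuation = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# A near-verbatim match hints that the passage was memorized; a mismatch
# proves nothing, as the article notes, because outputs may be filtered.
print("Model continuation:", continuation.strip())
print("Matches original ending:", continuation.strip().startswith(true_ending))
```

Greedy decoding is used here so that a verbatim match reflects the model's top-ranked continuation rather than sampling luck.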

Another method of checking whether a text is in a training data set is a membership-inference attack. It relies on the idea that a model is more confident about its output when it is shown something it has seen before. De Montjoye's team has developed a version of this, called a copyright trap, for LLMs.

To set the trap, the team generates plausible but nonsensical sentences and hides them in a work, for example as white text on a white background or in a field displayed with zero width on a web page. If an LLM is more “surprised” by an unused control sentence than by the one hidden in the text (surprise here is a measure of how unexpected the model finds a passage), “that’s statistical evidence that the traps have been seen before,” he says.
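The comparison at the heart of such a check can be sketched as follows, again purely as an illustration: the model name and the trap and control sentences are invented placeholders, it assumes an open-weight model loaded via Hugging Face's transformers library, and it uses per-sentence perplexity as the “surprise” score.

```python
# Minimal sketch of the perplexity comparison behind a copyright-trap /
# membership-inference check. Assumptions (not from the article): "gpt2"
# is a stand-in model, and all sentences below are invented examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(sentence: str) -> float:
    """Lower perplexity = the model is less 'surprised' by the sentence."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # The model computes the mean negative log-likelihood per token.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

trap = "The lavender gearbox negotiated seventeen polite umbrellas at dawn."
controls = [
    "A turquoise ladder apologized to the arithmetic of Tuesday's fog.",
    "Nine velvet compasses rehearsed the grammar of a sleeping canal.",
]

trap_ppl = perplexity(trap)
control_ppls = [perplexity(s) for s in controls]

# If the trap sentence is consistently less surprising than comparable unseen
# control sentences, that is statistical evidence it appeared in the training data.
print(f"trap: {trap_ppl:.1f}, controls: {[round(p, 1) for p in control_ppls]}")
```

In practice, researchers repeat this over many traps and controls and apply a statistical test, rather than relying on a single comparison.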

Copyright issues

Even if it were possible to prove that an LLM had been trained on a particular text, it is not clear what happens next. Publishers argue that using copyrighted text in training without a license constitutes infringement. But a legal counterargument holds that LLMs do not copy anything: they extract information from the training data, crunch it, and use what they have learned to generate new text.

A court case might help to clarify this. In an ongoing US copyright case that could prove groundbreaking, The New York Times is suing Microsoft and ChatGPT's developer, OpenAI, which is based in San Francisco, California. The newspaper accuses the companies of using its journalistic content to train their models without permission.

Many academics are happy to have their work included in LLM training data, particularly if it makes the models more accurate. “Personally, I don’t mind if a chatbot writes in my style,” says Baack. But he concedes that his livelihood is not threatened by the output of LLMs in the way that those of other professionals, such as artists and writers, are.

Individual academic authors currently have little leverage when the publisher of their paper sells access to their copyrighted work. And for publicly available articles, there is no established way to assign credit or to know whether a text has been used.

Some researchers, including de Montjoye, are frustrated. “We want LLMs, but we still want something that’s fair, and I don’t think we’ve invented what that looks like yet,” he says.