Legal cases question IP in large language model training

Should the providers of commercial large language models license content from content creators? The New York Times and Getty Images think so

Cliff Saran, Managing Editor

Published: 16 Jan 2024 12:30

A recent warning from OpenAI about the potential ramifications of a stringent copyright crackdown on artificial intelligence (AI) development has sparked a complex legal debate about the balance between AI advancement and intellectual property (IP) rights.

At the heart of the legal debate is whether businesses that make money from licensing or selling web content should be compensated when a large language model (LLM) uses this content for training. Content creators have told the courts that their business models are being undermined, and that an LLM trained on their IP could be used to generate content that would be hard to distinguish from that produced by the IP owner.

A lawsuit filed on 27 December 2023 by The New York Times claims Microsoft and OpenAI used articles publicly available on The New York Times’ website to create artificial intelligence products that compete with and threaten the newspaper’s ability to provide its web news service. “Defendants’ generative artificial intelligence tools rely on large language models that were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides and more,” The New York Times stated in the filing.

The newspaper said that although Microsoft and OpenAI engaged in wide-scale copying from many sources, they gave content from The New York Times particular emphasis when building their LLMs. “Through Microsoft’s Bing Chat (recently rebranded as “Copilot”) and OpenAI’s ChatGPT, defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment,” the filing from the newspaper stated.

Meanwhile, in the UK, Stability AI has failed in a bid to have certain claims that it infringed the IP rights of Getty Images thrown out before the case goes to trial.

Discussing the two lawsuits and how LLMs are trained, Paul Joseph, IP partner at Linklaters, said: “From what I’ve read, generally there is at least an element of reading stuff, making copies of stuff, and then running crawlers or AI systems over them to learn. The making of copies along the way is part of the training process.” However, the act of making copies of the content, according to Joseph, is restricted by copyright laws.

For an LLM provider or an enterprise user of a commercial LLM that is trained this way, he said: “Unless you fall into one of a few copyright exceptions, then it will be an infringement, and it’s not easy to get this sort of commercial training exercise into any exceptions.”
