Microsoft and Meta quizzed on AI copyright

Large language models are trained using vast amounts of public data – but do the hyperscalers comply with copyright laws?

Cliff Saran, Managing Editor

Published: 17 Nov 2023 10:00

Microsoft and Meta, the parent company of Facebook and WhatsApp, were quizzed by parliamentarians during a House of Lords Communications and Digital Committee meeting about the state of artificial intelligence (AI) and large language models (LLMs).

While Microsoft and Meta discussed the huge advances that have been made, both appeared to have given little thought to the implications of their work for the protection of intellectual property (IP). LLMs need access to vast swathes of freely available data on the internet to improve accuracy and reduce bias – but scraping text for commercial use infringes UK copyright law.

Vice-president and deputy chief privacy officer for policy at Meta, Rob Sherman, and director of public policy at the office for responsible AI at Microsoft, Owen Larter, both appeared to argue the case to change UK laws in line with some other countries, thereby allowing their AI software and LLMs to use data freely.

The session with the Microsoft and Meta representatives started with committee members enquiring about the current state of LLMs and the opportunities and risks.

When asked about the hype versus reality of generative AI (GenAI) and LLMs, Meta’s Sherman described the industry as reaching “a really important inflection point in the development of AI”, where models are able to run much more efficiently.

“A big area of investment for us is in both the ability to detect bias in machine learning models where they exist and to correct for that bias so that we can actually make the world a bit more fair and inclusive where we can,” he added.

Microsoft’s Larter said he was really enthusiastic about the opportunity of AI, especially in terms of productivity. Larter used the question to showcase Microsoft’s Copilot AI, particularly the GenAI in GitHub Copilot, which he said significantly boosts the productivity of software developers by “auto-completing” snippets of code.

The Lords were keen to hear what the two experts thought about open and closed data models and how to balance risk with innovation. Sherman said: “Our company has been very focused on [this]. We’ve been very supportive of open source along with lots of other companies, researchers, academics and nonprofits as a viable and an important component of the AI ecosystem.”

Along with the work a wider community can provide in tuning data models and identifying security risks, Sherman pointed out that open models lower the barrier to entry. This means the development of LLMs is not restricted to the largest businesses – small and mid-sized firms are also able to innovate.

In spite of Microsoft’s commitment to open source, Larter appeared more cautious. He told the committee that there needs to be a conversation about some of the trade-offs between openness and safety and security, especially around what he described as “highly capable frontier models”.

“I think we need to take a risk-based approach there,” he said. Future generations of AI models may have significant capabilities that bring societal benefits, but they may also present serious risks. “We need to think very carefully about whether it makes sense to open source those models or not.”

Copyright considerations

Lord Donald Foster asked Microsoft and Meta about the implications of the UK copyright laws relating to data used to train LLMs, saying: “In the UK, basically text and data mining requires purchasing a copyright licence with a number of exceptions, not least non-commercial activity. But if the purpose of ingesting text and data for training an LLM is to create a commercial product then it's pretty clear that a copyright licence would be required. The focus is on intent.”

Sherman argued that copyright laws were established well before LLMs and the courts would ultimately decide what is and is not fair use. He said: “I do think that maintaining broad access to information on the internet and for the use of innovation like this is quite important. I do support giving rights holders the ability to manage how their information is used.” He claimed that mechanisms such as the robots.txt file allow website owners to let others know whether text and data from their websites can be scraped.
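To illustrate the robots.txt mechanism Sherman referred to, the following is a minimal sketch of how a well-behaved crawler might check a site's rules before fetching a page, using Python's standard-library `urllib.robotparser`. The robots.txt content, the crawler name `ExampleAIBot` and the `example.com` URLs are hypothetical and chosen only for illustration; they do not describe any rules or crawlers used by Meta or Microsoft. Note that robots.txt is a voluntary convention: it signals a site owner's wishes, but compliance depends on the crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks an invented AI crawler ("ExampleAIBot")
# from the whole site, and blocks every other crawler from /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would typically call
    # set_url() and read() to download the site's robots.txt instead.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The invented AI crawler is disallowed everywhere on the site.
ai_allowed = may_fetch(ROBOTS_TXT, "ExampleAIBot", "https://example.com/articles/1")

# Any other crawler falls under the wildcard rules.
other_allowed = may_fetch(ROBOTS_TXT, "OtherBot", "https://example.com/articles/1")
other_private = may_fetch(ROBOTS_TXT, "OtherBot", "https://example.com/private/page")

print(ai_allowed, other_allowed, other_private)
```

In this sketch the site owner opts out of one named crawler while leaving most pages open to others, which is the kind of per-crawler signal the robots.txt convention supports.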

Larter said: “Jurisdictions like the EU and Japan recently clarified that there is an exception within their law for text and data mining within the context of training an AI model. There’s a longer list of countries that have that type of regime.”

However, there is no consensus over whether AI should be exempt from copyright law. The authors of a recent Harvard Business Review article wrote: “AI developers should ensure that they are in compliance with the law in regards to their acquisition of data being used to train their models. This should involve licensing and compensating those individuals who own the IP that developers seek to add to their training data, whether by licensing it or sharing in revenue generated by the AI tool.”
