Meta Platforms used thousands of pirated books to train its AI models, alleges a new copyright lawsuit filed on Monday (Dec 11) night.

The tech giant reportedly did this despite warnings from its lawyers about the legal peril of doing so. The lawsuit, as per Reuters, was initially brought this summer.

Skipping permissions

Monday's filing consolidates two separate lawsuits brought against Meta, the owner of Facebook and Instagram, by comedian Sarah Silverman, Pulitzer Prize winner Michael Chabon and other prominent authors. They allege that Meta used their works to train its artificial intelligence language model, Llama, without permission.

Previously, the authors had also claimed that the AI model's output violates their copyrights. However, in November, US District Judge Vince Chhabria criticised the claim that the text generated by Llama copies or resembles the authors' works. Dismissing the claim, Chhabria allowed the authors to amend most of their claims.

As evidence, the authors have submitted Discord chat logs of a Meta-affiliated researcher, who can be seen discussing the procurement of the dataset. This, as per the news agency, is a potentially significant piece of evidence, as it could indicate that Meta knew its use of the books may not be protected by US copyright law.

What is in the chat logs?

In the chat logs, as quoted in the complaint, researcher Tim Dettmers, a doctoral student at the University of Washington, describes his back-and-forth with Meta's legal department. In it, he talks about whether the use of book files as training data would be "legally OK".

"At Facebook, there are a lot of people interested in working with (T)he (P)ile, including myself, but in its current form, we are unable to use it for legal reasons," Dettmers wrote in 2021.

Here, The Pile refers to an open-source language modelling dataset that Meta has acknowledged using to train the first version of Llama.

According to the complaint, a month before this, Dettmers wrote that Meta lawyers told him "the data cannot be used or models cannot be published if they are trained on that data."

Here, the researcher refrains from describing the specific concerns voiced by the lawyers. However, his counterpart mentions "books with active copyrights" and says they are the biggest likely source of worry. They also say that training on the data should "fall under fair use."

As per the US Copyright Office, the fair use doctrine allows for the use of "limited portions of a work including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports".

Llama

Llama is Meta's large language model (LLM). The tech giant released its first version in February 2023. At the time, it published a list of datasets used to train the language model. The list included "the Books3 section of ThePile," which, as per the complaint quoting the person who assembled the dataset, contains 196,640 books.

A second version, dubbed Llama 2, was made available for commercial use this summer. It is free to use for companies with fewer than 700 million monthly active users.

AI and copyrights

As artificial intelligence drives a craze, artists across art forms have come forward alleging unlawful use of their content. In 2023, big tech companies were met with a slew of lawsuits accusing them of ripping off copyright-protected works. If successful, these lawsuits would not only generate significant compensation for the content creators, but also drive up the cost of building the data-dependent models, somewhat dampening the generative AI craze.