Zuckerberg Knowingly Used Pirated Data to Train Meta AI, Authors Allege

Mark Zuckerberg approved the use of pirated books to train Meta AI even after his own team warned that the material had been obtained illegally, a group of authors allege in a recent lawsuit.

The allegations stem from a copyright infringement lawsuit filed in July 2023 by a group of authors, including comedian Sarah Silverman, Christopher Golden and Richard Kadrey, in a California federal court. The group alleged that Meta misused their books to train its Llama LLM, and they seek damages and an injunction prohibiting Meta from using their works. The judge in the case dismissed most of the authors’ claims in November of the same year, but these recent allegations could reignite the legal dispute.

“Meta CEO Mark Zuckerberg approved Meta’s use of the LibGen dataset, despite concerns within Meta’s AI management team (and others at Meta) that LibGen is ‘a dataset that we know is illegal,’” attorneys for the plaintiffs said in a Wednesday filing. Despite these red flags, the lawsuit alleges that “after escalation” Zuckerberg gave the green light for Meta’s AI team to continue using the controversial dataset.

Meta representatives did not immediately respond to Declutter’s request for comment.

LibGen, short for Library Genesis, is an online platform that provides free access to books, academic papers, articles and other written publications without adhering to copyright laws. It acts as a “shadow library,” offering these materials without permission from publishers or copyright holders. It currently houses more than 33 million books and more than 85 million articles.

The lawsuit claims Meta hid this until the last possible moment. Just two hours before the December 13, 2024 deadline for discovery, the company produced what the plaintiffs described as “some of the most incriminating internal documents it has produced to date.”

Meta’s own engineers seemed uncomfortable with the plan, according to statements in court filings. The authors claim that internal messages show Meta engineers were hesitant to download the pirated material, with one noting that “torrenting from a [Meta-owned] business laptop doesn’t feel right (smile emoji).” Nevertheless, the engineers not only proceeded to download the books but also systematically removed the copyright information to prepare them for AI training, the lawsuit alleges.

The latest filings in the lawsuit paint a picture of a company fully aware of the risks: an internal memo warned that “media reporting suggesting that we have used a data set that we know is illegal, such as LibGen, could undermine our bargaining power with regulators.” Yet Meta went ahead with downloading and distributing (or “seeding”) the illegal content via torrent networks by January 2024 anyway, the lawsuit said.

When questioned about these activities in a deposition, Zuckerberg appeared to distance himself from the decision, testifying that such piracy would raise “a lot of red flags” and “seems like a bad thing.”

The court documents also suggest that Meta’s approach to handling copyrighted information paid more attention to model training than to copyright rules. According to the filing, one engineer “filtered […] copyright lines and other data from LibGen to prepare a CMI-stripped version of it” for training Llama. This systematic removal of copyright information could strengthen the authors’ claims that Meta knowingly attempted to conceal the use of pirated material.

The revelations come at a crucial time for Meta’s AI ambitions. The company has been working hard to compete with OpenAI and Google in the AI space: Llama 3.2 is the most popular open-source LLM, and Meta AI is a solid free competitor to ChatGPT with similar features.

Most of these AI companies are facing legal battles over their questionable practices when it comes to training their large language models. Meta has already been sued by another group of authors for copyright infringement, OpenAI is currently facing several lawsuits for training its LLMs on copyrighted material, and Anthropic is also facing accusations from authors and songwriters.

But overall, tech entrepreneurs and creators have been at odds since generative AI exploded in popularity. There are currently dozens of different lawsuits against AI companies that allegedly used copyrighted material willingly to train their models. But as with most things on the cutting edge, we’ll have to wait and see what the courts have to say about all this.
