For years, Meta employees have internally debated using copyrighted materials acquired through legally questionable means to train the company’s AI models, according to recently unsealed court documents. The documents were submitted by plaintiffs in Kadrey v. Meta, one of several AI copyright disputes working their way through the U.S. court system. Meta, the defendant, asserts that training models on intellectual property-protected works, particularly books, constitutes “fair use.” The plaintiffs, who include authors Sarah Silverman and Ta-Nehisi Coates, disagree.
Earlier submissions in the suit alleged that Meta CEO Mark Zuckerberg authorized the company’s AI team to train on copyrighted content and that Meta paused negotiations with book publishers over licensing AI training data. The newly disclosed documents, which include portions of internal work chats among Meta employees, give a clearer view of how Meta may have come to use copyrighted data for model training, including for its Llama family of models.
In one internal chat, Melanie Kambadur, a senior manager on Meta’s Llama model research team, discussed with colleagues training models on works they acknowledged could be legally indefensible. According to the filings, in a chat from February 2023, Xavier Martinet, a Meta research engineer, suggested that they could acquire books and escalate the matter to executives for a decision, embodying a mindset of “asking forgiveness, not for permission.” Martinet proposed buying e-books at retail prices to build a training dataset instead of negotiating licensing agreements with individual book publishers. After a coworker remarked that using unauthorized copyrighted materials could invite legal challenges, Martinet responded that numerous startups were probably already using pirated books for training.
In the same discussion, Kambadur mentioned that Meta was in talks with document hosting platform Scribd and others regarding licenses, and that the company’s legal advisors had become less risk-averse than before. She emphasized that licenses or approvals for publicly available data were still necessary, though the organization now had more resources, legal expertise, and business development support to speed up the process when needed.
Another work chat included in the filings covered the possible use of Libgen, a “links aggregator” that provides access to copyrighted works from publishers, as an alternative to data sources that Meta might license. Libgen has faced several lawsuits, shutdown orders, and fines over copyright infringement. In a message to Meta AI VP Joelle Pineau, Sony Theakanath, Director of Product Management at Meta, described Libgen as essential for reaching state-of-the-art (SOTA) benchmark results in AI models and suggested possible ways to mitigate the legal risk, including not publicly discussing the dataset’s use.
Further revelations in the filings suggest that Meta might have scraped Reddit data for model training, possibly by imitating the operations of a third-party app named Pushshift, even as Reddit announced plans in April 2023 to start charging AI firms for data access for training purposes. In another chat from March 2024, Chaya Nayak, Director of Product Management at Meta’s generative AI org, revealed that Meta leadership was contemplating overturning prior decisions regarding training sets, including the exclusion of Quora content and licensed books, to ensure sufficient training data for its models.
The plaintiffs in the case have amended their complaint multiple times since filing in the U.S. District Court for the Northern District of California, San Francisco Division, in 2023. The most recent amendment accuses Meta of comparing certain pirated books against copyrighted works available for licensing to judge whether initiating a licensing agreement with a publisher was necessary.
In recognition of the significant legal stakes involved, Meta has brought in two Supreme Court litigators from the law firm Paul Weiss to bolster its defense team. At this time, Meta has not provided any comment on the matter.