Regulation Daily Brief

Five Publishers Sue Meta Over Llama Training Data, Allege Intentional CMI Removal to Conceal Pirated Sources

3 min read · Sources: CBS News / Variety (partial)
Elsevier, Cengage, Hachette, Macmillan, McGraw Hill, and author Scott Turow have filed a copyright class-action against Meta in the Southern District of New York, alleging the company trained its Llama suite of models on datasets sourced from what the complaint describes as "notorious pirate sites." The complaint adds a legally distinct allegation that Meta intentionally removed copyright management information from the works, a claim under 17 U.S.C. § 1202 that carries separate damages exposure and changes the legal character of the case.
6 plaintiffs, SDNY copyright class-action vs. Meta
Key Takeaways
  • Five major publishers and author Scott Turow have filed a copyright class-action against Meta in SDNY over Llama training data; the lawsuit was reported this week
  • The complaint alleges Meta used datasets from "notorious pirate sites" and names Mark Zuckerberg personally; both are allegations in the complaint, not findings
  • The CMI removal allegation under 17 U.S.C. § 1202 is legally distinct from standard infringement and carries separate damages exposure
  • Meta says AI training on copyrighted material constitutes fair use, consistent with positions taken in other AI copyright litigation
  • The case adds to a global docket including Penguin v. OpenAI (Munich) and ANI v. OpenAI (Delhi), all turning on the same core question about training data authorization
Analysis

The CMI removal allegation under 17 U.S.C. § 1202 is the legally novel element here. It shifts the potential evidentiary burden from "did you copy this work" to "did you intentionally strip the metadata that identified it as copyrighted." That's a different and potentially more invasive discovery inquiry, and the theory could extend to any training pipeline that processed and stripped metadata from web-scraped content, not just pirated sources.

Warning

All claims in this item are complaint allegations, not findings, rulings, or adjudicated facts. Meta disputes the core premise through a fair use defense. Nothing in this brief should be read as a characterization of liability.

A lawsuit filed in the Southern District of New York, reported this week, names Meta as defendant in a copyright class-action brought by five major academic and trade publishers and author Scott Turow. The plaintiffs are Elsevier, Cengage, Hachette, Macmillan, and McGraw Hill. According to reporting from CBS News and Variety, the complaint alleges Meta used torrented datasets from sources the filing describes as “notorious pirate sites” to train the Llama suite of AI models.

The complaint also names Meta CEO Mark Zuckerberg personally, alleging he authorized the use of unlicensed training materials. That allegation is a legal claim in the complaint; it is not a finding, and it has not been tested. Naming an individual in a copyright complaint is an aggressive litigation posture, but it doesn't determine the outcome. It does, however, signal the plaintiffs' intent to pursue individual accountability alongside corporate liability.

What separates this case from similar AI copyright suits is the CMI removal allegation. The complaint reportedly includes claims under 17 U.S.C. § 1202, which prohibits the intentional removal or alteration of copyright management information: the embedded metadata identifying ownership, licensing terms, and rights-holder contact details. This isn't a standard infringement claim. CMI removal is a separate cause of action with its own proof requirements and statutory damages structure. If the allegation proceeds to discovery, Meta would have to produce evidence of how it handled metadata in its training pipeline, a different and potentially more invasive evidentiary burden than a straight reproduction claim. Variety's reporting identifies this as a key element of the complaint.

Meta has stated through a spokesperson that the company believes its use of copyrighted material for AI training constitutes fair use under U.S. law. That defense is consistent with positions taken in other AI copyright litigation. Whether it holds is a legal question courts are actively working through in multiple jurisdictions.

This case sits in a growing docket of AI copyright suits. The White House’s copyright position has been contested in Congress and in court. The US and UK have reached different preliminary answers on training data fair use. The publishers filing in SDNY are adding a domestic federal case to a pattern that already includes Penguin Random House’s action against OpenAI in Munich and ANI’s suit against OpenAI in Delhi. Different jurisdictions, different legal frameworks, same underlying question: does AI model training require copyright authorization for the works it learns from?

What to watch: whether the CMI removal allegation survives a motion to dismiss. If it does, the discovery implications for Meta’s training data pipeline are significant. The case also adds to the regulatory pressure around training data provenance documentation, a requirement that has no formal legal mandate in the US yet, but is becoming a de facto expectation in any organization that wants to defend its AI development practices against litigation.

The non-obvious implication worth considering: if CMI removal becomes a viable litigation theory against AI training pipelines, the question isn’t only about torrented datasets. Any training data pipeline that processed web-scraped content and stripped metadata in the process could face exposure under the same theory, a scope that extends well beyond piracy allegations to standard data preparation practices across the industry.
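To make the scope of that theory concrete, here is a minimal, hypothetical sketch (Python standard library only, not any real training pipeline) of how an ordinary HTML text-extraction step silently discards copyright management information. The `TextExtractor` class and the sample page are illustrative assumptions, not anything from the complaint.

```python
# Illustrative sketch: a routine HTML-cleaning pass keeps visible text
# and drops tags, including the <meta> tags where CMI often lives.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text content; tag attributes are never kept."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

# Hypothetical scraped page with a copyright <meta> tag (CMI).
page = (
    '<html><head>'
    '<meta name="copyright" content="(c) 2025 Example Press">'
    '<title>Sample Chapter</title></head>'
    '<body><p>Chapter text goes here.</p></body></html>'
)

parser = TextExtractor()
parser.feed(page)
cleaned = " ".join(parser.chunks)

# The visible text survives; the copyright metadata does not.
print(cleaned)
```

The point is not that this code is what any defendant did, but that metadata loss is the default behavior of common text-extraction steps, which is why the § 1202 theory could reach standard data preparation and not only pirated sources.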

May 5, 2026