A lawsuit filed in the Southern District of New York, reported this week, names Meta as the defendant in a copyright class action brought by five major academic and trade publishers and author Scott Turow. The plaintiffs are Elsevier, Cengage, Hachette, Macmillan, and McGraw Hill. According to reporting from CBS News and Variety, the complaint alleges Meta used torrented datasets from sources the filing describes as “notorious pirate sites” to train its Llama family of AI models.
The complaint also names Meta CEO Mark Zuckerberg personally, alleging he authorized the use of unlicensed training materials. That allegation is a claim in the complaint, not a finding; it has not been tested. Naming an executive personally in a copyright complaint is an aggressive litigation posture, and it does not determine the outcome. It does, however, signal the plaintiffs’ intent to pursue individual accountability alongside corporate liability.
What separates this case from similar AI copyright suits is the CMI removal allegation. The complaint reportedly includes claims under 17 U.S.C. § 1202, which prohibits the intentional removal or alteration of copyright management information: the embedded metadata identifying ownership, licensing terms, and rights-holder contact details. This is not a standard infringement claim. CMI removal is a separate cause of action with its own proof requirements and statutory damages structure. If the claim survives to discovery, Meta would have to produce evidence of how it handled metadata in its training pipeline, a different and potentially more invasive evidentiary burden than a straight reproduction claim. Variety’s reporting identifies this as a key element of the complaint.
Meta has stated through a spokesperson that the company believes its use of copyrighted material for AI training constitutes fair use under U.S. law. That defense is consistent with positions taken in other AI copyright litigation. Whether it holds is a legal question courts are actively working through in multiple jurisdictions.
This case sits in a growing docket of AI copyright suits. The White House’s position on AI and copyright has been contested in Congress and in court, and the US and UK have reached different preliminary answers on training-data fair use. The publishers filing in SDNY add a domestic federal case to a pattern that already includes Penguin Random House’s action against OpenAI in Munich and ANI’s suit against OpenAI in Delhi. Different jurisdictions, different legal frameworks, same underlying question: does AI model training require copyright authorization for the works it learns from?
What to watch: whether the CMI removal claim survives a motion to dismiss. If it does, the discovery implications for Meta’s training data pipeline are significant. The case also adds to the regulatory pressure around training data provenance documentation, a practice with no formal legal mandate in the US yet, but one becoming a de facto expectation for any organization that wants to defend its AI development practices against litigation.
The non-obvious implication worth considering: if CMI removal becomes a viable litigation theory against AI training pipelines, the question isn’t only about torrented datasets. Any training data pipeline that processed web-scraped content and stripped metadata in the process could face exposure under the same theory, a scope that extends well beyond piracy allegations to standard data preparation practices across the industry.
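To make that exposure concrete, here is a minimal sketch of how a routine scrape-and-extract step discards metadata as a side effect. It uses Python's standard-library `html.parser` on a hypothetical page; the page content and the extractor are illustrative assumptions, not anything from the complaint or from Meta's actual pipeline. The point is structural: a text extractor tuned for training data keeps visible body text and silently drops the `<meta>` tags where copyright management information typically lives.

```python
from html.parser import HTMLParser

# Hypothetical page: the <meta> tags carry the kind of copyright
# management information (CMI) described in 17 U.S.C. § 1202; the
# body carries the text a model would actually train on.
PAGE = """<html><head>
<meta name="copyright" content="(c) 2024 Example Press">
<meta name="rights-holder-contact" content="rights@example.com">
<title>Sample Chapter</title></head>
<body><p>Once upon a time, a model read a book.</p></body></html>"""

class TextOnly(HTMLParser):
    """Minimal extractor in the style of common scraping pipelines:
    it collects visible body text and ignores everything else,
    including the CMI sitting in the <meta> tags."""
    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        # Note: <meta> tags are seen here but never recorded,
        # so their rights information never reaches the output.

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body and data.strip():
            self.chunks.append(data.strip())

parser = TextOnly()
parser.feed(PAGE)
training_text = " ".join(parser.chunks)

print(training_text)
# The copyright notice and rights-holder contact are gone:
print("copyright" in training_text.lower())  # False
```

Nothing in this sketch is adversarial; it is the ordinary shape of boilerplate removal. That is exactly why a § 1202 theory, if it gains traction, would reach standard preprocessing rather than only deliberately pirated sources.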