The AI copyright litigation wave has a new case, a new theory, and a new named defendant. Elsevier, Cengage, Hachette, Macmillan, McGraw Hill, and author Scott Turow have filed a copyright class action in the Southern District of New York against Meta, according to reporting from CBS News and Bloomberg Law. The core allegations, that Meta trained its Llama suite of models on datasets sourced from what the complaint calls “notorious pirate sites,” and that the company intentionally removed copyright management information to conceal those sources, are claims in a complaint, not adjudicated facts. But the legal theory they advance, if it survives a motion to dismiss, has implications that extend well beyond Meta’s training pipeline.
What the Complaint Alleges
The factual allegations, as reported, fall into two distinct categories.
First, the standard training data infringement claim: Meta used the plaintiffs’ copyrighted works, sourced from torrented datasets, to train the Llama models without authorization or compensation. The complaint describes the sources as “notorious pirate sites,” a characterization that, if proven, undercuts the fair use defense Meta has already signaled it will assert. Fair use analysis under U.S. law involves a four-factor test. The commercial nature of the use, the effect on the market for the original work, and the nature of the copyrighted work all cut differently when the source is pirated material versus openly published content. Courts haven’t ruled definitively on this in the AI training context, but “sourced from piracy sites” is a harder fair use argument than “scraped from the open web.”
Second, and legally distinct: the CMI removal allegation under 17 U.S.C. § 1202. Copyright management information is the embedded metadata that identifies a work’s ownership, rights holder, licensing terms, and contact information. Section 1202 prohibits two things: (a) providing false CMI, and (b) intentionally removing or altering CMI with knowledge that doing so would facilitate infringement. The claim here, as reported, is that Meta removed this information from the works as part of its training data preparation. That’s not a reproduction claim, it’s a claim about what happened to the metadata in the pipeline before training began.
The significance is procedural as much as substantive. A successful CMI claim requires showing intentional conduct and knowledge of infringement facilitation. That means discovery into Meta’s data preparation processes (how the company acquired, cleaned, filtered, and processed training data) would need to address metadata handling specifically. That’s a different evidentiary inquiry than proving reproduction, and potentially a more granular one.
The complaint also names Mark Zuckerberg personally, alleging he authorized the use of unlicensed materials. That allegation is a litigation strategy, not a factual finding. Personal liability claims in copyright suits against executives are uncommon and rarely succeed at trial, but they create pressure to settle and signal that plaintiffs want accountability at the highest level, not just a corporate damages judgment.
How This Case Compares to What’s Already in Court
The SDNY complaint is the latest entry in a global docket of AI copyright cases. Comparing the three most prominent illustrates what’s new and what’s familiar.
Penguin Random House v. OpenAI (Munich): Filed in Germany under German copyright law, which provides narrower fair use-equivalent exceptions for text and data mining. The Munich case focuses on reproduction of protected expression, with a regulatory framework that is more skeptical of broad fair use claims than U.S. law. It’s the strictest-jurisdiction test of the underlying training data question. See the prior pipeline coverage on global copyright divergence for the German legal context.
ANI v. OpenAI (Delhi): An Indian news agency pursuing copyright claims in an Indian court, adding a third legal framework to the global picture. Indian copyright law has its own fair dealing doctrine, distinct from both U.S. fair use and German TDM exceptions. The Delhi case matters less for its legal theory than for its jurisdictional signal: AI copyright litigation is not a US-EU story, it’s a global one.
Publishers v. Meta (SDNY): The new case adds two things neither of the above includes. First, the CMI removal theory, a U.S.-specific statutory claim with its own damages structure ($2,500 to $25,000 per violation under § 1203, plus attorneys’ fees). Second, named executive personal liability. Both are litigation pressure tactics, but the CMI theory is also legally substantive in a way that could change how other plaintiffs frame their cases if it survives dismissal.
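To make the § 1203 damages structure concrete, a back-of-the-envelope sketch in Python. The per-violation range comes from the statute as cited above; the violation count used below is invented purely for illustration, not an estimate of this case’s exposure:

```python
# Statutory damages for CMI violations, per 17 U.S.C. § 1203(c)(3)(B):
# not less than $2,500 and not more than $25,000 per violation.
PER_VIOLATION_MIN = 2_500
PER_VIOLATION_MAX = 25_000

def statutory_damages_range(violations: int) -> tuple[int, int]:
    """Return the (low, high) aggregate statutory range for a violation count."""
    return violations * PER_VIOLATION_MIN, violations * PER_VIOLATION_MAX

# Hypothetical: if a court counted 10,000 works as separate violations,
# the statutory range alone would span $25M to $250M, before attorneys' fees.
low, high = statutory_damages_range(10_000)
```

Because the range scales per violation, the contested question in any real case is how violations are counted, not the arithmetic itself.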
The US-UK copyright divergence analysis and the Senate IP hearing coverage provide the legislative backdrop: Congress hasn’t resolved the training data question, the White House’s framework position is contested, and courts are currently the primary venue for establishing U.S. precedent. The SDNY case adds to that judicial workload with a novel theory.
Meta’s Defense Posture
A Meta spokesperson stated the company believes its use of copyrighted material for AI training constitutes fair use under U.S. law. That’s a straightforward statement of the most commonly asserted AI copyright defense. The argument runs roughly as follows: training is transformative use, produces a new category of output (model weights, not reproductions), doesn’t substitute for the market for the original work, and is therefore protected under the four-factor fair use test.
The problem the piracy allegation creates for this defense is factor two, the nature of the copyrighted work, and potentially factor four, market harm. Courts may weigh the source of the training data when assessing whether use was fair. There’s no direct precedent on this specific question in the AI training context, but the general copyright principle that bad-faith acquisition of works affects fair use analysis exists in case law.
The CMI removal allegation creates a separate problem: it’s not a fair use question. Fair use is a defense to infringement. Section 1202 creates an independent cause of action. Even if Meta prevails on fair use, the CMI claim survives independently and requires its own defense.
What AI Companies Should Understand About Their Exposure
The CMI removal theory, if it advances, has implications that extend beyond pirated training data. Standard web-scraping pipelines routinely discard HTML tags, formatting markup, and embedded metadata during cleaning and normalization to produce plain training text. If that process removes copyright management information, and if the party doing the stripping knew the content was copyrighted, the statutory framework under § 1202 could apply even to content scraped from publicly accessible sources.
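The mechanics of that risk are easy to see in code. The sketch below is a generic, hypothetical illustration of a text-extraction pass (not Meta’s pipeline, which is not public): it keeps visible prose for training and silently discards embedded metadata, including fields that plausibly qualify as CMI. The tag names and sample page are invented for the example.

```python
from html.parser import HTMLParser

# Meta tag names that commonly carry rights information (illustrative list).
CMI_META_NAMES = {"copyright", "author", "rights"}

class TextExtractor(HTMLParser):
    """A minimal 'cleaning' pass: keep body text, drop everything else."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # what survives into the training corpus
        self.dropped_cmi = []  # what the cleaning step throws away

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr_map = dict(attrs)
            if attr_map.get("name", "").lower() in CMI_META_NAMES:
                self.dropped_cmi.append(
                    (attr_map.get("name"), attr_map.get("content"))
                )

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())

page = """<html><head>
<meta name="copyright" content="(c) 2024 Example Press">
<meta name="author" content="Jane Doe">
</head><body><p>Chapter one begins here.</p></body></html>"""

parser = TextExtractor()
parser.feed(page)
clean_text = " ".join(parser.text_parts)
# clean_text carries none of the ownership information the page declared.
```

The point is that the stripping is a side effect of routine normalization, not a deliberate concealment step, which is why the § 1202 knowledge and intent requirements would be the battleground.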
This isn’t a prediction, it’s a legal risk that courts haven’t addressed yet. But the SDNY complaint is the first attempt to put that theory before a federal court in the AI training context. AI companies that don’t have documentation of their training data provenance and metadata handling practices are operating without the evidence they’d need to defend against this theory if it gains traction.
The practical takeaway for compliance teams: training data provenance documentation is no longer just a best practice for ethical AI development. It’s becoming litigation-relevant evidence. Where your training data came from, what happened to it during preparation, and whether metadata was preserved or stripped: these are questions you want answers to before opposing counsel asks them.
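What that documentation might look like in practice: a minimal sketch of a provenance record captured per document, before any cleaning step runs. The field names, URL, and sample values are assumptions for illustration, not a standard schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(source_url: str, raw_bytes: bytes, cmi_fields: dict) -> dict:
    """Record where a document came from and what rights metadata it
    carried *before* the pipeline stripped anything."""
    return {
        "source_url": source_url,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "content_sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "cmi_preserved": cmi_fields,          # e.g. notice, rights holder
        "cmi_stripped_in_pipeline": True,     # document the handling decision
    }

record = provenance_record(
    "https://example.com/article",            # placeholder source
    b"<html>...</html>",
    {"copyright": "(c) 2024 Example Press", "author": "Jane Doe"},
)
log_line = json.dumps(record)  # append to an immutable provenance log
```

A content hash plus the preserved CMI fields gives you exactly the evidence the § 1202 theory would put at issue: what metadata existed, and a documented, deliberate record of how it was handled.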