The AI copyright landscape has a new fork in it.
For two years, the dominant theory in AI copyright litigation has been output similarity: does the model produce content that substantially resembles a protected work? That’s the frame for the cases against Meta, OpenAI, and Anthropic that have been working through federal courts since 2023. It’s a theory grounded in traditional copyright analysis: did the defendant reproduce the protected expression?
Nazemian et al. v. NVIDIA Corp., No. 4:24-cv-01454-JST, introduces a parallel theory that doesn’t wait for output. According to legal analysis of the ruling, the court allowed direct infringement claims to proceed based on the act of copying protected works into a training dataset, not based on what the model does with them afterward. Judge Jon S. Tigar of the Northern District of California declined to dismiss those claims on May 4, 2026, finding them legally sufficient to survive a Rule 12(b)(6) motion.
That’s two distinct theories of copyright liability for AI training, now both alive in federal court simultaneously. Compliance teams need to understand what each one requires, and what exposure it creates.
Theory One: Output Similarity
The output-similarity cases, including the publishers’ suits against Meta over Llama training data (covered in prior hub coverage), ask whether the model reproduces protected expression in its outputs. The legal analysis tracks standard copyright doctrine: was there copying? Was the copied material substantial? Is there a fair use defense?
This theory has a natural ceiling at the output. If the model doesn’t generate outputs that resemble protected works, or if the company can persuasively argue that training-time copying is transformative, liability under this theory is contained. It also means the liability analysis is heavily fact-dependent: it turns on what outputs are generated, and for whom.
Theory Two: Input Copying
The input-copying theory argues the infringement happens at ingestion. If you copy a copyrighted book into your training dataset, you’ve reproduced that work, regardless of whether the model ever produces output that looks like that book. Per the legal analysis of Nazemian, the court found this theory legally cognizable at the pleading stage: the act of building a training dataset from copyrighted material, without license, is a potential act of direct infringement.
This matters because it removes the output from the analysis. A company that trains on shadow library content doesn’t escape liability by demonstrating that its outputs don’t substantially reproduce the training books. The infringement, if proved, happened during the data pipeline, before deployment.
The datasets named in the Nazemian complaint (Books3, The Pile, SlimPajama, and Anna’s Archive) are worth understanding. Books3 is a dataset assembled from Bibliotik, a private file-sharing community, containing hundreds of thousands of copyrighted books. Anna’s Archive is a search engine that indexes shadow libraries. The Pile is a large-scale dataset assembled by EleutherAI from multiple sources, including some with rights ambiguity. SlimPajama is a deduplicated, cleaned version of RedPajama, a large open-source training dataset. These aren’t obscure data sources; they appear broadly across AI training pipelines.
What the Rule 12(b)(6) Survival Actually Means
A Rule 12(b)(6) motion to dismiss tests legal sufficiency at the pleading stage, before discovery, before evidence. The court is asking: accepting the complaint’s allegations as true, does it state a legally viable claim? NVIDIA’s motion argued the answer was no. The court disagreed, at least as to the direct infringement claims.
This is not a finding of liability. NVIDIA retains all its defenses, including fair use, for later stages in the proceeding. Fair use is a “mixed question of law and fact” under established federal doctrine, meaning it involves both legal analysis and factual findings. Courts typically don’t resolve it at the pleading stage, and Judge Tigar reportedly did not. NVIDIA can argue fair use at summary judgment or trial, and it may prevail.
What the 12(b)(6) survival tells you is that the input-copying theory cleared the minimum threshold for legal viability. Federal courts can now be asked to resolve this theory on the merits. That’s a materially different posture than if the court had dismissed.
The Shadow Library Exposure Map
The datasets named in Nazemian aren’t unique to NVIDIA. Books3 and Anna’s Archive in particular have appeared across multiple AI training pipelines at multiple companies. If the input-copying theory gains traction, and courts start finding that training on unlicensed copyrighted material constitutes direct infringement, the exposure won’t belong to NVIDIA alone.
The executive liability theory explored in prior hub coverage adds a further dimension: if corporate liability follows from training data decisions, the question of who made those decisions, and what they knew, becomes legally relevant. That’s a longer arc, but Nazemian’s survival at 12(b)(6) keeps it in play.
What the Two Theories Mean Together
Running both theories simultaneously creates compounding exposure for AI companies that used shadow library content. Under output-similarity theory, you need to demonstrate your outputs don’t substantially reproduce protected works. Under input-copying theory, you need to demonstrate your data pipeline was properly licensed, or argue fair use, even if your outputs are clean.
That’s a different risk model than the one most AI legal teams have been working with. The output-similarity theory pointed toward output audits, content filters, and similarity analysis of generated content. The input-copying theory points backward, toward data sourcing documentation, licensing records, and the provenance of every dataset in the training pipeline.
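To make that backward-looking posture concrete, here is a minimal, hypothetical sketch of a provenance check a compliance team might run over a training pipeline: a manifest recording the source, license basis, and approver for each dataset, screened against a watchlist of corpora named in litigation. The dataset names in the watchlist come from the complaint as reported above; the manifest format, field names, and function are illustrative assumptions, not a description of any company’s actual tooling or of anything ordered by the court.

```python
from dataclasses import dataclass

# Corpora named in the Nazemian complaint, per the legal analysis cited above.
LITIGATION_WATCHLIST = {"books3", "the pile", "slimpajama", "anna's archive"}


@dataclass
class DatasetRecord:
    """One entry in a (hypothetical) training-data provenance manifest."""
    name: str            # dataset identifier as used in the pipeline
    source_url: str      # where the data was obtained
    license_basis: str   # e.g. "licensed", "public domain", "unverified"
    approved_by: str     # who signed off on ingestion


def flag_exposure(manifest: list[DatasetRecord]) -> list[str]:
    """Return human-readable flags for records that warrant counsel review."""
    flags = []
    for record in manifest:
        if record.name.lower() in LITIGATION_WATCHLIST:
            flags.append(f"{record.name}: named in active litigation (Nazemian)")
        if record.license_basis == "unverified":
            flags.append(f"{record.name}: no documented license basis")
    return flags


if __name__ == "__main__":
    # Illustrative manifest entries only; names and paths are invented.
    manifest = [
        DatasetRecord("Books3", "https://example.invalid/books3", "unverified", "unknown"),
        DatasetRecord("internal-docs-v2", "s3://corp-bucket/docs", "licensed", "data-eng lead"),
    ]
    for flag in flag_exposure(manifest):
        print(flag)
```

The point isn’t the script; it’s that the input-copying theory makes this kind of record, who ingested what, from where, and under what license, the artifact a court may eventually ask to see.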
Implications for Training Data Acquisition Decisions
For organizations making decisions about AI model procurement or fine-tuning on proprietary data: the Nazemian ruling introduces a question worth raising with counsel about the models you’re deploying. If a model was trained on Books3 or Anna’s Archive content, and the input-copying theory produces a liability finding in this case, what does that mean for companies that licensed or deployed that model?
That’s not a settled legal question. It may not be the next question the courts reach. But it’s a question worth asking your legal team now rather than after the discovery phase in Nazemian produces documentation that makes the answer harder to ignore.
A note on sourcing: this deep-dive is based on a single T3 legal analysis source. The underlying court order should be verified via PACER (Case No. 4:24-cv-01454-JST, N.D. Cal.) before any compliance or legal decisions are based on the specific reasoning attributed to the court. This is legal reporting, not legal advice; engage qualified copyright counsel before acting.
What to Watch
Discovery in Nazemian is the near-term signal. If NVIDIA’s data sourcing practices are documented through discovery, the resulting record will affect not just this case but the legal posture of other defendants using the same datasets.
Watch also whether other AI companies named in shadow library complaints bring their own 12(b)(6) challenges, and how those courts rule. If multiple courts reach the same conclusion about the legal viability of input-copying claims, the theory hardens from “alive in one case” to “established legal posture.” That’s when it becomes a compliance standard, not just a litigation risk.
TJS Synthesis
The Nazemian ruling doesn’t resolve the AI copyright question. It adds a second dimension to it. Output-similarity cases ask “what did the model produce?” Input-copying cases ask “what did you put in?” Both questions are now live in federal court simultaneously. The non-obvious implication for compliance teams is this: if your AI governance framework evaluates copyright risk primarily through output monitoring, it may be looking at only half the liability surface. The data pipeline that fed the model is now, at minimum, a question a federal court says deserves a full legal answer.