Self-reported benchmarks. Read carefully.
That’s the working posture for anyone evaluating Claude Fable 5’s coding performance right now. The 80.3% SWE-Bench Pro figure that anchored Anthropic’s launch announcement comes from Anthropic’s system card, using Anthropic’s own scaffolding to run the evaluation. Independent leaderboard data presents a messier picture: contested orderings, scaffold-dependent scores, and at least one major independent benchmarking organization whose evaluation is still pending. The gap between what Anthropic published and what neutral evaluators have confirmed is the most important thing an enterprise team can know before making a deployment decision based on these numbers.
What Each Source Actually Measured
Three data points are now available, two from independent evaluators and one from Anthropic, and they don’t tell the same story.
vals.ai (independent leaderboard): Fable 5 scores 95.00% on SWE-Bench Verified, per the vals.ai leaderboard. SWE-Bench Verified is a different benchmark from SWE-Bench Pro: it uses a curated subset of software engineering problems verified for quality and solvability. vals.ai is a third-party leaderboard not operated by Anthropic, and that independence is meaningful. This is the benchmark figure enterprise teams can cite without a scaffold-dependency footnote.
Artificial Analysis (independent evaluator): Per Artificial Analysis’s independent evaluation, Fable 5 scores 1,932 on the GDPval-AA benchmark. Artificial Analysis runs evaluations across multiple frontier models and publishes comparative data. It’s a credible source. The GDPval-AA benchmark measures general-purpose development productivity across a range of task types. It’s not identical to SWE-Bench, but it’s independently administered and Fable 5’s score is competitive.
Anthropic system card (vendor-reported, own scaffold): Anthropic’s announcement places Fable 5 at 80.3% on SWE-Bench Pro. This is where the scaffold dependency lives. SWE-Bench Pro evaluations require a scaffolding layer that manages how the model interacts with the test environment, and Anthropic ran its evaluation using its own scaffolding. Independent aggregators, including morphllm.com, note this explicitly: the 80.3% figure reflects performance under Anthropic’s evaluation conditions, not under a neutral harness that other models were also tested on. That’s not a disqualifying finding. It is a required footnote for any honest comparison.
The Contested Leaderboard
The competitive picture at the top of SWE-Bench Pro is genuinely unsettled, and the unsettledness has a specific shape.
Multiple leaderboard entries now claim top-of-table positioning on coding benchmarks, and they can’t all be right. Anthropic’s system card places GPT-5.5 at 58.6% on SWE-Bench Pro, a figure that would make Fable 5’s 80.3% a commanding lead. But that 58.6% figure for GPT-5.5 isn’t independently confirmed in available cross-references. Independent aggregators instead show GPT-5.5 entries via DeepSWE at or near the top of the coding leaderboard by different measures, suggesting the competitive ordering flips depending on which harness and which benchmark variant you use.
Don’t expect a clean answer from any single leaderboard. The benchmark ecosystem for frontier coding models is currently in a contested state where the same model can rank first or fifth depending on evaluation methodology. This isn’t a scandal. It’s a known limitation of leaderboard-based model evaluation when scaffolding choices affect outcomes and vendors have an incentive to choose favorable conditions.
Disputed Claim
Warning
Independent Epoch AI evaluation of Claude Fable 5 is pending as of June 10, 2026. Epoch AI is the primary neutral-harness authority for frontier model comparison. SWE-Bench Pro competitive rankings should be treated as provisional until Epoch AI publishes.
What it means practically: any team citing “Fable 5 is the best coding model” based on the 80.3% SWE-Bench Pro figure is citing a vendor-scaffolded result in a contested field. That’s a different claim than “Fable 5 scores 95.00% on SWE-Bench Verified per an independent leaderboard.” Both statements are defensible with attribution. Only one of them is independently verified.
What the Evaluation Hierarchy Says About These Numbers
The AI benchmarking field has developed an informal but useful hierarchy for evaluating benchmark credibility. Applied to Fable 5’s figures:
Tier 2 (independent evaluation, not peer-reviewed): GDPval-AA at 1,932 (Artificial Analysis) and SWE-Bench Verified at 95.00% (vals.ai). These figures come from organizations that don’t have a commercial interest in Anthropic’s results and use their own evaluation infrastructure. They’re not peer-reviewed academic papers, but they’re meaningfully more independent than vendor self-reports.
Tier 4 (vendor internal evaluation with vendor scaffold): SWE-Bench Pro at 80.3% (Anthropic’s own scaffolding). This is what Anthropic measured under Anthropic’s conditions. It may accurately reflect the model’s capabilities under those conditions. It doesn’t tell you how the model performs under a neutral harness, and that’s the comparison that matters when you’re evaluating against other models that were tested differently.
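For teams that track benchmark citations in an internal registry, the two tiers applied here can be sketched as a small classifier. This is a minimal illustration, not an established standard: the `BenchmarkClaim` fields and the `credibility_tier` function are hypothetical names, and only the two tiers described in this brief are encoded. Any other combination returns `None` rather than guessing at a tier this brief does not define.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BenchmarkClaim:
    """One reported benchmark figure plus the provenance facts that set its tier."""
    benchmark: str
    score: str
    reported_by: str
    independent: bool      # evaluator has no commercial stake in the result
    peer_reviewed: bool
    vendor_scaffold: bool  # evaluation ran on the vendor's own harness


def credibility_tier(claim: BenchmarkClaim) -> Optional[int]:
    """Return the tier used in this brief, or None if the brief does not cover the case."""
    if claim.independent and not claim.peer_reviewed:
        return 2  # independent evaluation, not peer-reviewed
    if not claim.independent and claim.vendor_scaffold:
        return 4  # vendor internal evaluation with vendor scaffold
    return None   # tiers not described in this brief are left unclassified


# The three figures discussed in this brief, with provenance as reported above.
CLAIMS = [
    BenchmarkClaim("SWE-Bench Verified", "95.00%", "vals.ai", True, False, False),
    BenchmarkClaim("GDPval-AA", "1,932", "Artificial Analysis", True, False, False),
    BenchmarkClaim("SWE-Bench Pro", "80.3%", "Anthropic system card", False, False, True),
]


def citable_without_footnote(claims: List[BenchmarkClaim]) -> List[BenchmarkClaim]:
    """Keep only the claims that do not need a scaffold-dependency footnote (Tier 2 here)."""
    return [c for c in claims if credibility_tier(c) == 2]


if __name__ == "__main__":
    for claim in CLAIMS:
        print(f"{claim.benchmark} ({claim.reported_by}): tier {credibility_tier(claim)}")
```

Encoding provenance as explicit fields keeps the footnote decision mechanical: a figure moves tiers only when a fact about who ran it, and on which harness, changes.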
The pattern of contested benchmark leadership claims is familiar from prior pipeline cycles: MAI-Thinking-1, Claude Opus 4.8, and now Fable 5. The scaffold-dependency problem isn’t new. The field hasn’t solved it.
The Epoch AI Gap
Epoch AI is the closest thing frontier model evaluation has to a neutral authority. Its evaluations are independent, methodologically documented, and widely cited as a credible cross-model comparison source. As of June 10, 2026, Epoch AI has not published an independent evaluation of Claude Fable 5. The Epoch AI capabilities page for Fable 5 is unavailable; the URL is broken as of this writing.
That gap is the most important thing this brief can tell you. If your deployment decision depends on a SWE-Bench Pro comparison between Fable 5 and GPT-5.5, you’re comparing a vendor-scaffolded figure against an unconfirmed competitor figure, in a benchmark category where the competitive ordering is actively disputed. Epoch AI’s evaluation, when it publishes, will give you the neutral-harness comparison the current data can’t provide.
Until then: the vals.ai and Artificial Analysis numbers are what you have that’s independently grounded.
What to Watch
Unanswered Questions
- Which scaffolding did Anthropic use for the SWE-Bench Pro evaluation, and has it been published for replication?
- What harness did DeepSWE use to produce the GPT-5.5 results that contest Anthropic's 58.6% figure?
- When will Epoch AI publish its independent evaluation of Fable 5, and will it include both SWE-Bench Pro and Verified variants?
What Teams Should Actually Do
The practical split is straightforward. Enterprise teams evaluating Fable 5 for production coding workflows have two defensible options right now.
First: use the independently verified figures, SWE-Bench Verified 95.00% (vals.ai) and GDPval-AA 1,932 (Artificial Analysis), as your benchmark baseline. These don’t require a scaffold-dependency footnote. They won’t shift when Epoch AI publishes.
Second: run your own internal evaluation on your specific codebase and task distribution. The Stripe migration claim is striking: Stripe reported that Fable 5 completed a 50-million-line Ruby codebase migration in a single day, a task estimated to take a human team more than two months. It’s also Stripe-reported and distributed via Anthropic’s announcement materials. The corroboration is consistent but downstream. Stripe is a testing partner, not a neutral evaluator. Your codebase isn’t a 50-million-line Ruby monolith. Internal evaluation on your own task distribution is worth more than any leaderboard position for a production decision.
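One way to keep an internal evaluation honest is to report pass rates with an uncertainty band instead of a single headline number. The sketch below assumes you already have per-task pass/fail outcomes grouped by category from your own harness. The function names and the category labels in the example are hypothetical, and the Wilson score interval is one reasonable choice for small samples, not a method drawn from any of the sources above.

```python
import math
from typing import Dict, List, Tuple


def wilson_interval(passes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 is roughly 95% confidence."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def summarize(results: Dict[str, List[bool]]) -> Dict[str, Tuple[float, float, float]]:
    """Map each task category to (pass_rate, ci_low, ci_high)."""
    summary = {}
    for category, outcomes in results.items():
        passes = sum(outcomes)
        low, high = wilson_interval(passes, len(outcomes))
        summary[category] = (passes / len(outcomes), low, high)
    return summary


if __name__ == "__main__":
    # Placeholder outcomes for illustration only; these are not real results.
    example = {"bug_fix": [True, True, False, True], "migration": [False, True]}
    for category, (rate, low, high) in summarize(example).items():
        print(f"{category}: {rate:.0%} (95% CI {low:.0%} to {high:.0%})")
```

With only a handful of tasks per category the interval will be wide, which is the point: it shows when an internal sample is too small to separate two models at all.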
Two confirmed facts cut through the benchmark noise and matter for safety-sensitive deployments: Fable 5’s safeguard system triggers in less than 5% of sessions, and it falls back to Claude Opus 4.8 for high-risk queries. That’s confirmed from Anthropic’s primary announcement page, verbatim. It’s an architecturally significant design choice: the model is designed to be safe for general use, with Mythos 5 (same underlying model, safeguards lifted) reserved for vetted cyberdefenders under Project Glasswing. The safety architecture isn’t contested. The benchmark rankings are.
TJS Synthesis
The 80.3% SWE-Bench Pro figure will keep circulating in coverage of Fable 5 because it’s the largest number and it came first. Enterprise teams and practitioners using it for deployment comparisons should understand what they’re actually citing: a vendor-scaffolded evaluation in a benchmark category where the competitive ordering is genuinely contested and Epoch AI’s independent assessment is still pending.
The defensible decision path is this: treat SWE-Bench Verified at 95.00% (vals.ai) and GDPval-AA at 1,932 (Artificial Analysis) as your current baseline. Treat the SWE-Bench Pro 80.3% figure as a vendor claim requiring independent confirmation. Watch for Epoch AI’s evaluation: when it publishes, the field will have a neutral-harness comparison across multiple frontier models that settles the contested leaderboard question. Until then, run your own internal benchmarks on your task distribution. That’s the only comparison that’s actually about your deployment.