Three Sources, Three Different SWE-Bench Pro Leaders: How to Read Fable 5's Benchmarks

June 10, 2026 6 min read Anthropic Partial Strong

Tech Jacks Solutions AI News Coverage

Claude Fable 5 launched with an 80.3% SWE-Bench Pro score that multiple AI news outlets reported as the new coding benchmark leader. Three data sources, two of them independent, show the picture is more complicated, and the differences matter for any enterprise team using benchmark comparisons to decide whether to migrate. This deep-dive maps what each source actually measured, where they conflict, and what the Epoch AI gap means for teams that need a decision now.


SWE-Bench Verified (independent), 95.00%

Key Takeaways

The 80.3% SWE-Bench Pro figure is vendor-reported and was produced with Anthropic's own scaffolding; scaffold-dependent scores can't be directly compared to scores for models evaluated under neutral harnesses.
vals.ai (independent) reports 95.00% on SWE-Bench Verified and Artificial Analysis (independent) reports 1,932 on GDPval-AA; these are the defensible baseline figures for enterprise evaluation.
The GPT-5.5 vs. Fable 5 competitive ordering on SWE-Bench Pro is actively contested: available independent data doesn't confirm the 58.6% figure Anthropic's system card assigns to GPT-5.5.
Epoch AI's independent evaluation of Fable 5 is pending as of June 10, 2026; that's the neutral-harness comparison the field needs and currently doesn't have.
For production deployment decisions, run internal benchmarks on your task distribution; leaderboard comparisons in a contested scaffold-dependent benchmark category aren't a reliable proxy.

Verification

Partial
Anthropic system card (vendor-scaffolded) + vals.ai independent leaderboard + Artificial Analysis independent evaluation
SWE-Bench Pro 80.3% is scaffold-dependent. Epoch AI independent evaluation pending. GPT-5.5 comparison figure not independently confirmable.

Fable 5 Benchmark Scores by Source and Evaluation Type

Benchmark	Score	Source	Evaluation Type	Independent?
SWE-Bench Verified	95.00%	vals.ai	Independent leaderboard	Yes
GDPval-AA	1,932	Artificial Analysis	Independent evaluator	Yes
SWE-Bench Pro	80.3%	Anthropic system card	Vendor scaffold	No
SWE-Bench Pro (Opus 4.8)	69.2%	Anthropic system card	Vendor scaffold	No
SWE-Bench Pro (GPT-5.5)	58.6%	Anthropic system card only	Unconfirmed by independent sources	No

Self-reported benchmarks. Read carefully.

That’s the working posture for anyone evaluating Claude Fable 5’s coding performance right now. The 80.3% SWE-Bench Pro figure that anchored Anthropic’s launch announcement comes from Anthropic’s system card, using Anthropic’s own scaffolding to run the evaluation. Independent leaderboard data presents a messier picture: contested orderings, scaffold-dependent scores, and at least one major independent benchmarking organization whose evaluation is still pending. The gap between what Anthropic published and what neutral evaluators have confirmed is the most important thing an enterprise team can know before making a deployment decision based on these numbers.

What Each Source Actually Measured

Three data points are now available, two of them independent, and they don’t tell the same story.

vals.ai (independent leaderboard): Fable 5 scores 95.00% on SWE-Bench Verified, per the vals.ai leaderboard. SWE-Bench Verified is a different benchmark from SWE-Bench Pro: it uses a curated subset of software engineering problems verified for quality and solvability. vals.ai is a third-party leaderboard not operated by Anthropic, and that independence is meaningful. This is the benchmark figure enterprise teams can cite without a scaffold-dependency footnote.

Artificial Analysis (independent evaluator): Per Artificial Analysis’s independent evaluation, Fable 5 scores 1,932 on the GDPval-AA benchmark. Artificial Analysis runs evaluations across multiple frontier models and publishes comparative data, which makes it a credible source. The GDPval-AA benchmark measures general-purpose development productivity across a range of task types. It’s not identical to SWE-Bench, but it’s independently administered and Fable 5’s score is competitive.

Anthropic system card (vendor-reported, own scaffold): Anthropic’s announcement places Fable 5 at 80.3% on SWE-Bench Pro. This is where the scaffold dependency lives. SWE-Bench Pro evaluations require a scaffolding layer that manages how the model interacts with the test environment, and Anthropic ran its evaluation using its own scaffolding. Independent aggregators, including morphllm.com, note this explicitly: the 80.3% figure reflects performance under Anthropic’s evaluation conditions, not under a neutral harness that other models were also tested on. That’s not a disqualifying finding. It is a required footnote for any honest comparison.
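One way to keep these distinctions from getting lost in a comparison spreadsheet is to carry the source and its independence alongside every score. The following is a minimal sketch, not anyone's published tooling: the `Score` record, its field names, and both helper functions are illustrative assumptions, and the figures are the ones reported in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """One reported benchmark result (hypothetical record layout)."""
    model: str
    benchmark: str
    value: float
    source: str
    independent: bool  # True when the evaluator is not a model vendor


# Figures as reported in this article; nothing here is newly measured.
SCORES = [
    Score("Claude Fable 5", "SWE-Bench Verified", 95.00, "vals.ai", True),
    Score("Claude Fable 5", "GDPval-AA", 1932.0, "Artificial Analysis", True),
    Score("Claude Fable 5", "SWE-Bench Pro", 80.3, "Anthropic system card", False),
    Score("Claude Opus 4.8", "SWE-Bench Pro", 69.2, "Anthropic system card", False),
    Score("GPT-5.5", "SWE-Bench Pro", 58.6, "Anthropic system card", False),
]


def citable_without_footnote(scores):
    """Return only the scores that come from an independent evaluator."""
    return [s for s in scores if s.independent]


def comparison_caveats(a, b):
    """List the reasons a head-to-head comparison of two scores needs a footnote."""
    caveats = []
    if a.benchmark != b.benchmark:
        caveats.append("different benchmarks")
    if not (a.independent and b.independent):
        caveats.append("vendor-reported figure")
    return caveats


if __name__ == "__main__":
    for s in citable_without_footnote(SCORES):
        print(f"{s.benchmark}: {s.value} ({s.source})")
    # Fable 5 vs. GPT-5.5 on SWE-Bench Pro, both taken from the system card.
    print(comparison_caveats(SCORES[2], SCORES[4]))
```

The design choice is deliberately blunt: a comparison is not blocked, it is annotated, so the footnote travels with the number wherever it is cited.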

The Contested Leaderboard

The competitive picture at the top of SWE-Bench Pro is genuinely unsettled, and the unsettledness has a specific shape.

Multiple leaderboard entries now claim top-of-table positioning on coding benchmarks, and they can’t all be right. Anthropic’s system card places GPT-5.5 at 58.6% on SWE-Bench Pro, a figure that would make Fable 5’s 80.3% a commanding lead. But that 58.6% figure for GPT-5.5 isn’t independently confirmed in available cross-references. Independent aggregators instead show GPT-5.5 entries via DeepSWE at or near the top of the coding leaderboard by different measures, suggesting the competitive ordering flips depending on which harness and which benchmark variant you use.

Don’t expect a clean answer from any single leaderboard. The benchmark ecosystem for frontier coding models is currently in a contested state where the same model can rank first or fifth depending on evaluation methodology. This isn’t a scandal. It’s a known limitation of leaderboard-based model evaluation when scaffolding choices affect outcomes and vendors have an incentive to choose favorable conditions.

Disputed Claim

Claude Fable 5 leads SWE-Bench Pro at 80.3%, ahead of GPT-5.5 at 58.6%

80.3% uses Anthropic's own scaffolding. The 58.6% figure for GPT-5.5 isn't independently confirmed. DeepSWE entries show different competitive orderings. Leaderboard position is scaffold-dependent and contested.

Cite SWE-Bench Verified (vals.ai, 95.00%) and GDPval-AA (Artificial Analysis, 1,932) for independent baseline. Do not use SWE-Bench Pro figures in competitive comparisons until Epoch AI independent evaluation publishes.

Warning

Independent Epoch AI evaluation of Claude Fable 5 is pending as of June 10, 2026. Epoch AI is the primary neutral-harness authority for frontier model comparison. SWE-Bench Pro competitive rankings should be treated as provisional until Epoch AI publishes.

What it means practically: any team citing “Fable 5 is the best coding model” based on the 80.3% SWE-Bench Pro figure is citing a vendor-scaffolded result in a contested field. That’s a different claim than “Fable 5 scores 95.00% on SWE-Bench Verified per an independent leaderboard.” Both statements are defensible with attribution. Only one of them is independently verified.

What the Evaluation Hierarchy Says About These Numbers

The AI benchmarking field has developed an informal but useful hierarchy for evaluating benchmark credibility. Applied to Fable 5’s figures:

Tier 2 (independent evaluation, not peer-reviewed): GDPval-AA at 1,932 (Artificial Analysis) and SWE-Bench Verified at 95.00% (vals.ai). These figures come from organizations that don’t have a commercial interest in Anthropic’s results and use their own evaluation infrastructure. They’re not peer-reviewed academic papers, but they’re meaningfully more independent than vendor self-reports.

Tier 4 (vendor internal evaluation with vendor scaffold): SWE-Bench Pro at 80.3% (Anthropic’s own scaffolding). This is what Anthropic measured under Anthropic’s conditions. It may accurately reflect the model’s capabilities under those conditions. It doesn’t tell you how the model performs under a neutral harness, and that’s the comparison that matters when you’re evaluating against other models that were tested differently.

The pattern of contested benchmark leadership claims is familiar from prior pipeline cycles: MAI-Thinking-1, Claude Opus 4.8, and now Fable 5. The scaffold-dependency problem isn’t new. The field hasn’t solved it.

The Epoch AI Gap

Epoch AI is the closest thing frontier model evaluation has to a neutral authority. Its evaluations are independent, methodologically documented, and widely cited as a credible cross-model comparison source. As of June 10, 2026, Epoch AI has not published an independent evaluation of Claude Fable 5. The Epoch AI capabilities page for Fable 5 is unavailable; the URL is broken as of this writing.

That gap is the most important thing this brief can tell you. If your deployment decision depends on a SWE-Bench Pro comparison between Fable 5 and GPT-5.5, you’re comparing a vendor-scaffolded figure against an unconfirmed competitor figure, in a benchmark category where the competitive ordering is actively disputed. Epoch AI’s evaluation, when it publishes, will give you the neutral-harness comparison the current data can’t provide.

Until then: the vals.ai and Artificial Analysis numbers are what you have that’s independently grounded.

What to Watch

Epoch AI independent evaluation of Claude Fable 5: Unknown, pending as of 2026-06-10

Neutral-harness SWE-Bench Pro re-evaluation (third-party): Q3 2026 expected

DeepSWE methodology disclosure and harness documentation: Ongoing

OpenAI independent benchmark publication for GPT-5.5 under neutral conditions: Unknown

Unanswered Questions

Which scaffolding did Anthropic use for the SWE-Bench Pro evaluation, and has it been published for replication?
What harness did DeepSWE use to produce the GPT-5.5 results that contest Anthropic's 58.6% figure?
When will Epoch AI publish its independent evaluation of Fable 5, and will it include both SWE-Bench Pro and Verified variants?

What Teams Should Actually Do

The practical split is straightforward. Enterprise teams evaluating Fable 5 for production coding workflows have two defensible options right now.

First: use the independently verified figures as your benchmark baseline: SWE-Bench Verified at 95.00% (vals.ai) and GDPval-AA at 1,932 (Artificial Analysis). These don’t require a scaffold-dependency footnote. They won’t shift when Epoch AI publishes.

Second: run your own internal evaluation on your specific codebase and task distribution. The Stripe migration claim is striking: Stripe reported that Fable 5 completed a 50-million-line Ruby codebase migration in a single day, a task estimated to take a human team more than two months. It’s also Stripe-reported and distributed via Anthropic’s announcement materials. The corroboration is consistent but downstream. Stripe is a testing partner, not a neutral evaluator. Your codebase isn’t a 50-million-line Ruby monolith. Internal evaluation on your own task distribution is worth more than any leaderboard position for a production decision.
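An internal evaluation only helps if the result comes with an honest error bar, because internal task sets tend to be small. The sketch below is one way to report a pass rate with a Wilson score interval; the function names and the idea of recording one pass/fail boolean per internal task are assumptions for illustration, not a prescribed harness.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def pass_rate_report(results):
    """Summarize a list of booleans, one per internal task (True = task's tests passed)."""
    passed = sum(results)
    low, high = wilson_interval(passed, len(results))
    return {"pass_rate": passed / len(results), "ci_low": low, "ci_high": high}


if __name__ == "__main__":
    # Hypothetical run: 40 of 50 internal tasks passed.
    report = pass_rate_report([True] * 40 + [False] * 10)
    print(report)
```

With 40 of 50 tasks passing, the interval runs from roughly 0.67 to 0.89, so a few points of difference between two models on a task set this size would not be distinguishable from noise. That is an argument for running more tasks, not for falling back on a leaderboard.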

Two confirmed facts cut through the benchmark noise and matter for safety-sensitive deployments: Fable 5’s safeguard system triggers in fewer than 5% of sessions, and it falls back to Claude Opus 4.8 for high-risk queries. That’s confirmed from Anthropic’s primary announcement page, verbatim. It’s an architecturally significant design choice: the model is designed to be safe for general use, with Mythos 5 (same underlying model, safeguards lifted) reserved for vetted cyberdefenders under Project Glasswing. The safety architecture isn’t contested. The benchmark rankings are.

TJS Synthesis

The 80.3% SWE-Bench Pro figure will keep circulating in coverage of Fable 5 because it’s the largest number and it came first. Enterprise teams and practitioners using it for deployment comparisons should understand what they’re actually citing: a vendor-scaffolded evaluation in a benchmark category where the competitive ordering is genuinely contested and Epoch AI’s independent assessment is still pending.

The defensible decision path is this: treat SWE-Bench Verified at 95.00% (vals.ai) and GDPval-AA at 1,932 (Artificial Analysis) as your current baseline. Treat the SWE-Bench Pro 80.3% figure as a vendor claim requiring independent confirmation. Watch for Epoch AI’s evaluation: when it publishes, the field will have a neutral-harness comparison across multiple frontier models that settles the contested leaderboard question. Until then, run your own internal benchmarks on your task distribution. That’s the only comparison that’s actually about your deployment.
