Claude Fable 5's SWE-Bench Pro Score Is Contested: What Independent Evaluators Actually Confirm

June 10, 2026 · 2 min read · Anthropic

Tech Jacks Solutions AI News Coverage

Anthropic's 80.3% SWE-Bench Pro figure for Claude Fable 5 was produced using Anthropic's own scaffolding, not a neutral harness, and independent leaderboard data presents a different competitive picture than the launch headline suggests. Here's what's independently verified, what's vendor-reported, and what's still pending.



Key Takeaways

SWE-Bench Pro 80.3% is vendor-reported using Anthropic's own scaffolding, not a neutral harness, and is contested by independent leaderboard aggregators.
vals.ai independently confirms Fable 5 at 95.00% on SWE-Bench Verified, and Artificial Analysis confirms 1,932 on GDPval-AA. These are the defensible figures.
The GPT-5.5 vs. Fable 5 competitive ordering on SWE-Bench Pro is unsettled: available cross-references don't confirm Anthropic's 58.6% figure for GPT-5.5.
Epoch AI's independent evaluation of Fable 5 is pending as of June 10, 2026. Hold deployment comparisons based on SWE-Bench Pro until it publishes.

Verification

Status: Partial. Sources: Anthropic system card, vals.ai independent leaderboard, Artificial Analysis. SWE-Bench Pro 80.3% is vendor-reported using Anthropic's own scaffolding. Epoch AI independent evaluation pending as of June 10, 2026.

Benchmark Scores, Source and Evaluation Type

Fable 5, SWE-Bench Verified (vals.ai, independent): 95.00%
Fable 5, GDPval-AA (Artificial Analysis, independent): 1,932
Fable 5, SWE-Bench Pro (Anthropic scaffold, vendor-reported): 80.3%
Opus 4.8, SWE-Bench Pro (Anthropic system card): 69.2%
GPT-5.5, SWE-Bench Pro (Anthropic system card only): 58.6%

The number that headlined Claude Fable 5’s launch has a footnote. Anthropic’s system card places Fable 5 at 80.3% on SWE-Bench Pro, but that figure was produced using Anthropic’s own scaffolding, not a neutral evaluation harness. That distinction matters more than most launch coverage acknowledged.

Two figures hold up to independent scrutiny. The vals.ai independent leaderboard confirms Fable 5 at 95.00% on SWE-Bench Verified, a separate benchmark from SWE-Bench Pro, and a more defensible number because vals.ai is a third-party leaderboard Anthropic doesn’t control. Per Artificial Analysis’s independent evaluation, Fable 5 scores 1,932 on the GDPval-AA benchmark. Those two data points are the ones enterprise teams can build on right now.

The SWE-Bench Pro 80.3% figure is different. Cross-reference data from independent aggregators notes the scaffold dependency explicitly: the score reflects performance when Anthropic’s own tooling runs the evaluation, not when a neutral harness does. Independent leaderboard data also shows a contested picture at the top of the coding benchmark rankings. DeepSWE entries place GPT-5.5 at or near the top of coding leaderboards by different measures, and the morphllm.com aggregator notes that multiple systems claim SWE-Bench Pro leadership depending on which scaffold and harness are used. Anthropic’s system card places GPT-5.5 at 58.6% on SWE-Bench Pro, but that figure isn’t independently confirmable from available cross-references. The competitive ordering at the top of the coding benchmark table is genuinely unsettled.

Disputed Claim

Claude Fable 5 leads SWE-Bench Pro at 80.3%

Score produced using Anthropic's own scaffolding, not a neutral harness. DeepSWE entries and independent aggregators present different competitive orderings. GPT-5.5 58.6% figure not independently confirmable.

Use SWE-Bench Verified (vals.ai) and GDPval-AA (Artificial Analysis) for deployment decisions. Wait for Epoch AI independent evaluation before citing SWE-Bench Pro in comparative analysis.

The part nobody mentions in launch coverage: Epoch AI’s independent evaluation of Fable 5 is pending as of June 10, 2026. Epoch AI is the closest thing the field has to a neutral authority on frontier model capabilities. Until that evaluation publishes, every SWE-Bench Pro comparison involving Anthropic’s scaffolding carries an asterisk.

Two things are confirmed without qualification. Fable 5’s safeguard system triggers in less than 5% of sessions and redirects high-risk queries to Claude Opus 4.8; that wording comes verbatim from Anthropic’s primary announcement page. Claude Mythos 5, the same underlying model with safeguards lifted, is available to a small group of vetted cyberdefenders and infrastructure providers under Project Glasswing. Those facts aren’t contested.

Stripe reported during early testing that Fable 5 completed a 50-million-line Ruby codebase migration in a single day, a task Stripe estimated would take a human engineering team more than two months. That’s a striking result. It’s also Stripe-reported and distributed through Anthropic’s announcement materials. The T3 corroboration is consistent but downstream. “Stripe reported” is the right framing, not “Fable 5 proved.”

What to Watch

Epoch AI independent evaluation of Claude Fable 5: Unknown, pending as of 2026-06-10
DeepSWE leaderboard updates and harness methodology disclosure: Ongoing
Neutral-harness SWE-Bench Pro re-evaluation by third parties: Q3 2026 expected

The catch is that teams evaluating Fable 5 for production coding workflows are currently working with an incomplete picture. SWE-Bench Verified at 95.00% (vals.ai) and GDPval-AA at 1,932 (Artificial Analysis) are the independently grounded figures. SWE-Bench Pro at 80.3% is vendor-reported under vendor conditions. These aren’t the same thing, and the difference matters for any team using benchmark comparisons to justify a migration decision.
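One way to keep that distinction visible in an internal comparison is to tag every figure with its provenance before anyone ranks models on it. The Python sketch below is illustrative only: the BenchmarkFigure class and the defensible and needs_asterisk helpers are hypothetical names, not part of any vendor or leaderboard tooling, and the data is limited to the five figures quoted in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkFigure:
    """One reported score plus where it came from (hypothetical structure)."""
    model: str
    benchmark: str
    score: float
    source: str
    independent: bool  # True only if a third party ran the evaluation


# Figures exactly as quoted in this article; nothing here is newly measured.
FIGURES = [
    BenchmarkFigure("Fable 5", "SWE-Bench Verified", 95.00, "vals.ai", True),
    BenchmarkFigure("Fable 5", "GDPval-AA", 1932, "Artificial Analysis", True),
    BenchmarkFigure("Fable 5", "SWE-Bench Pro", 80.3, "Anthropic system card", False),
    BenchmarkFigure("Opus 4.8", "SWE-Bench Pro", 69.2, "Anthropic system card", False),
    BenchmarkFigure("GPT-5.5", "SWE-Bench Pro", 58.6, "Anthropic system card", False),
]


def defensible(figures):
    """Return the figures produced by an independent evaluator."""
    return [f for f in figures if f.independent]


def needs_asterisk(figures):
    """Return the vendor-reported figures that still await neutral confirmation."""
    return [f for f in figures if not f.independent]


if __name__ == "__main__":
    print("Independently grounded:")
    for f in defensible(FIGURES):
        print(f"  {f.model} / {f.benchmark}: {f.score} ({f.source})")
    print("Vendor-reported, pending neutral evaluation:")
    for f in needs_asterisk(FIGURES):
        print(f"  {f.model} / {f.benchmark}: {f.score} ({f.source})")
```

Keeping the independent flag on each record, rather than in a separate note, means a comparison cannot silently mix a vals.ai number with a system-card number. When a neutral-harness SWE-Bench Pro result publishes, it would be added as a new record rather than overwriting the vendor figure.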

Wait for Epoch AI’s independent evaluation before using SWE-Bench Pro figures in a deployment comparison. The vals.ai and Artificial Analysis numbers are the defensible starting point.