The number that headlined Claude Fable 5’s launch has a footnote. Anthropic’s system card places Fable 5 at 80.3% on SWE-Bench Pro, but that figure was produced using Anthropic’s own scaffolding, not a neutral evaluation harness. That distinction matters more than most launch coverage acknowledged.
Two figures hold up to independent scrutiny. The vals.ai leaderboard confirms Fable 5 at 95.0% on SWE-Bench Verified, a separate benchmark from SWE-Bench Pro. It is the more defensible number because vals.ai is a third-party leaderboard Anthropic doesn’t control. Per Artificial Analysis’s independent evaluation, Fable 5 scores 1,932 on the GDPval-AA benchmark. Those two data points are the ones enterprise teams can build on right now.
The SWE-Bench Pro 80.3% figure is different. Cross-reference data from independent aggregators notes the scaffold dependency explicitly: the score reflects performance when Anthropic’s own tooling runs the evaluation, not when a neutral harness does. Independent leaderboard data also shows a contested picture at the top of the coding benchmark rankings. DeepSWE entries place GPT-5.5 at or near the top of coding leaderboards by different measures, and the morphllm.com aggregator notes that multiple systems claim SWE-Bench Pro leadership depending on which scaffold and harness are used. Anthropic’s system card places GPT-5.5 at 58.6% on SWE-Bench Pro, but that figure isn’t independently confirmable from available cross-references. The competitive ordering at the top of the coding benchmark table is genuinely unsettled.
Disputed Claim
The part nobody mentions in launch coverage: Epoch AI’s independent evaluation of Fable 5 is pending as of June 10, 2026. Epoch AI is the closest thing the field has to a neutral authority on frontier model capabilities. Until that evaluation publishes, every SWE-Bench Pro comparison involving Anthropic’s scaffolding carries an asterisk.
Two things are confirmed without qualification. Fable 5’s safeguard system triggers in less than 5% of sessions and redirects high-risk queries to Claude Opus 4.8; that wording comes verbatim from Anthropic’s primary announcement page. Claude Mythos 5 (the same underlying model with safeguards lifted) is available to a small group of vetted cyberdefenders and infrastructure providers under Project Glasswing. Those facts aren’t contested.
Stripe reported during early testing that Fable 5 completed a 50-million-line Ruby codebase migration in a single day, a task Stripe estimated would take a human engineering team more than two months. That’s a striking result, but it isn’t a benchmark: it’s Stripe-reported and distributed through Anthropic’s announcement materials. The T3 corroboration is consistent but downstream. “Stripe reported” is the right framing, not “Fable 5 proved.”
What to Watch
The catch is that teams evaluating Fable 5 for production coding workflows are currently working with an incomplete picture. SWE-Bench Verified at 95.0% (vals.ai) and GDPval-AA at 1,932 (Artificial Analysis) are the independently grounded figures. SWE-Bench Pro at 80.3% is vendor-reported under vendor conditions. These aren’t the same thing, and the difference matters for any team using benchmark comparisons to justify a migration decision.
Wait for Epoch AI’s independent evaluation before using SWE-Bench Pro figures in a deployment comparison. The vals.ai and Artificial Analysis numbers are the defensible starting point.
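For teams that keep benchmark figures in a comparison sheet, the rule above can be written down as a small provenance filter. What follows is a minimal sketch, not a standard tool: the `BenchmarkFigure` record and the `defensible` and `needs_asterisk` helpers are hypothetical names invented for illustration, and the only data in it are the three Fable 5 figures cited in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkFigure:
    """One reported score plus who produced it (hypothetical record type)."""
    benchmark: str
    score: float
    reported_by: str
    independent: bool  # True when the evaluator is not the model vendor


# The three Fable 5 figures discussed in this article.
FIGURES = [
    BenchmarkFigure("SWE-Bench Verified", 95.0, "vals.ai", True),
    BenchmarkFigure("GDPval-AA", 1932, "Artificial Analysis", True),
    BenchmarkFigure("SWE-Bench Pro", 80.3, "Anthropic system card", False),
]


def defensible(figures):
    """Figures produced by a third party; usable in a comparison today."""
    return [f for f in figures if f.independent]


def needs_asterisk(figures):
    """Vendor-reported figures; hold until an independent evaluation lands."""
    return [f for f in figures if not f.independent]


if __name__ == "__main__":
    for f in defensible(FIGURES):
        print(f"use:  {f.benchmark} = {f.score} ({f.reported_by})")
    for f in needs_asterisk(FIGURES):
        print(f"hold: {f.benchmark} = {f.score} ({f.reported_by})")
```

The point of the `independent` flag is that provenance travels with the number, so a vendor-scaffolded score can’t quietly end up in the same column as a third-party one.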