Artificial Analysis confirmed it on May 28: Claude Opus 4.8 is their new top-ranked model, scoring 61.4 on the Intelligence Index – a 4.1-point gain over Opus 4.7. That’s a real, independently published result from a third-party evaluator. Everything else Anthropic announced requires closer reading.
Anthropic’s release reports 88.6% on SWE-bench Verified, 74.6% on Terminal-Bench 2.1, 45.7% on Humanity’s Last Exam under Adaptive Reasoning and Max Effort settings, and 47% on ITBench-AA. The vendor also states the model is four times less likely than Opus 4.7 to let bugs pass in generated code, with a fast mode running at 2.5× speed and inference costs 3× lower than prior model generations. Stated pricing is $5.00 per million input tokens and $25.00 per million output tokens, though the primary Anthropic source wasn’t accessible at time of publication – verify pricing at the live Anthropic API page before building cost models. Context window wasn’t disclosed at launch.
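For rough budgeting, the stated prices translate into a simple per-workload estimate. The sketch below is a minimal illustration in Python, assuming the unverified launch-day figures of $5.00 and $25.00 per million tokens. The function name `estimate_cost_usd` and the token counts are hypothetical, and the defaults should be replaced once pricing is confirmed on the live Anthropic API page.

```python
# Stated launch-day prices in USD per million tokens.
# Assumption: unverified at publication; confirm before relying on them.
INPUT_PRICE_PER_MTOK = 5.00
OUTPUT_PRICE_PER_MTOK = 25.00


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price: float = INPUT_PRICE_PER_MTOK,
    output_price: float = OUTPUT_PRICE_PER_MTOK,
) -> float:
    """Return the estimated cost in USD for one workload."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Sum the raw token-price products first, then scale to millions.
    total = input_tokens * input_price + output_tokens * output_price
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical agentic run: 200k tokens read, 50k tokens generated.
    print(f"${estimate_cost_usd(200_000, 50_000):.2f}")
```

Under those assumed prices, a run that reads 200,000 tokens and writes 50,000 comes to $2.25, with output tokens accounting for more than half of the total.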
None of those specific figures could be independently confirmed from available sources. That’s not unusual for a launch day. It is, however, something practitioners need to track before making platform decisions.
The benchmark naming matters more than it might seem. SWE-bench Verified and SWE-Bench Pro are distinct variants of the same benchmark family – they draw on different task sets and produce different numbers. Anthropic’s 88.6% figure references SWE-bench Verified. A cross-referenced report from Inc. cites 69.2% on SWE-Bench Pro. Those aren’t contradictory; they’re measuring different problems. Teams comparing Opus 4.8 against competitors need to check which variant each vendor is reporting before drawing conclusions.
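One way to keep variant mix-ups out of a comparison script is to tag every score with its benchmark variant and refuse to compare across variants. The following Python sketch is illustrative only: the `Score` class and `compare` function are hypothetical names, and the two data points are the Anthropic-reported and Inc.-cited figures quoted above, neither of which has been independently confirmed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str  # exact variant name, e.g. "SWE-bench Verified"
    percent: float
    source: str     # who reported the number


def compare(a: Score, b: Score) -> float:
    """Return a.percent - b.percent, but only for the same benchmark variant."""
    if a.benchmark.casefold() != b.benchmark.casefold():
        raise ValueError(
            f"cannot compare {a.benchmark!r} with {b.benchmark!r}: different variants"
        )
    return a.percent - b.percent


# Figures as quoted in this brief (vendor- and press-reported, unverified).
verified = Score("Claude Opus 4.8", "SWE-bench Verified", 88.6, "Anthropic")
pro = Score("Claude Opus 4.8", "SWE-Bench Pro", 69.2, "Inc.")

try:
    compare(verified, pro)
except ValueError as err:
    # The mismatch is rejected instead of yielding a misleading difference.
    print(err)
```

Keeping the `source` field next to each number also makes it easy to separate vendor-reported scores from independently published ones later.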
Disputed Claim
The catch is that launch-day benchmarks almost always favor the vendor’s chosen test conditions. Adaptive Reasoning and Max Effort settings on Humanity’s Last Exam (HLE), for example, represent an upper-bound evaluation mode – not a typical deployment configuration. The 45.7% figure is notable if replicated independently. At base settings, that number will likely look different.
For teams evaluating agentic coding platforms right now, the Artificial Analysis ranking is the most actionable data point available. It’s a composite index across multiple dimensions, independently scored, and published the same day as the launch. The specific SWE-bench and HLE figures from Anthropic are directional signals – worth noting, not worth building roadmaps around until independent replication arrives.
Don’t expect the Epoch AI evaluation to appear immediately. The Epoch AI benchmarks dashboard is the right place to check when a model-specific evaluation for Opus 4.8 becomes available, but no specific Opus 4.8 record was present at publication. That evaluation, when it appears, will be the one to anchor specific benchmark claims against.
One gap deserves more weight than it has received: Anthropic hasn’t disclosed context window limits for Opus 4.8. For agentic pipelines handling long-context tool chains, that’s a material gap. Plan around the unknown rather than assuming parity with Opus 4.7.
Unanswered Questions
- What is the context window for Claude Opus 4.8? Not disclosed at launch.
- What are the performance characteristics at base inference settings rather than Max Effort/Adaptive Reasoning mode?
- Has Epoch AI published a model-specific evaluation for Opus 4.8? Not available at publication – check epoch.ai/benchmarks.
This is the fifth brief in the Opus 4.8 cycle. Prior coverage addressed the core launch, upgrade cycle speed, tool-calling regression fixes, and honesty features. This brief covers the one angle those didn’t: what independent evaluators have actually confirmed versus what Anthropic claims – and what that gap means for teams making real decisions.
Wait for Epoch AI’s specific evaluation before migrating agentic coding workflows to Opus 4.8 based on the benchmark figures alone. The #1 independent ranking is real. The specific scores are Anthropic’s until proven otherwise.