AI Models News: Claude Opus 4.8 Hits #1 on Artificial Analysis - But Most Benchmark Claims Are Vendor-Only

May 29, 2026 3 min read Artificial Analysis Partial Moderate

Tech Jacks Solutions AI News Coverage

Artificial Analysis independently ranked Claude Opus 4.8 as the top AI model on their Intelligence Index as of May 28, 2026, with a confirmed score of 61.4 - up 4.1 points from Opus 4.7. Most of Anthropic's specific benchmark figures, including an 88.6% SWE-bench Verified score, come from the vendor's own release and couldn't be confirmed from available independent sources.



Key Takeaways

Artificial Analysis independently confirmed Claude Opus 4.8 as the #1 model on their Intelligence Index (score: 61.4, +4.1 vs. Opus 4.7) as of May 28, 2026
Anthropic reports 88.6% on SWE-bench Verified and 45.7% on Humanity's Last Exam - neither figure was independently confirmed from available sources; both require vendor attribution
SWE-bench Verified and SWE-Bench Pro are distinct benchmark variants producing different scores (88.6% vs. 69.2%) - teams comparing models across vendors must verify which variant each claims
Context window not disclosed at launch; Epoch AI's model-specific evaluation was not available at publication - check epoch.ai/benchmarks for updates

Model Release

Claude Opus 4.8

Organization: Anthropic

Type: LLM (Flagship)

Parameters: Not disclosed

Benchmark: Artificial Analysis Intelligence Index: 61.4 (#1 ranked). [SELF-REPORTED] SWE-bench Verified: 88.6%; HLE (Adaptive Reasoning, Max Effort): 45.7%

Availability: Anthropic API; reportedly Amazon Bedrock and Google Cloud Vertex AI (vendor-stated, source not independently confirmed)

Artificial Analysis confirmed it on May 28: Claude Opus 4.8 is their new top-ranked model, scoring 61.4 on the Intelligence Index – a 4.1-point gain over Opus 4.7. That’s a real, independently published result from a third-party evaluator. Everything else Anthropic announced requires closer reading.

Anthropic’s release reports 88.6% on SWE-bench Verified, 74.6% on Terminal-Bench 2.1, 45.7% on Humanity’s Last Exam under Adaptive Reasoning and Max Effort settings, and 47% on ITBench-AA. The vendor also states the model is four times less likely to allow bugs to pass in generated code compared to Opus 4.7, with a fast mode running at 2.5× speed and inference costs 3× lower than prior model generations. Stated pricing is $5.00 per million input tokens and $25.00 per million output tokens, though the primary Anthropic source wasn’t accessible at time of publication – verify pricing at the live Anthropic API page before building cost models. Context window wasn’t disclosed at launch.
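For a first-pass cost model, the stated prices can be turned into a small estimator. This is a minimal sketch, not an official calculator: the two price constants are the vendor-stated figures quoted above and remain unverified, and the function name and the example workload are hypothetical.

```python
# Vendor-stated prices in USD per million tokens (not independently confirmed;
# check the live Anthropic API pricing page before relying on them).
INPUT_USD_PER_MTOK = 5.00
OUTPUT_USD_PER_MTOK = 25.00


def estimate_cost_usd(input_tokens, output_tokens,
                      input_price=INPUT_USD_PER_MTOK,
                      output_price=OUTPUT_USD_PER_MTOK):
    """Return the estimated cost in USD for a given token volume."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 400k output tokens.
print(estimate_cost_usd(2_000_000, 400_000))  # prints 20.0
```

Keeping the prices as overridable parameters makes it easy to rerun the estimate once the pricing is confirmed.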

None of those specific figures could be independently confirmed from available sources. That’s not unusual for a launch day. It is, however, something practitioners need to track before making platform decisions.

The benchmark naming matters more than it might seem. SWE-bench Verified and SWE-Bench Pro are distinct variants of the same benchmark family – they use different task sets and produce different numbers. Anthropic’s 88.6% figure references SWE-bench Verified. Cross-referenced coverage from Inc. cites 69.2% on SWE-Bench Pro. Those aren’t contradictory; they’re scores on different tests. Teams comparing Opus 4.8 against competitors need to check which variant each vendor is reporting before drawing conclusions.
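One way to enforce that check in a comparison spreadsheet or script is to record the exact variant alongside every score and refuse to compare mismatched ones. The sketch below is illustrative only: the `BenchmarkClaim` type and the `comparable` helper are hypothetical names, and the two records simply restate the figures quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkClaim:
    model: str
    benchmark: str  # exact variant name as reported
    score: float    # percent
    source: str     # who reported the figure


def normalize(name):
    """Lowercase and drop non-alphanumeric characters so casing and hyphens don't matter."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def comparable(a, b):
    """Two claims are comparable only if they name the same benchmark variant."""
    return normalize(a.benchmark) == normalize(b.benchmark)


verified = BenchmarkClaim("Claude Opus 4.8", "SWE-bench Verified", 88.6, "Anthropic (vendor)")
pro = BenchmarkClaim("Claude Opus 4.8", "SWE-Bench Pro", 69.2, "Inc. (cross-reference)")

print(comparable(verified, pro))  # prints False: different variants
```

Matching on the variant name is a necessary check, not a sufficient one: evaluation settings can still differ between two reports of the same variant.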

Disputed Claim

88.6% on SWE-bench Verified; 4× fewer bugs than Opus 4.7; 2.5× fast mode speed; 3× cost reduction

All figures are vendor-reported via Anthropic's release. Primary source was inaccessible at publication. Cross-references returned a different benchmark variant (SWE-Bench Pro, 69.2%) that doesn't confirm or contradict the SWE-bench Verified claim.

Treat as directional signals. Wait for Epoch AI or Artificial Analysis model-specific evaluation before anchoring platform decisions to these numbers.

The catch is that launch-day benchmarks almost always favor the vendor’s chosen test conditions. Adaptive Reasoning and Max Effort settings on HLE, for example, represent an upper-bound evaluation mode – not a typical deployment configuration. The 45.7% figure is notable if replicated independently. At base settings, that number will likely look different.

For teams evaluating agentic coding platforms right now, the Artificial Analysis ranking is the most actionable data point available. It’s a composite index across multiple dimensions, independently scored, and published the same day as the launch. The specific SWE-bench and HLE figures from Anthropic are directional signals – worth noting, not worth building roadmaps around until independent replication arrives.

Don’t expect the Epoch AI evaluation to appear immediately. The Epoch AI benchmarks dashboard is the right place to check when a model-specific evaluation for Opus 4.8 becomes available, but no specific Opus 4.8 record was present at publication. That evaluation, when it appears, will be the one to anchor specific benchmark claims against.

The part nobody mentions: Anthropic hasn’t disclosed context window limits for Opus 4.8. For agentic pipelines handling long-context tool chains, that’s a material gap. Plan around the unknown rather than assuming parity with Opus 4.7.
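In pipeline code, planning around the unknown can mean modelling the limit as missing and falling back to a budget the team picks itself. This is a rough sketch under stated assumptions: the constant, the function, and the 100,000-token fallback in the example are all hypothetical and are not taken from Anthropic documentation.

```python
from typing import Optional

# Not disclosed at launch, so the limit is modelled as unknown rather than guessed.
OPUS_4_8_CONTEXT_WINDOW: Optional[int] = None


def fits_context(prompt_tokens: int,
                 context_window: Optional[int],
                 fallback_budget: int) -> bool:
    """Check a prompt against the disclosed limit, or a team-chosen budget if unknown."""
    limit = context_window if context_window is not None else fallback_budget
    return prompt_tokens <= limit


# Hypothetical example: a 150k-token tool chain against a 100k fallback budget.
print(fits_context(150_000, OPUS_4_8_CONTEXT_WINDOW, fallback_budget=100_000))  # prints False
```

Once a limit is published, only the constant needs to change; the fallback stays in place for any other model whose window is undisclosed.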

Unanswered Questions

What is the context window for Claude Opus 4.8? Not disclosed at launch.
What are the performance characteristics at base inference settings rather than Max Effort/Adaptive Reasoning mode?
Has Epoch AI published a model-specific evaluation for Opus 4.8? Not available at publication - check epoch.ai/benchmarks.

This is the fifth brief in the Opus 4.8 cycle. Prior coverage addressed the core launch, upgrade cycle speed, tool-calling regression fixes, and honesty features. This brief covers the one angle those didn’t: what independent evaluators have actually confirmed versus what Anthropic claims – and what that gap means for teams making real decisions.

Wait for Epoch AI’s specific evaluation before migrating agentic coding workflows to Opus 4.8 based on the benchmark figures alone. The #1 independent ranking is real. The specific scores are Anthropic’s until proven otherwise.