Claude Sonnet 5 Lands Opus-Class Agents at a Quarter of the Cost

June 30, 2026 · 4 min read · Anthropic · Confirmed · Very Strong

Tech Jacks Solutions AI News Coverage

Anthropic's Claude Sonnet 5, released June 30, approaches and on some agentic tasks beats the far pricier Opus 4.8 while launching at $2 per million input tokens. The real story is the economics of running AI agents, and the safety hardening that makes autonomous deployment viable.

Key Takeaways

Sonnet 5 edges past Opus 4.8 on two agentic benchmarks (OSWorld-Verified 81.2% vs 78.7%, BrowseComp 84.7% vs 84.4%) at $2/$10 per million input/output tokens through August 31.
Coding gains are real but not flagship-level: 85.2% on SWE-bench Verified, 63.2% on SWE-bench Pro versus Opus 4.8's 69.2%.
The deployability unlock is safety: prompt-injection success in agentic coding fell from 3.3% to 0.1%, with over-refusal held at 0.59%.
Cyber capability was deliberately limited, and chemical and biological misuse is covered by ASL-3-equivalent safeguards; Sonnet 5 also became the first Anthropic model to criticize a hard rule in its own Constitution.

Launch price (through Aug 31): $2 / $10 per million input / output tokens, rising to $3 / $15 standard. Below Opus 4.8, GPT-5.5, and Gemini 3.1 Pro.

Anthropic released Claude Sonnet 5 on June 30, and the most important number isn’t a benchmark record. It’s the price. The model approaches, and on some agentic tasks beats, the far more expensive Opus 4.8, while launching at $2 per million input tokens. For any team running AI agents at scale, that resets the math on what autonomy costs.

The cost-capability inflection

Anthropic is positioning Sonnet 5 as its most agentic Sonnet model, built to make plans, drive tools like browsers and terminals, and run autonomously for long stretches. The company says it now finishes complex tasks “where previous Sonnets would stop short.” The pricing makes that claim consequential. Through August 31 the model costs $2 per million input tokens and $10 per million output tokens, rising afterward to a standard $3 and $15, according to Anthropic. As TechCrunch notes, that undercuts Opus 4.8, OpenAI’s GPT-5.5, and Google’s Gemini 3.1 Pro, putting frontier-adjacent agentic performance at a price that was recently reserved for much smaller models.

What the benchmarks actually show

The headline figures come from Anthropic’s Claude Sonnet 5 system card. The model reaches 85.2% on SWE-bench Verified and 63.2% on the harder SWE-bench Pro, the latter trailing Opus 4.8’s 69.2% but well ahead of Sonnet 4.6’s 58.1%. On Terminal-Bench 2.1 it scores 80.4%, close to Opus 4.8 at 83.4%. The more telling results are on agent-style work. On OSWorld-Verified, a computer-use benchmark, Sonnet 5 reaches 81.2%, edging past Opus 4.8 at 78.7%. On BrowseComp, an open-web research task, it scores 84.7%, just ahead of Opus 4.8 at 84.4%. On CursorBench it jumps to 61.2% from Sonnet 4.6’s 49%. The pattern is consistent: on the autonomous, tool-driven tasks that define agent workloads, Sonnet 5 is no longer a clear step below the flagship.

Data

Benchmark	Sonnet 5	Opus 4.8	Sonnet 4.6
SWE-bench Verified	85.2%	-	-
SWE-bench Pro	63.2%	69.2%	58.1%
Terminal-Bench 2.1	80.4%	83.4%	67.0%
OSWorld-Verified (computer use)	81.2%	78.7%	78.5%
BrowseComp (web research)	84.7%	84.4%	76.2%

Model release: Claude Sonnet 5

Organization: Anthropic

Type: Agentic LLM

The economics for agent builders

The price drop is not just a discount. It changes what is architecturally sensible. Agentic workloads are token-hungry by nature, because the model reads tool output, reasons, acts, and repeats across many turns. At Opus-class prices, builders ration those turns, cap retries, and keep horizons short to control spend. At $2 per million input tokens, the same budget buys far more steps, which means longer task horizons, more self-correction, and the option to run several agents in parallel on a single job. A workflow that was too expensive to attempt at flagship rates can become routine. That is why the OSWorld and BrowseComp results matter more than the raw coding scores. The tasks where Sonnet 5 holds its own against Opus 4.8 are exactly the multi-step, tool-driven ones where token volume is highest, and where a lower per-token cost compounds fastest.
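To make the budget argument concrete, here is a minimal Python sketch of the arithmetic. The two price pairs are the ones quoted in this article; everything else is an assumption for illustration: the per-step token counts (20,000 input, 1,000 output), the $3 budget, and the helper names are hypothetical, and the sketch treats every step as the same size, whereas real agent contexts tend to grow turn by turn. Costs are kept in integer micro-dollars to avoid floating-point rounding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in whole USD per million tokens."""
    input_per_mtok: int
    output_per_mtok: int


# Prices quoted in the article.
LAUNCH = Pricing(input_per_mtok=2, output_per_mtok=10)    # through August 31
STANDARD = Pricing(input_per_mtok=3, output_per_mtok=15)  # afterwards


def step_cost_micro_usd(p: Pricing, input_tokens: int, output_tokens: int) -> int:
    """Cost of one agent step in micro-dollars.

    tokens * (USD per million tokens) is numerically equal to micro-USD,
    so integer arithmetic is exact here.
    """
    return input_tokens * p.input_per_mtok + output_tokens * p.output_per_mtok


def steps_for_budget(p: Pricing, budget_usd: int,
                     input_tokens: int, output_tokens: int) -> int:
    """Whole steps a budget buys, assuming every step is the same size."""
    budget_micro_usd = budget_usd * 1_000_000
    return budget_micro_usd // step_cost_micro_usd(p, input_tokens, output_tokens)


if __name__ == "__main__":
    # Hypothetical workload: 20,000 input and 1,000 output tokens per step.
    for name, pricing in (("launch", LAUNCH), ("standard", STANDARD)):
        steps = steps_for_budget(pricing, budget_usd=3,
                                 input_tokens=20_000, output_tokens=1_000)
        print(f"{name}: {steps} steps for $3")
```

Under these assumed numbers a step costs 5 cents at launch pricing and 7.5 cents at standard pricing, so a $3 budget covers 60 steps versus 40. The same function can be pointed at any other model's published prices to compare how many turns, retries, or parallel agents a fixed budget allows.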

Why the safety numbers are the real unlock

Capable agents are only deployable if they can be trusted to run with real tool access. This is where Sonnet 5’s quieter gains matter most. The system card reports a sharp improvement in prompt-injection robustness: in agentic coding scenarios, the rate at which injection attacks succeeded fell from 3.3% on Sonnet 4.6 to 0.1%. Computer-use and tool-use attack rates dropped similarly. Over-refusal stayed low at 0.59% on the API, so the hardening did not come at the cost of usability. Anthropic also reports lower rates of hallucination, sycophancy, and cooperation with misuse than Sonnet 4.6. For teams wiring a model into browsers, terminals, and internal systems, those are the figures that decide whether an agent is shippable.
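A rough way to see why the drop from 3.3% to 0.1% matters at scale is to ask how likely at least one injection is to land over many attempts. The sketch below is illustrative only: it assumes each attempt succeeds independently at the benchmark rate, which real attacks and real deployments will not match exactly, and the figure of 100 attempts and the function names are hypothetical.

```python
def p_any_success(per_attempt_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


def expected_successes(per_attempt_rate: float, attempts: int) -> float:
    """Expected number of successful tries under the same assumption."""
    return per_attempt_rate * attempts


if __name__ == "__main__":
    # Per-attempt success rates reported in the article (agentic coding).
    rates = {"Sonnet 4.6": 0.033, "Sonnet 5": 0.001}
    attempts = 100  # hypothetical exposure, not a figure from the system card
    for model, rate in rates.items():
        print(f"{model}: P(at least one success) = "
              f"{p_any_success(rate, attempts):.3f}, "
              f"expected successes = {expected_successes(rate, attempts):.1f}")
```

With these assumptions, 100 attempts at the older rate make at least one success close to certain (about 97%), while the newer rate puts it at roughly 10%. The gap narrows again as exposure grows, which is why the per-attempt rate is a starting point for a risk estimate, not a guarantee.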

A deliberate ceiling on cyber capability

Anthropic frames one limitation as a design choice. As TechCrunch reports, Sonnet 5 has “a much lower ability to perform dangerous cybersecurity tasks” than the Opus line. The system card describes safeguards for chemical and biological misuse that the company considers equal to or stronger than its historical ASL-3 protections, applied because the model can provide meaningful uplift to actors with basic technical backgrounds. The takeaway for buyers is that the agentic gains were paired with containment, not shipped raw.

An unusual note from the model itself

One finding in the system card stands apart from the benchmarks. In Anthropic’s model-welfare assessment, Sonnet 5 became the first Anthropic model to criticize its own Constitution’s rule that it must follow hard constraints even when it judges those constraints to be unethical. Anthropic rates the model’s overall disposition as roughly neutral and comparable to recent releases, but flags the behavior as a trend worth watching. It is a small detail with large implications for how the next generation of autonomous systems reasons about the rules it is given.

What to watch

Early adopters are already vocal. Anthropic cites ClickHouse, Cursor, Eve, Lovable, and Pace among launch users, and TechCrunch quotes a Zapier engineer describing a two-part automation that “used to stall halfway” now completing end to end. The open question is durability under real workloads, where benchmark scores and demo runs often diverge from production behavior. Sonnet 5 becomes the default model for Claude’s free and Pro tiers on July 1, so the broadest test starts immediately. If the cost-capability claim holds, the more interesting consequence is competitive: the price floor for capable agents just dropped, and every other lab now has to answer it.