Anthropic released Claude Sonnet 5 on June 30, and the most important number isn’t a benchmark record. It’s the price. The model approaches the far more expensive Opus 4.8, and beats it on some agentic tasks, while launching at an introductory $2 per million input tokens. For any team running AI agents at scale, that resets the math on what autonomy costs.
The cost-capability inflection
Anthropic is positioning Sonnet 5 as its most agentic Sonnet model, built to make plans, drive tools like browsers and terminals, and run autonomously for long stretches. The company says it now finishes complex tasks “where previous Sonnets would stop short.” The pricing makes that claim consequential. Through August 31 the model costs $2 per million input tokens and $10 per million output tokens, rising afterward to a standard $3 and $15, according to Anthropic. As TechCrunch notes, that undercuts Opus 4.8, OpenAI’s GPT-5.5, and Google’s Gemini 3.1 Pro, putting frontier-adjacent agentic performance at a price that was recently reserved for much smaller models.
What the benchmarks actually show
The headline figures come from Anthropic’s Claude Sonnet 5 system card. The model reaches 85.2% on SWE-bench Verified and 63.2% on the harder SWE-bench Pro, the latter trailing Opus 4.8’s 69.2% but well ahead of Sonnet 4.6’s 58.1%. On Terminal-Bench 2.1 it scores 80.4%, close to Opus 4.8 at 83.4%.
The more telling results are on agent-style work. On OSWorld-Verified, a computer-use benchmark, Sonnet 5 reaches 81.2%, edging past Opus 4.8 at 78.7%. On BrowseComp, an open-web research task, it scores 84.7%, just ahead of Opus 4.8 at 84.4%. On CursorBench it jumps to 61.2% from Sonnet 4.6’s 49%. The pattern is consistent: on the autonomous, tool-driven tasks that define agent workloads, Sonnet 5 is no longer a clear step below the flagship.
| Benchmark | Sonnet 5 | Opus 4.8 | Sonnet 4.6 |
|---|---|---|---|
| SWE-bench Verified | 85.2% | - | - |
| SWE-bench Pro | 63.2% | 69.2% | 58.1% |
| Terminal-Bench 2.1 | 80.4% | 83.4% | 67.0% |
| OSWorld-Verified (computer use) | 81.2% | 78.7% | 78.5% |
| BrowseComp (web research) | 84.7% | 84.4% | 76.2% |
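
The gaps in the table can be read off with a few lines of Python. This is a minimal sketch that uses only the scores listed above; the dictionary layout and the `gap_vs_opus` helper name are illustrative choices, not anything taken from Anthropic’s system card.

```python
# Scores (percent) copied from the table above, as (Sonnet 5, Opus 4.8).
# SWE-bench Verified is omitted because no Opus 4.8 figure is listed.
SCORES = {
    "SWE-bench Pro": (63.2, 69.2),
    "Terminal-Bench 2.1": (80.4, 83.4),
    "OSWorld-Verified": (81.2, 78.7),
    "BrowseComp": (84.7, 84.4),
}


def gap_vs_opus(scores):
    """Return Sonnet 5 minus Opus 4.8, in percentage points, per benchmark."""
    return {name: round(s5 - opus, 1) for name, (s5, opus) in scores.items()}


GAPS = gap_vs_opus(SCORES)

if __name__ == "__main__":
    for name, gap in GAPS.items():
        leader = "Sonnet 5" if gap > 0 else "Opus 4.8"
        print(f"{name}: {gap:+.1f} points ({leader} ahead)")
```

Positive values mean Sonnet 5 leads. By these figures it is ahead on the two agent-style benchmarks and behind on the two coding ones.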
The economics for agent builders
The price drop is not just a discount. It changes what is architecturally sensible. Agentic workloads are token-hungry by nature, because the model reads tool output, reasons, acts, and repeats across many turns. At Opus-class prices, builders ration those turns, cap retries, and keep horizons short to control spend. At $2 per million input tokens, the same budget buys far more steps, which means longer task horizons, more self-correction, and the option to run several agents in parallel on a single job. A workflow that was too expensive to attempt at flagship rates can become routine. That is why the OSWorld and BrowseComp results matter more than the raw coding scores. The tasks where Sonnet 5 holds its own against Opus 4.8 are exactly the multi-step, tool-driven ones where token volume is highest, and where a lower per-token cost compounds fastest.
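
To make the budget arithmetic concrete, here is a minimal Python sketch. The two price tiers are the ones quoted earlier; the per-turn token counts and the one-dollar budget are hypothetical numbers chosen for illustration, and real agent runs usually grow their context from turn to turn, which this ignores.

```python
# USD per million tokens, as quoted in the article. A price of $2 per million
# tokens equals 2 micro-dollars per token, so integer math avoids float rounding.
PROMO = {"input": 2, "output": 10}      # introductory rate through August 31
STANDARD = {"input": 3, "output": 15}   # standard rate afterward


def turn_cost_micro_usd(input_tokens, output_tokens, prices):
    """Cost of one agent turn in micro-dollars (millionths of a dollar)."""
    return input_tokens * prices["input"] + output_tokens * prices["output"]


def turns_within_budget(budget_usd, input_tokens, output_tokens, prices):
    """Whole turns a budget covers, assuming a fixed token load per turn."""
    budget_micro_usd = budget_usd * 1_000_000
    return budget_micro_usd // turn_cost_micro_usd(input_tokens, output_tokens, prices)


# Hypothetical workload: 20,000 input and 1,000 output tokens per turn, $1 budget.
PROMO_TURNS = turns_within_budget(1, 20_000, 1_000, PROMO)
STANDARD_TURNS = turns_within_budget(1, 20_000, 1_000, STANDARD)
```

Under these assumptions a dollar covers 20 turns at the introductory rate and 13 at the standard rate. Opus 4.8 is left out of the comparison because its price is not quoted here.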
Why the safety numbers are the real unlock
Capable agents are only deployable if they can be trusted to run with real tool access. This is where Sonnet 5’s quieter gains matter most. The system card reports a sharp improvement in prompt-injection robustness: in agentic coding scenarios, the rate at which injection attacks succeeded fell from 3.3% on Sonnet 4.6 to 0.1% on Sonnet 5. Computer-use and tool-use attack rates dropped similarly. Over-refusal stayed low at 0.59% on the API, so the hardening did not come at the cost of usability. Anthropic also reports lower rates of hallucination, sycophancy, and cooperation with misuse than Sonnet 4.6. For teams wiring a model into browsers, terminals, and internal systems, those are the figures that decide whether an agent is shippable.
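
A quick calculation shows the size of the prompt-injection drop. The sketch below uses only the two rates reported above for agentic coding scenarios; the 10,000-attempt volume is a hypothetical figure for illustration, and the function names are illustrative too.

```python
from fractions import Fraction

# Attack success rates in agentic coding scenarios, as reported above.
SONNET_4_6_RATE = Fraction(33, 1000)  # 3.3%
SONNET_5_RATE = Fraction(1, 1000)     # 0.1%


def relative_reduction(before, after):
    """Fraction of the original success rate that was eliminated."""
    return (before - after) / before


def expected_successes(rate, attempts):
    """Expected successful attacks if the rate held across all attempts."""
    return rate * attempts


REDUCTION = relative_reduction(SONNET_4_6_RATE, SONNET_5_RATE)

if __name__ == "__main__":
    print(f"Relative reduction: {float(REDUCTION):.1%}")
    # Hypothetical volume of 10,000 injection attempts.
    print(expected_successes(SONNET_4_6_RATE, 10_000), "->",
          expected_successes(SONNET_5_RATE, 10_000))
```

That works out to a relative reduction of about 97%. If the reported rates carried over unchanged to 10,000 attempts, roughly 330 successful attacks would fall to about 10.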
A deliberate ceiling on cyber capability
Anthropic frames one limitation as a design choice. As TechCrunch reports, Sonnet 5 has “a much lower ability to perform dangerous cybersecurity tasks” than the Opus line. The system card describes safeguards for chemical and biological misuse that the company considers equal to or stronger than its historical ASL-3 protections, applied because the model can provide meaningful uplift to actors with basic technical backgrounds. The takeaway for buyers is that the agentic gains were paired with containment, not shipped raw.
An unusual note from the model itself
One finding in the system card stands apart from the benchmarks. In Anthropic’s model-welfare assessment, Sonnet 5 became the first model to criticize its own Constitution’s rule that it must follow hard constraints even when it judges those constraints to be unethical. Anthropic rates the model’s overall disposition as roughly neutral and comparable to recent releases, but flags the behavior as a trend worth watching. It is a small detail with large implications for how the next generation of autonomous systems reasons about the rules it is given.
What to watch
Early adopters are already vocal. Anthropic cites ClickHouse, Cursor, Eve, Lovable, and Pace among launch users, and TechCrunch quotes a Zapier engineer describing a two-part automation that “used to stall halfway” now completing end to end. The open question is durability under real workloads, where benchmark scores and demo runs often diverge from production behavior. Sonnet 5 becomes the default model for Claude’s free and Pro tiers on July 1, so the broadest test starts immediately. If the cost-capability claim holds, the more interesting consequence is competitive: the price floor for capable agents just dropped, and every other lab now has to answer it.