February 24, 2025. Anthropic shipped two things at once: a model and a philosophy.
Claude 3.7 Sonnet was positioned by Anthropic as “the first hybrid reasoning model on the market”, a design that let developers choose between fast standard responses and a slower extended thinking mode that made step-by-step reasoning visible. That visibility was the architectural bet. Rather than hiding the chain of thought, Anthropic surfaced it.
Claude Code launched the same day as a limited research preview. The CLI tool was described by Anthropic as capable of navigating codebases autonomously, running git commands, and addressing bugs from the terminal, without a developer manually directing each step. It was agentic coding, offered as a command-line interface, before most teams had a clear framework for evaluating that category of tool.
Pricing landed at $3 per million input tokens and $15 per million output tokens, matching Claude 3.5 Sonnet. The model was available immediately on the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI.
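For a sense of what those rates mean in practice, here is a quick cost calculation. The per-million-token prices are from the launch pricing above; the example token counts are illustrative, not from Anthropic.

```python
# Token pricing quoted at launch: $3 per million input tokens,
# $15 per million output tokens.
INPUT_PER_MILLION = 3.00
OUTPUT_PER_MILLION = 15.00

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single API call at the launch rates."""
    return (input_tokens / 1_000_000 * INPUT_PER_MILLION
            + output_tokens / 1_000_000 * OUTPUT_PER_MILLION)

# Illustrative example: a 2,000-token prompt that produces an 8,000-token
# response (output-heavy, as extended thinking responses tend to be)
# costs 0.006 + 0.12 = $0.126.
print(round(request_cost_usd(2_000, 8_000), 3))  # → 0.126
```

Note the 5x asymmetry between input and output rates: long reasoning traces are billed as output, so the thinking mode's cost shows up on the expensive side of the ledger.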
Anthropic reported a score of 70.3% on SWE-bench Verified. That figure came from the company's own evaluation; it had not been independently verified at the time of launch, and no Epoch AI evaluation was available. Anthropic also claimed performance parity or superiority to OpenAI's o1-preview on coding benchmarks, an assertion based solely on Anthropic's internal evaluation at the time of release.
The extended thinking toggle was more than a UX choice. By letting a developer decide per-call whether to pay for slower step-by-step reasoning, Anthropic priced cognitive depth as an option rather than a fixed tier. That distinction matters for production systems where most requests are cheap and a minority need careful reasoning. Teams could route traffic accordingly instead of committing to a single model tier.
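The routing pattern described above can be sketched in a few lines. This is a minimal illustration, not Anthropic's documentation: it assumes the Anthropic Messages API shape in which extended thinking is enabled per call via a `thinking` parameter with a token budget, and the classifier deciding which requests need deep reasoning is entirely hypothetical.

```python
# Hypothetical per-call router: cheap default path for most traffic,
# extended thinking only where a request is judged to need it.
# The thinking parameter shape follows the Anthropic Messages API as
# documented for Claude 3.7 Sonnet; thresholds here are made up.

def build_request_params(prompt: str, needs_deep_reasoning: bool) -> dict:
    """Return keyword arguments for a messages.create-style call (sketch)."""
    params = {
        "model": "claude-3-7-sonnet-20250219",
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }
    if needs_deep_reasoning:
        # Extended thinking is opted into per call with an explicit token
        # budget; max_tokens must exceed that budget, since the visible
        # answer and the thinking trace share the output allowance.
        params["thinking"] = {"type": "enabled", "budget_tokens": 8192}
        params["max_tokens"] = 16384
    return params

# Most calls stay on the fast path; a minority pay for depth.
fast = build_request_params("Rename this variable.", needs_deep_reasoning=False)
slow = build_request_params("Find the race condition.", needs_deep_reasoning=True)
```

The design point is that the toggle lives in the request, not in the model choice, so a team can change its routing policy without re-benchmarking a different model.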
Looking back from April 2026, this launch is notable less for its benchmark claims than for the template it set: a reasoning toggle paired with an autonomous coding agent. That pairing became a pattern across vendors within twelve months, and the commercial question shifted from whether to ship reasoning to how to price it per call.