Technology · Deep Dive · Vendor Claim

GPT-5.4 Unifies Computer Use, Reasoning, and Coding: What It Means for the Agent Race

TechCrunch Confirmed
OpenAI released GPT-5.4 on March 5 as its first model to combine reasoning, coding, and native computer-use capabilities in a single system. OpenAI reports the model scored 75.0% on the OSWorld-Verified benchmark, which it states surpasses the human baseline of 72.4%. The release marks a structural shift in how frontier labs compete: the race is no longer about isolated model quality, but about which unified system can reliably operate computers, write code, and reason through multi-step tasks without human intervention.

Three days after shipping GPT-5.3 Instant as a tone and accuracy fix for ChatGPT, OpenAI dropped something much larger. GPT-5.4 launched on March 5, 2026, available in three variants: standard, Thinking, and Pro. It is OpenAI’s first model to include native computer-use capabilities alongside advanced reasoning and code generation, all accessible through a 1 million token context window on the API.

The numbers tell part of the story. OpenAI reports a 75.0% success rate on OSWorld-Verified, which it states surpasses the human baseline of 72.4%. GPT-5.2 scored 47.3% on the same evaluation. According to OpenAI’s GDPval evaluation, GPT-5.4 reached 83.0%. The company also states the model produces 33% fewer false individual claims and 18% fewer error-containing responses than GPT-5.2, while using up to 47% fewer tokens on some tasks. Mercor CEO Brendan Foody confirmed that GPT-5.4 ranked first on the company’s APEX-Agents benchmark.

Every one of those benchmark figures is self-reported by OpenAI or sourced from a single vendor evaluation. No independent reproduction has been published during the reporting period. The numbers deserve attention, but they deserve vendor attribution too.

The computer-use competitive landscape

GPT-5.4’s native computer use puts OpenAI into direct competition with Anthropic’s computer-use capability for Claude and Google’s Project Mariner. The difference is framing. Anthropic shipped computer use as a beta feature layered on top of Claude. Google positioned Mariner as a Chrome-based research assistant. OpenAI built computer use into the model architecture itself, making it available across all three GPT-5.4 variants rather than as an add-on.

That architectural decision matters for developers building agent workflows. A unified model that can reason through a problem, write code to solve it, and operate a computer interface to execute it removes the need to chain separate models together. It does not eliminate the need for guardrails, error handling, or human oversight. But it changes the integration complexity from multi-model orchestration to single-model prompting.
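To make that integration difference concrete, here is a minimal Python sketch of the two patterns. The Step type, the call_model() helper, and the model names are illustrative placeholders, not OpenAI’s actual API surface; only the control flow is the point.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str      # "reason", "code", or "computer_action"
    payload: str


def call_model(model: str, task: str) -> list[Step]:
    """Stand-in for a real API call; returns the steps a model proposes."""
    return [
        Step("reason", f"[{model}] plan for: {task}"),
        Step("code", "print('filled in the expense form')"),
        Step("computer_action", "click(submit_button)"),
    ]


def orchestrated(task: str) -> list[Step]:
    """Multi-model orchestration: route each step type to a specialist model."""
    plan = call_model("reasoning-model", task)
    results: list[Step] = []
    for step in plan:
        if step.kind == "code":
            results += call_model("coding-model", step.payload)
        elif step.kind == "computer_action":
            results += call_model("computer-use-model", step.payload)
        else:
            results.append(step)
    return results


def unified(task: str) -> list[Step]:
    """Unified model: one call, one set of guardrails, one place to handle errors."""
    # "gpt-5.4" is the article's name for the model; unconfirmed as an API identifier.
    return call_model("gpt-5.4", task)


if __name__ == "__main__":
    print(unified("file the expense report"))
```

The orchestrated version is not wrong; it is simply more code to own, and every hand-off between specialist models is another place where context can be dropped.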

OpenAI also introduced a Tool Search system for the API, which it says improves how the model selects and calls tools during agent workflows. For teams building production agents, that would reduce the prompt engineering overhead of managing large tool inventories. OpenAI’s safety evaluation reports reduced chain-of-thought deception risk in the Thinking variant, an important signal for developers who need to trust the model’s intermediate reasoning steps.
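The Tool Search interface itself is not documented in the material covered here, so the sketch below shows only the problem it targets: the client-side narrowing step that teams currently write themselves once the tool inventory gets large. The tool definitions and the keyword-overlap scoring are illustrative assumptions, not the actual feature.

```python
# Illustrative tool inventory; in production this list can run to hundreds of entries.
TOOL_INVENTORY = [
    {"name": "read_spreadsheet", "description": "open a spreadsheet and return cell values"},
    {"name": "write_spreadsheet", "description": "write values into spreadsheet cells"},
    {"name": "fetch_filing", "description": "download a public company filing"},
    {"name": "send_email", "description": "send an email to a recipient"},
]


def select_tools(task: str, inventory: list[dict], limit: int = 3) -> list[dict]:
    """Naive keyword-overlap scoring; a real system would likely use embeddings."""
    task_words = set(task.lower().split())
    scored = sorted(
        inventory,
        key=lambda tool: len(task_words & set(tool["description"].lower().split())),
        reverse=True,
    )
    return scored[:limit]


task = "download the latest filing and write revenue figures into the model spreadsheet"
tools_for_request = select_tools(task, TOOL_INVENTORY)
print([t["name"] for t in tools_for_request])
# These are the definitions you would attach to the request. A server-side Tool
# Search feature moves this narrowing step out of application code.
```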

The enterprise play

GPT-5.4’s tiering tells you where OpenAI sees the money. The Thinking variant is available to Plus, Team, and Pro subscribers. The Pro variant is restricted to Pro, Enterprise, and API users. Free-tier users get auto-routed access to standard GPT-5.4. The premium variants are gated behind the subscription tiers that enterprise buyers occupy.

OpenAI’s internal investment banking benchmark showed performance jumping from 43.7% to 87.3% with GPT-5.4. That figure is an OpenAI internal evaluation with no independent verification, but it signals the use case OpenAI is targeting: financial professionals who need a model that can read documents, run calculations, and operate spreadsheet interfaces. The ChatGPT for Excel integration with data providers like FactSet, LSEG, Daloopa, and S&P adds practical infrastructure around that ambition.

For enterprise technology buyers evaluating model vendors, the question is not whether GPT-5.4’s benchmarks hold up. It is whether a single model that handles reasoning, coding, and computer use reduces the total cost of building AI-powered workflows compared to stitching together specialized models from different providers.

What developers need to decide now

GPT-5.2 Thinking retires in three months. The Codex app shipped on Windows the same day, enabling parallel coding agents in isolated worktrees. And OpenAI retired six legacy models on March 6: GPT-4o, GPT-4.1, GPT-4.1 mini, o4-mini, GPT-5 Instant, and GPT-5 Thinking. That is a lot of migration pressure in a single week.

The legacy retirements are not optional. Teams still on GPT-4o or GPT-4.1 are already past end-of-life. Teams on GPT-5.2 Thinking have a 90-day window before they face the same deadline.
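A small amount of indirection absorbs most of that churn. The sketch below assumes model identifiers live behind a single alias map rather than being hard-coded at every call site; the identifier strings are illustrative, not confirmed API model names.

```python
# Keep concrete model IDs in one place so a retirement becomes a one-line change.
MODEL_ALIASES = {
    "default": "gpt-5.4",                     # illustrative identifier
    "reasoning": "gpt-5.4-thinking",          # illustrative identifier
    "legacy-reasoning": "gpt-5.2-thinking",   # retires in roughly 90 days per the article
}


def resolve_model(alias: str) -> str:
    """Look up the concrete model ID used for a given role in the codebase."""
    return MODEL_ALIASES[alias]


print(resolve_model("reasoning"))
```

Call sites ask for a role such as "reasoning" rather than a version string, so the next retirement becomes a config edit instead of a code audit.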

All of this lands against a user-trust backdrop worth noting. TechCrunch reported that ChatGPT uninstalls surged 295% following OpenAI’s Department of Defense deal. OpenAI is simultaneously shipping its most capable model ever and watching a segment of its consumer base walk away over policy decisions unrelated to model quality.

The technical achievement is real. A model that scores above the human baseline on computer-use tasks while handling million-token contexts and producing fewer errors than its predecessor represents genuine progress. Whether those self-reported benchmarks survive independent testing is the next question. For now, OpenAI has set the bar that Anthropic and Google need to clear with their next releases.
