Agentic Coding Platforms: What Google, Anthropic, and OpenAI Actually Built, and What Enterprise Teams Can Verify

May 21, 2026 5 min read Artificial Analysis Partial

Tech Jacks Solutions AI News Coverage

Google, Anthropic, and OpenAI each shipped an agentic coding platform in the same two-week window in May 2026. They have different architectures, different pricing structures, and now different talent strategies. Enterprise teams evaluating which platform to build on are looking at three genuinely distinct bets, and the independent benchmark data is thin enough that the architectural differences matter more than the launch announcements.

gemini-3-5-flash agentic-coding google-antigravity claude-code openai-codex agentic-ai-benchmark ai-developer-platform enterprise-ai-2026 gdpval-benchmark

GDPval-AA Elo, 1,656 (Artificial Analysis)

Key Takeaways

Google, Anthropic, and OpenAI each launched agentic coding platforms in the same two-week window with architecturally distinct approaches, sandboxed cloud, API-first, and on-premises respectively
Gemini 3.5 Flash holds the only independently verified performance anchor in this comparison: GDPval-AA Elo 1656 (Artificial Analysis), all other vendor benchmark claims lack third-party confirmation
Google's speed claims (4x baseline, 12x inside Antigravity) are vendor-self-reported; the 12x figure applies specifically to the Antigravity sandbox environment, not raw API access
Karpathy's arrival at Anthropic signals a recursive pre-training research investment, a roadmap bet on 2027 capability development, not a near-term product update
Enterprise teams should evaluate against their own data residency and security constraints before comparing benchmark numbers, and wait for Epoch AI evaluation before making migration decisions

Agentic Coding Platform Architecture (May 2026)

Google Antigravity 2.0

Sandboxed cloud execution (Google-managed Linux sandbox)

Anthropic Claude Code

API-first developer tooling + pre-training research investment

OpenAI Codex (Dell)

On-premises enterprise deployment

Two weeks. Three platforms. No shared benchmarks.

That’s the state of the agentic coding market after Google I/O 2026. Google shipped Antigravity 2.0 and Gemini 3.5 Flash. Anthropic advanced Claude Code with Andrej Karpathy joining to lead a new pre-training research direction. OpenAI extended Codex with a Dell on-premises deployment option targeting enterprises that can’t send code to a cloud sandbox. Every vendor claims their approach is the right architecture. The independent evaluation data needed to adjudicate that claim doesn’t exist yet, with one exception.

That exception is the GDPval-AA Elo leaderboard. Artificial Analysis, a third-party model evaluation service, assigned Gemini 3.5 Flash an Elo score of 1656, placing it among the top performers on the GDPval-AA agentic task benchmark, behind GPT-5.5 (Elo 1769) and GPT-5.4 (Elo 1674). GDPval-AA measures planning, tool-use execution, and multi-turn task completion, the core loop of agentic coding workflows. That’s a meaningful independent data point in a landscape where most numbers come from vendor press releases.

The catch is what GDPval-AA doesn’t measure. Production throughput. Latency under concurrent load. Cost per successfully completed task at volume. Those are the variables that determine whether a model that wins benchmarks can hold up in a real deployment. Epoch AI has not yet evaluated Gemini 3.5 Flash. No independent lab has reproduced Google’s stated speed ratios, 4x faster at baseline, 12x inside the Antigravity environment, as of publication. The gap between benchmark performance and production reality is where platform decisions go wrong.

Three Architectures, Three Bets

The platforms aren’t just different models. They embed different architectural assumptions about where agentic execution should happen and who controls the environment.

Google’s bet is sandboxed cloud execution. Per Google’s API documentation, Managed Agents within Antigravity provide a hosted Linux sandbox in Google Cloud. Code runs in Google’s environment, browsing happens in Google’s environment, and the Interactions API manages the tool-use loop. The 12x speed claim applies inside this sandbox, not at the raw API layer. Teams that want Antigravity’s full performance envelope accept that their execution environment is Google-managed. For enterprises with strict data residency requirements or security controls on code execution, that’s a non-trivial constraint.

Anthropic’s bet is API-first developer tooling with a longer-horizon pre-training investment. Claude Code has earned significant enterprise developer adoption, per the Ramp Index, Claude’s share of enterprise AI spend has been climbing in 2026. Karpathy’s arrival signals something more specific than a hiring headline. His stated focus, per Anthropic’s announcement, is recursive model-in-the-loop pre-training, using Claude itself to accelerate Claude’s own development cycles. That’s a pre-training research direction, not a near-term product feature. Its implications for the platform’s capability trajectory matter more for 2027 decisions than for today’s integration choices.

OpenAI’s Codex bet is on-premises enterprise deployment. The Dell partnership extends Codex into environments where cloud-based execution isn’t an option, regulated industries, defense contractors, enterprises with egress restrictions on source code. The developer stack control question this raises is real: on-premises deployment means slower model updates, enterprise dependency on hardware refresh cycles, and a different support relationship than API-first competitors offer.

What the Benchmark Data Actually Shows

Be precise about what’s verified here and what isn’t.

Agentic Coding Platform Positions, May 2026

Google (Antigravity 2.0 + Gemini 3.5 Flash)

for

Sandboxed cloud execution; speed-optimized; free tier for developer acquisition

Anthropic (Claude Code + Karpathy)

for

API-first; growing enterprise spend share; pre-training research bet on next-generation capability

OpenAI (Codex + Dell)

for

On-premises deployment for regulated/restricted-egress enterprises

Evidence

Gemini 3.5 Flash is 4x faster than frontier models at baseline and 12x faster inside Antigravity

Google-only benchmarking; no independent reproduction as of 2026-05-21

Unanswered Questions

Does the GDPval-AA Elo score hold under production throughput conditions, or only in benchmark batch testing?
What are the data residency implications of Google-managed Antigravity sandbox execution for enterprises with code egress restrictions?
When will Epoch AI publish an independent evaluation covering throughput and latency, not just task success?
Does Karpathy's pre-training focus imply a significant model version transition for Claude Code before year-end?

Verified with independent sourcing: Gemini 3.5 Flash scored 1656 Elo on GDPval-AA per Artificial Analysis. That’s one benchmark from one evaluator, covering agentic task performance. It’s the strongest independent data point in this comparison.

Vendor-stated, not independently verified: Google’s 4x and 12x speed ratios. Anthropic’s Claude Code performance claims from internal evaluations. OpenAI’s Codex capability claims for the Dell on-premises deployment.

Don’t expect a clean comparison table here. Any table that places unverified vendor figures next to independently sourced ones will mislead readers about what the data actually supports. The honest state of the benchmark landscape is that GDPval-AA gives you one anchor for Gemini 3.5 Flash, and you’re extrapolating from there.

The Talent Signal

Karpathy’s move to Anthropic matters for a specific reason that the hiring headline obscures.

Pre-training is where foundation model capabilities are set. It’s also the highest-cost phase of model development, the part that takes months and hundreds of millions in compute before producing anything testable. Most AI companies are in a post-pre-training optimization phase right now: fine-tuning, RLHF, inference optimization. A researcher of Karpathy’s caliber joining to lead a recursive pre-training research direction suggests Anthropic is making a bet that the next significant capability jump comes from pre-training innovation, not post-training refinement.

That’s a different platform thesis than Google’s inference speed focus or OpenAI’s enterprise deployment approach. Whether it’s right is a 2027 question. But enterprise teams building on Claude Code today should understand that Anthropic’s roadmap includes a significant pre-training research push, which means a capability step-change could arrive with a model version transition, not a gradual update.

The Enterprise Decision Framework

Don’t decide now if you don’t have to. The honest recommendation is a structured evaluation pause.

What to Watch

Epoch AI independent evaluation of Gemini 3.5 FlashTBD, not yet scheduled

Official Google pricing page confirmation for Gemini 3.5 Flash APIImmediate, check before production budgeting

Independent reproduction of 4x and 12x speed claims by third-party labLikely weeks post-launch

Anthropic pre-training research outcomes, Claude version transition signal2027 horizon

Analysis

The three-platform convergence in May 2026 confirms that agentic coding has moved from experimental to vendor-committed. That's a meaningful signal. But vendor commitment doesn't equal production readiness, and this particular window has more unverified performance claims than is typical for a mature competitive market. The evaluation gap, between benchmark scores and production throughput data, is the real story right now.

Here’s what to actually do this week: Test the free Gemini 3.5 Flash tier inside Antigravity on real tasks from your codebase. Benchmark your own workflows, not vendor benchmarks, against the tools you’re already using. Document your data residency requirements before the architecture conversation, not after.

Here are the triggers that should move you from evaluation to decision:

Epoch AI publishes an independent Gemini 3.5 Flash evaluation. That’s the first benchmark that covers throughput and latency systematically. It’ll give you data that GDPval-AA doesn’t.

Google confirms official API pricing. Community-reported figures ($1.50 per million input tokens, $9.00 per million output tokens) are plausible but unconfirmed. At production token volumes, the difference between a community estimate and the actual price is material.

An independent lab reproduces or refutes Google’s speed ratios. The 4x and 12x figures are the most consequential vendor claims in this cycle. They’re also the least verified.

The three-platform convergence in May 2026 is real, and it matters. Agentic coding has moved from experimental to vendor-committed in a single I/O cycle. But vendor commitment isn’t the same as production readiness, and this particular competitive window has more unverified claims than usual. The infrastructure stack analysis from May 19 covers the broader ecosystem context for teams who want the architecture picture alongside the competitive one.

TJS synthesis: GDPval-AA Elo 1656 gives Gemini 3.5 Flash one credible external ranking. It’s not enough to build a platform decision on. The three-platform moment is real, Google, Anthropic, and OpenAI have converged on agentic coding as the primary developer battleground. But their architectures solve different problems: sandboxed cloud execution, API-first developer velocity, and on-premises enterprise deployment aren’t interchangeable. Match the architecture to your constraints first. Then wait for Epoch AI before comparing performance numbers across vendors.

View Source

More Technology intelligence

View all Technology

Gallery

Contacts