Gemini 3.5 Flash Ranks #1 on Independent Agentic Benchmark: What the GDPval-AA Score Actually Shows

May 21, 2026 | 3 min read | Artificial Analysis | Partial | Very Weak

Tech Jacks Solutions AI News Coverage

Three days after Google's I/O launch, independent evaluator Artificial Analysis has assigned Gemini 3.5 Flash a GDPval-AA Elo score of 1656, ranking it #1 among evaluated agentic coding models. The launch claims are vendor-stated; this score is the first third-party data point the developer community can actually use.


GDPval-AA Elo: 1656 (Rank #1, Artificial Analysis)

Key Takeaways

The GDPval-AA Elo score of 1656 from Artificial Analysis is the only independently confirmed benchmark for Gemini 3.5 Flash; all other benchmark figures are vendor-reported
Google states a 4x speed improvement at baseline and 12x inside Antigravity; no independent lab has reproduced either figure as of publication
API pricing ($1.50 input / $9.00 output per million tokens) comes from community reports, not an official Google pricing page; treat it as unconfirmed
The Epoch AI evaluation is pending; production throughput and latency data don't yet exist from any third-party source

Model Release

Gemini 3.5 Flash

Organization: Google

Type: LLM, mid-tier

Parameters: Not disclosed

Benchmark: [SELF-REPORTED] Speed: 4x faster at baseline (Google); [INDEPENDENT] GDPval-AA Elo: 1656, Rank #1 (Artificial Analysis)

Availability: Gemini API, Google AI Studio, Android Studio, Gemini Enterprise

The launch already happened. On May 19, Google announced Gemini 3.5 Flash at I/O 2026 alongside Antigravity 2.0, its multi-agent orchestration platform replacing Gemini CLI. Those stories are covered. What wasn’t settled on day one was whether the benchmark numbers hold up when someone other than Google runs them.

One number now has independent support. Artificial Analysis, a third-party model evaluation service, assigned Gemini 3.5 Flash a GDPval-AA Elo score of 1656, ranking it #1 among the models it has evaluated on that benchmark. GDPval-AA measures agentic task performance (planning, tool use, and multi-turn execution), which makes it more relevant to coding agent workflows than general reasoning benchmarks are. That ranking matters for teams choosing a platform, but it comes with a hard ceiling: it’s one benchmark from one evaluator. Epoch AI has not yet published an evaluation.
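For readers unsure how to interpret an Elo figure, here is a minimal sketch of the conventional Elo expected-score formula. It assumes GDPval-AA uses the standard logistic curve with a 400-point scale, which Artificial Analysis has not been confirmed to do in the material covered here. The rival rating of 1600 is hypothetical, chosen only for illustration, and is not a published score.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A versus B under the conventional Elo logistic curve.

    Assumption: the 400-point scale is the chess convention; the benchmark's
    actual scaling is not stated in this article.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


GEMINI_35_FLASH_ELO = 1656      # reported by Artificial Analysis
HYPOTHETICAL_RIVAL_ELO = 1600   # illustrative only, not a real model's score

expected = elo_expected_score(GEMINI_35_FLASH_ELO, HYPOTHETICAL_RIVAL_ELO)
print(f"Expected head-to-head score: {expected:.2f}")
```

Under those assumptions, a 56-point gap works out to an expected score of roughly 0.58, a modest edge rather than a decisive one. A #1 rank says nothing about the size of the margin.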

The speed claims are still Google’s alone. According to Google’s own published benchmarks, Gemini 3.5 Flash runs 4x faster than other frontier models at baseline and up to 12x faster when served inside the Antigravity environment. No independent lab has reproduced those ratios as of publication. They may be accurate, since speed advantages at this scale are plausible, but until a third party runs the same test, they’re a vendor claim, not a fact.

The Antigravity integration is where speed numbers get complicated. Per Google’s API documentation, Managed Agents within Antigravity provide a hosted Linux sandbox in Google Cloud for code execution and web browsing. The 12x figure applies inside that environment, not at the raw API layer. Teams building outside Antigravity should expect the 4x baseline claim, not 12x, and even that figure needs independent verification.

Disputed Claim

Gemini 3.5 Flash is up to 12x faster inside the Antigravity environment and 4x faster than other frontier models at baseline

Both figures are from Google's own benchmarking. No independent laboratory has reproduced these ratios as of May 21, 2026.

Use the GDPval-AA Elo score for comparative evaluation. Wait for the Epoch AI evaluation before citing speed claims in procurement or architectural decisions.

Don’t expect the pricing picture to be fully clear yet. Community sources report pricing of $1.50 per million input tokens and $9.00 per million output tokens, but Google’s official pricing page has not yet confirmed those figures. The free tier inside Antigravity is confirmed as a developer acquisition move, though the strategic framing (that it targets users of paid tools like OpenAI Codex) comes from community interpretation, not a Google statement.

The context window is large: 1.0 million input tokens with 66,000 output tokens, per vendor specifications. At that scale, per-token pricing matters more than it does for short-context tasks.
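To make the scale concrete, here is a rough cost sketch using the community-reported prices. Those prices are unconfirmed, the `request_cost` helper is hypothetical, and the sketch assumes a single request can use the full vendor-stated input and output limits at a flat per-token rate with no caching or tiered discounts.

```python
from decimal import Decimal

# Community-reported, unconfirmed prices in USD per 1M tokens.
INPUT_PRICE_PER_M = Decimal("1.50")
OUTPUT_PRICE_PER_M = Decimal("9.00")
TOKENS_PER_M = Decimal(1_000_000)


def request_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Flat-rate cost of one request in USD (assumes no caching or discounts)."""
    total = (Decimal(input_tokens) * INPUT_PRICE_PER_M
             + Decimal(output_tokens) * OUTPUT_PRICE_PER_M)
    return total / TOKENS_PER_M


# A request that fills the vendor-stated limits versus a short chat-style one.
full_context = request_cost(1_000_000, 66_000)
short_prompt = request_cost(2_000, 500)

print(f"Full-context request: ${full_context}")
print(f"Short request:        ${short_prompt}")
```

At the reported rates, a single maxed-out request costs about $2.09, against well under a cent for the short one. An agent loop that resends a large context on every turn multiplies the larger figure, which is why confirming the official price matters before budgeting.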

The part nobody mentions is what Epoch AI’s silence means right now. The GDPval-AA score gives the community one anchor. But GDPval-AA measures agentic task success, not inference cost, latency distribution, or performance under production load. Those are the variables that break models in real deployments. Until an evaluator runs Gemini 3.5 Flash at production throughput, not benchmark batch conditions, the speed claims are directional, not operational.
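Teams that want their own numbers in the meantime can summarize measured latencies by percentile rather than by average. The sketch below uses the nearest-rank method. The sample latencies are invented for illustration and do not come from any Gemini 3.5 Flash measurement.

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical end-to-end latencies in seconds; one slow outlier in the tail.
latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.0, 0.9, 1.1, 1.0, 6.5]

print(f"mean: {mean(latencies):.2f}s")
print(f"p50:  {percentile(latencies, 50):.2f}s")
print(f"p95:  {percentile(latencies, 95):.2f}s")
```

In this made-up sample the median is 1.0 seconds while the 95th percentile is 6.5 seconds, the kind of tail that a single headline speed ratio does not describe.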

What to Watch

Epoch AI publishes independent Gemini 3.5 Flash evaluation: TBD, not yet scheduled

Google official pricing page confirmation ($1.50/$9.00 per 1M tokens): Immediate, check before budgeting

Independent reproduction of 4x and 12x speed claims: Likely weeks post-launch

See the May 19 launch brief for full launch context and the Antigravity platform migration details.

TJS synthesis:

The GDPval-AA Elo ranking is the only independently sourced performance data point available for Gemini 3.5 Flash right now. Weight it accordingly: it’s relevant, it’s from a credible evaluator, and it’s also a single benchmark. Wait for the Epoch AI evaluation and production throughput data before making platform migration decisions. The free Antigravity tier is worth testing. The vendor speed claims are not worth building around yet.