The launch already happened. On May 19, Google announced Gemini 3.5 Flash at I/O 2026 alongside Antigravity 2.0, its multi-agent orchestration platform replacing Gemini CLI. Those stories are covered. What wasn’t settled on day one was whether the benchmark numbers hold up when someone other than Google runs them.
One number now has independent support. Artificial Analysis, a third-party model evaluation service, assigned Gemini 3.5 Flash a GDPval-AA Elo score of 1656, ranking it #1 among the models it’s evaluated on that benchmark. GDPval-AA measures agentic task performance, planning, tool-use, and multi-turn execution, making it more relevant for coding agent workflows than general reasoning benchmarks. That ranking matters for teams choosing a platform, but it comes with a hard ceiling: it’s one benchmark from one evaluator. Epoch AI has not yet published an evaluation.
The speed claims are still Google’s alone. According to Google’s own published benchmarks, Gemini 3.5 Flash runs 4x faster than other frontier models at baseline and up to 12x faster when served inside the Antigravity environment. No independent lab has reproduced those ratios as of publication. They may be accurate, speed advantages at this scale are plausible, but until a third party runs the same test, they’re a vendor claim, not a fact.
The Antigravity integration is where speed numbers get complicated. Per Google’s API documentation, Managed Agents within Antigravity provide a hosted Linux sandbox in Google Cloud for code execution and web browsing. The 12x figure applies inside that environment, not at the raw API layer. Teams building outside Antigravity should expect the 4x baseline claim, not 12x, and even that figure needs independent verification.
Disputed Claim
Don’t expect the pricing picture to be fully clear yet. Pricing has been reported at $1.50 per million input tokens and $9.00 per million output tokens per community sources, but official pricing page confirmation is pending. The free tier inside Antigravity is confirmed as a developer acquisition move, though the strategic framing, that it targets users of paid tools like OpenAI Codex, comes from community interpretation, not a Google statement.
The context window is large: 1.0 million input tokens with 66,000 output tokens, per vendor specifications. At that scale, per-token pricing matters more than it does for short-context tasks.
The part nobody mentions is what Epoch AI’s silence means right now. The GDPval-AA score gives the community one anchor. But GDPval-AA measures agentic task success, not inference cost, latency distribution, or performance under production load. Those are the variables that break models in real deployments. Until an evaluator runs Gemini 3.5 Flash at production throughput, not benchmark batch conditions, the speed claims are directional, not operational.
What to Watch
See the May 19 launch brief for full launch context and the Antigravity platform migration details.
TJS synthesis:
The GDPval-AA Elo ranking is the only independently sourced performance data point available for Gemini 3.5 Flash right now. Weight it accordingly, it’s relevant, it’s from a credible evaluator, and it’s also a single benchmark. Wait for Epoch AI evaluation and production throughput data before making platform migration decisions. The free Antigravity tier is worth testing. The vendor speed claims are not worth building around yet.