Is Gemini 3.1 Pro cheaper than Claude Opus 4.8?

Yes, by a wide margin. Gemini 3.1 Pro Preview costs $2 per million input tokens and $12 per million output at or below 200K context, versus $5 input and $25 output for Claude Opus 4.8 at standard speed. That is roughly 2.5x cheaper on input and about 2x cheaper on output. Claude's Opus 4.7-plus tokenizer can also map the same text to up to 1.35x as many tokens as older Opus versions, so the effective gap on identical prompts can be larger than the sticker price suggests.

Which model is better at coding, Gemini 3.1 Pro or Claude Opus 4.8?

On the headline coding benchmark, Claude Opus 4.8 leads. Anthropic reports 88.6% on SWE-bench Verified for Opus 4.8, while Google reports 80.6% for Gemini 3.1 Pro and independent runs put Gemini between 69.6% and 75.6%. These figures come from different harnesses and are not directly comparable, but the gap favors Claude for agentic coding work.

Can you compare Gemini and Claude benchmark scores directly?

Not safely. Benchmarks like SWE-bench Verified, Terminal-Bench, and OSWorld are sensitive to the test harness, prompting, and tool access each vendor uses. Gemini 3.1 Pro and Claude Opus 4.8 were measured on differently configured setups, so a small point difference rarely reflects real capability. Treat each score as directional within its own benchmark, not as a head-to-head ranking.

Gemini 3.1 Pro vs Claude Opus 4.8: Which Frontier Model Wins in 2026?

Two of the strongest API models on the market, and they are not priced to compete on the same axis. Google's Gemini 3.1 Pro Preview lists at $2 per million input tokens. Anthropic's Claude Opus 4.8 lists at $5, and its newer tokenizer can quietly push the real bill higher than that number implies. Claude, meanwhile, posts the strongest coding and agentic numbers in this matchup. So the honest question is not "which model is smarter." It is "which model is the better deal for the work you actually run." We pulled the verified pricing and benchmarks, flagged what is vendor-reported versus independently measured, and took a position you can act on.

Read the Benchmarks Carefully

Gemini 3.1 Pro launched February 19, 2026 (as a Preview model, API id gemini-3.1-pro-preview). Claude Opus 4.8 launched May 28, 2026 (claude-opus-4-8). Every benchmark below is labeled vendor-reported or independent. The two vendors run different test harnesses, so scores are directional within each benchmark and are not a clean head-to-head. Where independent numbers diverge from a vendor's headline, we show both.

Quick Verdict: Gemini 3.1 Pro vs Claude Opus 4.8

It splits cleanly by workload

Gemini 3.1 Pro is the default for cost-sensitive, high-volume, multimodal work: roughly 2.5x cheaper input and native video, audio, and image handling. Claude Opus 4.8 is the default for agentic coding and tool-use workloads, where it posts the strongest verified numbers in this matchup.

Verdict: These are not interchangeable. Pick Gemini 3.1 Pro if your bottleneck is cost or you process video, audio, and images: $2 input / $12 output per million tokens (at or below 200K context) is about 2.5x cheaper on input than Claude, and Gemini is natively multimodal in a single model. Pick Claude Opus 4.8 if your bottleneck is agentic coding quality: 88.6% on SWE-bench Verified (Anthropic-reported), a leading 74.6% on Terminal-Bench 2.1 (Anthropic-reported), and the top independent agentic scores here (1890 Elo on Artificial Analysis GDPval-AA, 82.2% on Scale AI's MCP-Atlas). The two models trade blows on raw science reasoning, where the gap is inside the noise. So the real decision is whether you are buying throughput at a price or buying the best agentic coder regardless of price.

The evidence, with sources labeled, follows.

Update (June 9, 2026): Anthropic's flagship is now Claude Fable 5, which sits above Opus 4.8. This comparison still reflects Opus 4.8 (a current, supported model); for the newest matchup see the Fable 5 review and the Fable 5 vs GPT-5.5 vs Gemini 3.1 Pro comparison.

Gemini 3.1 Pro is roughly 2.5x cheaper on input tokens ($2 vs $5 per million) and handles video and audio natively
Claude Opus 4.8 leads the coding and agentic benchmarks (88.6% SWE-bench Verified vendor-reported; top independent agentic scores), but costs more per token
Claude's newer tokenizer can raise the real per-request cost by up to 35% on the same text, so the sticker-price gap understates the spend gap
On PhD-level science reasoning (GPQA Diamond) the two are effectively tied, around 94% each
Benchmark scores are not directly comparable across vendors; treat them as directional and verify pricing before you commit budget

Gemini 3.1 Pro vs Claude Opus 4.8 at a Glance

Metric	Gemini 3.1 Pro	Claude Opus 4.8
API Input Price	$2 / 1M input	$5 / 1M input
API Output Price	$12 / 1M output	$25 / 1M output
Context Window	1M in / 64K out	1M in / 128K out
SWE-bench Verified	80.6% (Google) / 69.6-75.6% (indep.)	88.6% (Anthropic)
GPQA Diamond	94.3% (Google) / 94.1% (indep.)	93.6% (Anthropic)
Terminal-Bench	68.5% (Terminal-Bench 2.0)	74.6% (Terminal-Bench 2.1)
Multimodal	Native video, audio, image	Text and image; strong tool use

Cheaper Input (Gemini): 2.5x ($2 vs $5 per 1M tokens)
Claude SWE-bench Verified: 88.6% (Anthropic-reported)
GPQA Diamond (Both): ~94% (94.3% Google / 93.6% Anthropic)
Claude Tokenizer Inflation: up to 1.35x on identical text
Claude GDPval-AA Elo: 1890 (Artificial Analysis)

How to Read These Numbers

Every benchmark here was produced on the vendor's own test harness unless we mark it as independent. SWE-bench Verified, Terminal-Bench, and GPQA Diamond results shift by several points depending on prompting, tool access, and scaffolding, and the two vendors did not use identical setups. Google's own SWE-bench Verified figure (80.6%) sits well above the 69.6 to 75.6% range that independent evaluators (LM Council, Failing Fast) reported for the same model. Treat a 1 to 3 point difference as noise; treat the price gap, which is structural, as real.

Contender Profiles

Gemini 3.1 Pro

Google's frontier AI model, served through the Gemini API and Google AI Studio (API id gemini-3.1-pro-preview). Released February 19, 2026 as a Preview model. The differentiator that matters in this matchup: it is natively multimodal (text, image, video, audio, and code in one model) and it is priced aggressively, with a 1M-token input window. A separate custom-tools endpoint exists for bash and tool workflows.

API pricing: $2.00 input / $12.00 output per 1M tokens (at or below 200K context); $4.00 / $18.00 above 200K. Context caching cuts the price of cached input by roughly 90%. The Batch API runs at 50% off. Verified June 8, 2026.

ai.google.dev pricing

Claude Opus 4.8

Anthropic's flagship model (API id claude-opus-4-8), released May 28, 2026. The differentiator that matters here: it posts the strongest coding and agentic numbers in this comparison, with adaptive thinking, effort controls, strong tool use, and self-verification behavior. It is the pricier model per token, and its newer tokenizer can raise the effective cost on identical prompts.

API pricing: $5.00 input / $25.00 output per 1M tokens (standard); $10.00 / $50.00 in fast mode (2.5x speed). 50% batch discount, and up to 90% off prompt-cache reads. Verified June 8, 2026.

anthropic.com/pricing

Round 1 -- Reasoning & Science

Reasoning and Science: A Genuine Draw

This is where vendors love to fight, and where the gap is smallest. On hard science reasoning, the two models are separated by less than a point, and that point is inside the margin of error for the benchmark.

On GPQA Diamond (PhD-level science questions, the kind you would throw at a model when analyzing a dense research paper) Gemini 3.1 Pro scores 94.3% as reported by Google, and an independent run by the LM Council put it at 94.1% (plus or minus 1.7). It scores 95.1% on MATH and 92% on AIME 2025 with no tools, rising to 95.6% (plus or minus 3.1) in the LM Council mock. On ARC-AGI-2, a test built to measure reasoning flexibility on novel problems, Gemini hits 77.1%, confirmed on the independent ARC Prize leaderboard.

Anthropic reports 93.6% on GPQA Diamond for Claude Opus 4.8, which lands within a point of Gemini and well inside the benchmark's own confidence band. Where Opus 4.8 shows a different kind of strength is Humanity's Last Exam (HLE), a deliberately brutal frontier-knowledge test: 49.8% with no tools and 57.9% with tools, per Anthropic. The with-tools jump is the tell. Opus 4.8 is tuned to reason better when it can call tools and verify, which is the mode most production systems actually run in.

Notice what is missing: a clean, identical benchmark both vendors ran the same way. Google leans on MATH and AIME; Anthropic leans on HLE. The one shared test, GPQA Diamond, has them effectively tied. Anyone claiming a decisive reasoning winner here is reading vendor marketing, not the data.

Winner: Tie. On the only directly shared benchmark (GPQA Diamond) the two are within a point, vendor-reported on both sides. Pick this dimension as a deciding factor only if you have run your own evaluation on your own prompts.
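The "inside the noise" reading can be checked mechanically. The sketch below is a rough check, assuming the plus-or-minus 1.7 quoted for the LM Council run is a symmetric interval around its 94.1% score; the function name is illustrative. It compares one vendor's point estimate against another evaluator's interval, so it is not a formal significance test.

```python
def within_interval(score: float, center: float, half_width: float) -> bool:
    """Return True if `score` falls inside center ± half_width."""
    return abs(score - center) <= half_width


# Figures quoted in this section: the LM Council measured Gemini 3.1 Pro at
# 94.1 ± 1.7 on GPQA Diamond; Anthropic reports 93.6 for Claude Opus 4.8.
gemini_independent = 94.1
gemini_half_width = 1.7
claude_reported = 93.6

tied = within_interval(claude_reported, gemini_independent, gemini_half_width)
print(f"Claude's score inside Gemini's interval: {tied}")
```

On these numbers the check returns True, which is why this round is scored as a draw.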

Round 2 -- Coding & Agentic Work

Coding and Agentic Work

If you are choosing an API model to drive a coding agent, this is the dimension that should decide it. And here the gap is wide enough to survive the harness caveat.

Anthropic reports 88.6% on SWE-bench Verified (fixing real issues in real repositories) and 69.2% on the harder SWE-bench Pro. On Terminal-Bench 2.1, which covers command-line engineering tasks like writing scripts and configuring servers, Opus 4.8 posts 74.6%. The numbers that matter most, because they are independent, come from outside Anthropic: an 1890 Elo on Artificial Analysis GDPval-AA and 82.2% on Scale AI's MCP-Atlas agentic tool-use benchmark. Opus 4.8 also self-verifies more aggressively, which Anthropic says makes it roughly 4x less likely to ship flawed code without flagging it.

Gemini's case here is messier. Google reports 80.6% on SWE-bench Verified, but independent evaluators land lower: 75.6% (plus or minus 2.0) from the LM Council and 69.6% from Failing Fast. On Terminal-Bench 2.0, Google reports 68.5% natively, with up to 80.2% reachable using the third-party TongAgents framework. Take the most generous independent Gemini number (75.6%) against Claude's reported 88.6% and the gap is real, not a rounding artifact. Gemini's counter is structural, not benchmark-based: lower token prices and aggressive context caching make it cheaper to push large volumes of code through the model, which matters for whole-repo passes. Its 64K-token output ceiling is half of Claude's 128K, so that advantage sits on the input side.

Winner: Claude Opus 4.8. Claude leads on its own SWE-bench Verified number (88.6% vs Gemini's 80.6% reported, 69.6 to 75.6% independent) and on the two independent agentic benchmarks here (GDPval-AA Elo, MCP-Atlas). The harness caveat applies, but it does not close an 8-to-19-point gap. If agentic coding is the job, Claude is the pick, and you pay for it.

Round 3 -- Multimodal & Input Range

Multimodal Range

This is the dimension where the two models are not playing the same game, and Gemini's design wins it outright.

1 model: Gemini 3.1 Pro processes text, image, video, audio, and code in a single native model. Claude Opus 4.8 handles text and images, with no native video or audio.

Gemini 3.1 Pro is natively multimodal: one model ingests text, images, video, audio, and code without bolting on a separate transcription or vision pipeline. If your workload includes analyzing recordings, video, or audio alongside text, Gemini is the only model of the two that does it natively. Pair that with a 1M-token input window and the aggressive token price, and Gemini is the obvious choice for large mixed-media ingestion.

Claude Opus 4.8 handles text and images. There is no native video or audio path, so audio and video work means a separate preprocessing step before Claude ever sees it. Where Opus 4.8 spends its capability budget instead is computer and tool use: it scores 83.4% on OSWorld (desktop task automation) and 84% on Online-Mind2Web (web navigation), both Anthropic-reported. That is a different strength, and a real one, but it does not close the multimodal-ingestion gap.

Winner: Gemini 3.1 Pro. Native video and audio in a single model is a capability Claude Opus 4.8 simply does not offer. Claude's computer-use and web-navigation strength is impressive, but it answers a different question than "can the model watch this video for me."

Round 4 -- Cost & Pricing Model

Cost and Pricing Model

For anyone running these models at the API level, cost is not a footnote. It is the line item that decides which model you can afford to run at scale. Here Gemini's advantage is structural, and the tokenizer detail makes it bigger than it first looks.

Gemini 3.1 Pro lists at $2.00 per million input tokens and $12.00 per million output (at or below 200K context), rising to $4.00 / $18.00 above 200K. Context caching cuts cached input by roughly 90%, and the batch API takes another 50% off for offline jobs. At standard rates, Gemini is about 2.5x cheaper on input and roughly 2x cheaper on output than Claude. For a high-volume pipeline, that is the difference between a project that pencils out and one that does not.
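The two-tier rate card is easier to reason about as a function. The sketch below uses the rates quoted above and assumes the higher tier applies to the whole request once the prompt exceeds 200K tokens; that billing rule, and the function and constant names, are assumptions to confirm against Google's pricing page.

```python
# USD per 1M tokens as (input, output), taken from the rates quoted above.
GEMINI_STANDARD = (2.00, 12.00)      # at or below 200K context
GEMINI_LONG_CONTEXT = (4.00, 18.00)  # above 200K context
LONG_CONTEXT_THRESHOLD = 200_000     # tokens


def gemini_request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the list-price cost of one request in USD.

    Assumption: a prompt above the threshold bills the entire request
    at the long-context rates.
    """
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        in_rate, out_rate = GEMINI_LONG_CONTEXT
    else:
        in_rate, out_rate = GEMINI_STANDARD
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


print(f"100K-token prompt: ${gemini_request_cost(100_000, 10_000):.2f}")
print(f"300K-token prompt: ${gemini_request_cost(300_000, 10_000):.2f}")
```

Under that assumption, tripling the prompt from 100K to 300K tokens more than triples the cost, because the request crosses into the higher tier.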

Claude Opus 4.8 lists at $5.00 input / $25.00 output per million tokens at standard speed, doubling to $10.00 / $50.00 in fast mode (2.5x speed). It offers a 50% batch discount and up to 90% off prompt-cache reads. The detail buyers miss: the Opus 4.7-plus tokenizer maps the same text to up to 1.35x as many tokens as older Opus versions, so a prompt that cost X before can cost up to 35% more now even though the per-token price did not change. Your real Claude bill can run higher than a naive per-token comparison suggests.

Winner: Gemini 3.1 Pro. Cheaper sticker price on both input and output, comparable caching and batch discounts, and no tokenizer surprise. If cost per token drives your decision, this is not close. Pricing verified June 8, 2026; both vendors change rates, so confirm before you commit budget.

Round 5 -- Tool Use & Reliability

Tool Use and Reliability

This dimension is about what happens when the model has to act, not just answer: calling tools, navigating an environment, and catching its own mistakes. For agent builders, it is the dimension that separates a demo from a system you can leave running.

Opus 4.8 posts the strongest tool-use and computer-use numbers here: 82.2% on Scale AI's MCP-Atlas (an independent agentic tool-use benchmark), 83.4% on OSWorld (desktop automation), and 84% on Online-Mind2Web (web navigation). It also adds effort controls (high, extra, max) so you can dial up deliberation on hard tasks, and Anthropic reports stronger self-verification, roughly 4x less likely to pass off flawed code without flagging it. The MCP-Atlas score is the one to weight most, because it comes from outside Anthropic.

Gemini 3.1 Pro is no slouch on agentic work, with a dedicated custom-tools endpoint (gemini-3.1-pro-preview-customtools) for bash and tool workflows, and a Terminal-Bench 2.0 score of 68.5% that climbs to 80.2% with the third-party TongAgents framework. But Google has not published a directly comparable MCP-Atlas or OSWorld number for Gemini 3.1 Pro, so on those benchmarks Claude's lead is uncontested rather than head-to-head. Gemini's strength is that you can run far more agentic steps per dollar.

Where the Comparison Breaks Down

Claude's agentic edge here rests partly on benchmarks (MCP-Atlas, OSWorld) that Google has not run for Gemini 3.1 Pro, so this is not a like-for-like sweep. What we can say with confidence is that on the independent agentic tests where both models have published numbers, Claude leads, and that neither model is reliable enough to run unsupervised on high-stakes tasks. Self-verification reduces bad output; it does not eliminate it. Build a human check into any workflow where a wrong answer is expensive.

The responsible position: verify critical outputs from both models, every time.

Winner: Claude Opus 4.8. Claude leads on the agentic benchmarks cited here (MCP-Atlas 82.2%, independent; OSWorld 83.4%, Anthropic-reported) and adds effort controls and stronger self-verification. The caveat: Google has not published matching numbers for Gemini, so weight this as "Claude leads where both were measured," not a clean sweep. If you are building AI governance around agent reliability, build verification layers no matter which model you choose.

Dimension-by-Dimension Scorecard

Dimension	Gemini 3.1 Pro	Claude Opus 4.8	Winner
Reasoning & Science	8.0	8.0	Tie
Coding & Agentic	7.5	9.0	Claude
Multimodal Range	9.5	5.5	Gemini
Cost & Pricing	9.0	6.0	Gemini
Tool Use & Reliability	7.5	8.5	Claude

Tally: Gemini 2 wins, Claude 2 wins, 1 draw.

Scores are our editorial reading of the verified evidence, not a single shared benchmark. A 2-2-1 split is the honest result: the two models win different jobs. The tally is a tie on count, which is why the decision comes down to your workload, not a leaderboard.

Gemini 3.1 Pro vs Claude Opus 4.8 API Pricing

Tier	Gemini 3.1 Pro	Claude Opus 4.8
Input (≤200K)	$2.00 / 1M (standard input)	$5.00 / 1M (standard input)
Output (≤200K)	$12.00 / 1M (standard output)	$25.00 / 1M (standard output)
Long Context	$4 / $18 / 1M (input / output, above 200K context)	$5 / $25 / 1M (flat to 1M input; no long-context surcharge tier published)
Fast Mode	No separate fast-mode tier	$10 / $50 / 1M (fast mode, 2.5x speed)
Caching	~90% off cached input via context caching (plus storage fee)	Up to 90% prompt-cache read discount
Batch	50% off (up to 24h turnaround)	50% off (batch processing)

Pricing verified June 8, 2026. Both vendors change rates; confirm at ai.google.dev pricing and anthropic.com/pricing before committing budget.

Read down the columns and the pattern is consistent: Gemini 3.1 Pro is cheaper at every standard tier, by roughly 2.5x on input and 2x on output. Both vendors offer aggressive caching (around 90%) and a 50% batch discount, so those level out for high-reuse workloads. Claude's fast mode buys 2.5x speed at double the price, which has no Gemini equivalent in this matchup.
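To see why matching cache discounts "level out", it helps to compute a blended input rate. This is a simplified sketch: it assumes a flat 90% discount on cached tokens for both vendors and ignores Gemini's cache storage fee and any cache-write charges. The function name and the 80% hit rate are illustrative.

```python
def blended_input_rate(list_rate: float, cached_fraction: float,
                       cache_discount: float = 0.90) -> float:
    """Average USD per 1M input tokens when part of the input is read from cache."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    return list_rate * (1.0 - cache_discount * cached_fraction)


# Hypothetical workload: 80% of input tokens are served from cache.
gemini = blended_input_rate(2.00, cached_fraction=0.80)
claude = blended_input_rate(5.00, cached_fraction=0.80)

print(f"Gemini blended input: ${gemini:.2f} / 1M")
print(f"Claude blended input: ${claude:.2f} / 1M")
print(f"Ratio: {claude / gemini:.2f}x")
```

Because the same percentage discount applies to both sides, the ratio between the two stays at the list-price ratio; only the absolute dollar gap shrinks.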

The tokenizer catch most cost models miss: Claude's Opus 4.7-plus tokenizer can map the same text to up to 1.35x as many tokens as older Opus versions. The per-token price did not change, but the token count for an identical prompt did, so your effective cost per request can rise as much as 35%. When you model Gemini-versus-Claude spend, do not just divide the sticker prices. Run your real prompts through both tokenizers and compare the billed token counts, or you will undercount Claude. For production systems where token cost is a line item, this detail moves the break-even point.
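One way to act on this is to compare billed input cost using token counts measured for each model, instead of dividing list prices. The sketch below is illustrative: the token counts are made up, and the 1.35x spread between the two counts is an assumption for the worst case (the article's 1.35x figure is relative to older Opus tokenizers, not to Gemini's). Real counts should come from each vendor's token-counting tooling.

```python
GEMINI_INPUT_RATE = 2.00  # USD per 1M input tokens (standard tier)
CLAUDE_INPUT_RATE = 5.00  # USD per 1M input tokens (standard speed)


def input_cost(tokens: int, rate_per_million: float) -> float:
    """List-price input cost in USD for a measured token count."""
    return tokens * rate_per_million / 1_000_000


def effective_input_ratio(gemini_tokens: int, claude_tokens: int) -> float:
    """How many times more Claude costs than Gemini for the same prompt text."""
    return (input_cost(claude_tokens, CLAUDE_INPUT_RATE)
            / input_cost(gemini_tokens, GEMINI_INPUT_RATE))


# Hypothetical measurement of one prompt under each tokenizer.
gemini_tokens = 10_000
claude_tokens = 13_500  # assumed worst case: 1.35x the Gemini count

print(f"Sticker ratio:   {CLAUDE_INPUT_RATE / GEMINI_INPUT_RATE:.2f}x")
print(f"Effective ratio: {effective_input_ratio(gemini_tokens, claude_tokens):.3f}x")
```

If the two counts come out equal for your prompts, the effective ratio falls back to the 2.5x sticker ratio; the point is to measure rather than assume.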

What This Means at Scale

Illustrative Monthly API Spend

The per-token gap looks small until you multiply it by real volume. The table below uses the verified standard rates ($2 input / $12 output for Gemini 3.1 Pro; $5 / $25 for Claude Opus 4.8) against a simple workload mix of 5 input tokens for every 1 output token. These are illustrative math, not quotes, and they ignore caching, batch discounts, and Claude's tokenizer inflation, all of which change the real bill.

Monthly Volume (input + output)	Gemini 3.1 Pro	Claude Opus 4.8	Monthly Difference
50M + 10M	$220	$500	$280
100M + 20M	$440	$1,000	$560
500M + 100M	$2,200	$5,000	$2,800
1B + 200M	$4,400	$10,000	$5,600
5B + 1B	$22,000	$50,000	$28,000

Note: figures assume standard-rate, sub-200K-context pricing and a 5:1 input-to-output ratio. Caching (around 90% off cached input) and batch (50% off) reduce both columns. Claude's tokenizer can raise its real token counts, widening the gap further. Run your own prompts through both billing meters before you commit.
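The table can be reproduced, or rerun with your own volumes, in a few lines. This sketch uses only the standard list rates quoted above and, like the table, ignores caching, batch discounts, long-context pricing, and tokenizer differences. The dictionary and function names are illustrative.

```python
# USD per 1M tokens as (input, output), standard rates from this article.
RATES = {
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Claude Opus 4.8": (5.00, 25.00),
}


def monthly_cost(model: str, input_millions: float, output_millions: float) -> float:
    """List-price monthly spend in USD for volumes given in millions of tokens."""
    in_rate, out_rate = RATES[model]
    return input_millions * in_rate + output_millions * out_rate


# (input, output) volumes in millions of tokens, at a 5:1 ratio.
volumes = [(50, 10), (100, 20), (500, 100), (1_000, 200), (5_000, 1_000)]

for input_m, output_m in volumes:
    gemini = monthly_cost("Gemini 3.1 Pro", input_m, output_m)
    claude = monthly_cost("Claude Opus 4.8", input_m, output_m)
    print(f"{input_m}M + {output_m}M: Gemini ${gemini:,.0f}, "
          f"Claude ${claude:,.0f}, difference ${claude - gemini:,.0f}")
```

Changing the `volumes` list to match your own input-to-output mix is the quickest way to see how far your workload sits from the 5:1 assumption.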

Before you switch a production pipeline: the cheapest model is not automatically the right one. If a Claude run resolves a coding task in one pass where Gemini needs two or three retries, Claude can be cheaper per completed task even at the higher token price. Measure cost per successful outcome, not cost per token.
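Cost per successful outcome is simple to model. The sketch below assumes every attempt costs the same as the first and uses a hypothetical task of 50K input and 10K output tokens; the attempt counts are illustrative, not measured results.

```python
def cost_per_attempt(input_tokens: int, output_tokens: int,
                     in_rate: float, out_rate: float) -> float:
    """List-price cost in USD of a single attempt (rates are per 1M tokens)."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def cost_per_success(attempt_cost: float, attempts_needed: int) -> float:
    """Total cost to reach one successful result, assuming identical attempts."""
    return attempt_cost * attempts_needed


# Hypothetical coding task: 50K tokens in, 10K tokens out per attempt.
gemini_attempt = cost_per_attempt(50_000, 10_000, 2.00, 12.00)
claude_attempt = cost_per_attempt(50_000, 10_000, 5.00, 25.00)

# Number of Gemini attempts at which one Claude pass costs the same.
break_even_attempts = claude_attempt / gemini_attempt

for attempts in (1, 2, 3):
    total = cost_per_success(gemini_attempt, attempts)
    cheaper = "Gemini" if total < claude_attempt else "Claude"
    print(f"Gemini x{attempts}: ${total:.2f} vs Claude x1: "
          f"${claude_attempt:.2f} -> {cheaper} is cheaper")
print(f"Break-even at about {break_even_attempts:.2f} Gemini attempts")
```

For this task shape, Gemini stays cheaper at two attempts and loses at three, so the crossover depends on how often retries actually happen in your own evaluation runs.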

Moving an Integration Between the Two

Expect friction at the API layer:

Prompt formats, system-prompt conventions, and tool-call schemas differ between the Gemini API and the Anthropic API
Output token ceilings differ (Gemini caps at 64K out, Claude at 128K out), so long-generation logic may need adjusting
Claude's tokenizer means your token budgets and rate-limit math will not carry over one-to-one
Re-run your evaluation suite on the new model before cutting over; benchmark wins do not guarantee wins on your prompts
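The output-ceiling difference in the list above is the kind of thing worth encoding once rather than rediscovering in production. A minimal sketch, assuming the rounded 64K and 128K figures from this article map to 64,000 and 128,000 tokens (the exact limits should be taken from each vendor's model documentation); the helper name is hypothetical.

```python
# Approximate output ceilings, in tokens, based on this article's rounded figures.
MAX_OUTPUT_TOKENS = {
    "gemini-3.1-pro-preview": 64_000,
    "claude-opus-4-8": 128_000,
}


def clamp_output_budget(model_id: str, requested_tokens: int) -> int:
    """Cap a requested output budget at the model's ceiling."""
    if model_id not in MAX_OUTPUT_TOKENS:
        raise KeyError(f"No output ceiling recorded for {model_id!r}")
    return min(requested_tokens, MAX_OUTPUT_TOKENS[model_id])


requested = 100_000
for model_id in MAX_OUTPUT_TOKENS:
    allowed = clamp_output_budget(model_id, requested)
    note = "fits in one response" if allowed == requested else "needs splitting"
    print(f"{model_id}: {allowed:,} tokens allowed, {note}")
```

A generation that fits in one Claude response may need to be split into multiple calls when moved to Gemini, which is the "long-generation logic" adjustment noted above.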

Admin and Compliance

Both vendors offer enterprise data processing agreements and opt-outs from training on your data; confirm the terms for your specific account tier
Verify data residency and retention with each vendor directly if you operate in a regulated industry
The Preview status of Gemini 3.1 Pro is worth flagging to procurement: Preview terms and availability can change with less notice than a generally available model

Data Training Policies

By default, free and consumer tiers from major AI vendors may use your prompts to improve their models; enterprise API tiers generally offer opt-outs. Confirm your specific tier's policy before sending sensitive data, and verify data residency requirements directly with each vendor. Neither this article nor a vendor marketing page substitutes for a legal review of your own compliance obligations.

Best For

Claude Opus 4.8

Teams building coding agents

Claude leads SWE-bench Verified (88.6% Anthropic-reported) and the independent agentic benchmarks here (MCP-Atlas 82.2%, GDPval-AA 1890 Elo), plus stronger self-verification. If the agent ships code, pay the premium.

Gemini 3.1 Pro

High-volume, cost-sensitive pipelines

At $2 input / $12 output, roughly 2.5x cheaper than Claude on input, Gemini is the default when token cost decides whether the workload is viable. Caching and batch discounts are available on both sides, so the ratio holds for high-reuse workloads.

Gemini 3.1 Pro

Video, audio, and mixed-media work

Gemini is natively multimodal across text, image, video, audio, and code. Claude Opus 4.8 has no native video or audio path, so anything beyond text and images means a separate preprocessing step.

Either

Hard science reasoning and Q&A

On GPQA Diamond the two are within a point (94.3% Google vs 93.6% Anthropic). For pure reasoning, pick on price (Gemini) unless you specifically need Claude's tool-augmented reasoning mode.

Claude Opus 4.8

Workflows that need long, single-pass output

Claude's 128K output ceiling is double Gemini's 64K, which matters when a single response has to be very long. Just remember to model the tokenizer inflation into your cost. Browse the wider AI tools landscape for adjacent options.

Both

Teams that can route by task

The mature pattern is to use both: Gemini for cheap, high-volume, multimodal calls and Claude for the agentic coding steps where quality pays for itself. A router that picks per task often beats committing to one model.
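A per-task router can start as a few explicit rules. The sketch below encodes this article's verdicts; the `Task` fields and the `kind` labels are hypothetical, and the model ids are the API ids quoted earlier. A production router would also weigh latency, context size, and your own evaluation results.

```python
from dataclasses import dataclass

GEMINI = "gemini-3.1-pro-preview"
CLAUDE = "claude-opus-4-8"


@dataclass(frozen=True)
class Task:
    kind: str                        # e.g. "agentic_coding", "classification"
    has_audio_or_video: bool = False


def route(task: Task) -> str:
    """Pick a model id for a task using the article's per-workload verdicts."""
    # Only Gemini ingests audio and video natively, so that check comes first.
    if task.has_audio_or_video:
        return GEMINI
    # Agentic coding is where Claude's quality is most likely to pay for itself.
    if task.kind == "agentic_coding":
        return CLAUDE
    # Everything else defaults to the cheaper model.
    return GEMINI


print(route(Task("agentic_coding")))
print(route(Task("summarization")))
print(route(Task("transcript_analysis", has_audio_or_video=True)))
```

Keeping the rules explicit makes it easy to change a routing decision after re-running your evaluation suite.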

Edge Cases: When the Wrong Choice Wins

Classification, extraction, tagging, and summarization at millions of calls a day. Claude's higher benchmark scores do not matter if the per-token cost makes the project uneconomic. At roughly 2.5x cheaper input, Gemini wins this on the spreadsheet alone.

Agentic coding can favor Claude even though it costs more. Claude Opus 4.8 leads SWE-bench Verified and the agentic benchmarks cited here (MCP-Atlas, independent; OSWorld, Anthropic-reported), and its self-verification reduces the silent-failure rate that makes coding agents expensive to babysit. Fewer bad runs can be cheaper than fewer cheap tokens.

Gemini's native multimodal handling plus a 1M-token input window and 90% context caching make large mixed-media or whole-repo ingestion both possible and affordable. Claude needs a preprocessing step for audio and video and charges more per token to read it.

The trap: optimizing cost per token instead of cost per successful result. If Claude solves a task in one pass where Gemini needs three retries, Claude can win on total cost despite the higher sticker price. Measure outcomes, then decide.


Fact-checked against vendor documentation and official sources, June 2026. Verify current pricing at ai.google.dev and anthropic.com before purchasing.

Freshness notice: AI models and pricing change fast, and Gemini 3.1 Pro is a Preview model whose terms can shift. Pricing and benchmarks here were verified June 8, 2026. If you are reading this more than 90 days after that date, confirm current rates and scores before committing. Check our AI Tools Hub for the latest updates.

Google and Gemini are trademarks of Google LLC. Anthropic and Claude are trademarks of Anthropic PBC. Benchmark names (SWE-bench, Terminal-Bench, GPQA, OSWorld, MCP-Atlas, GDPval) belong to their respective owners. Tech Jacks Solutions is not affiliated with, sponsored by, or endorsed by Google or Anthropic.
