Slashing Gemini API Costs With Context Caching
If you send the same long system prompt, document, or codebase with every request, you are paying full input price for tokens Google has already seen. Context caching cuts the bill on those repeated tokens by roughly 90%. On Gemini 3.1 Pro that turns $2.00 per million input tokens into $0.20 per million on the cached part. This guide shows the two caching modes, the exact pricing as of June 9, 2026, the math on when each one pays off, and the SDK code to wire up explicit caching yourself.
The 90% Payoff, Up Front
Here is the part worth acting on today. If your application re-sends the same large context on every call (a long instruction block, a reference document, a whole codebase), you are billed full input price for tokens the model has effectively already ingested. Caching changes the price on those repeated tokens. On Gemini 3.1 Pro the standard input rate is $2.00 per million tokens. A cached read of that same content is $0.20 per million. That is the 90% number, and it applies to the cached segment of each request, not your whole bill.
There are two ways to get there. One needs no code at all and is already running on your paid project. The other needs a few lines of SDK setup and a small ongoing fee, but gives you a result you can predict and budget for. Most teams should confirm the free one is working first, then reach for the paid one only when the numbers justify it.
The fast answer: implicit caching is already on for paid projects, so if you repeat context you are probably saving money without knowing it. Explicit caching is the lever you pull when you want guaranteed savings on one big shared context that gets re-used many times inside an hour or so.
Who this is for
You are running production traffic against the Gemini API, your input token counts dwarf your output token counts, and a big chunk of that input is the same on every request. Retrieval pipelines with a fixed knowledge base, coding agents that re-read a repository, document-QA tools, and long-system-prompt chat apps all fit this shape. If your inputs are short and rarely repeat, caching will not move your bill much, and you can skip to the decision section to confirm that.
Implicit vs Explicit Caching
As of May 2026 the Gemini API offers two caching modes. They both target the same problem and both deliver roughly a 90% discount on repeated input, but they behave very differently in practice. The difference comes down to who controls the cache and who pays to keep it warm.
Automatic, no knobs
Implicit caching is enabled by default on every paid project. Google hashes your recent inputs, and when a new request matches a recent one, the overlapping tokens are billed at the cached read rate automatically. You write no code, you flip no switch, and you pay no separate write or storage fee. The trade is control: you do not get to decide what stays cached or for how long, and you cannot guarantee a hit on any given request. You see the savings show up on the bill.
Deterministic, you pay to hold it
Explicit caching is opt-in. You upload a block of content, assign it a time-to-live, and get back a cache ID that you reference on later requests. The savings are predictable because you decide exactly what is cached and the read rate is fixed. In exchange you pay a one-time write fee to create the entry and an hourly storage fee for as long as the entry lives. It is a deliberate trade of a small known cost for a large known saving.
One practical note before the numbers: implicit caching has a small per-model minimum (roughly 2K-4K tokens); explicit caching has no stated minimum. Treat caching as something that helps in proportion to how much context you repeat, and validate the actual savings against your own usage metadata rather than assuming a cutoff.
Think of implicit caching as a discount the platform applies when it happens to recognize your input, and explicit caching as reserving a named locker for content you know you will reuse. The locker costs rent. Whether the rent is worth it is the whole question, and the next two sections answer it with real figures.
Cache Pricing, By Model
Here are the explicit-cache rates that matter, pulled from Google's published pricing and verified on June 9, 2026. The headline model is Gemini 3.1 Pro, where the gap between standard input and cached read is largest in absolute dollars. The Flash models are cheaper across the board, which changes the math on whether explicit caching is worth the storage fee at all.
| Model | Standard Input / 1M | Cache Read / 1M | Cache Write / 1M | Storage / 1M / Hr |
|---|---|---|---|---|
| Gemini 3.1 Pro | $2.00 (≤200K) / $4.00 (>200K) | $0.20 (≤200K) / $0.40 (>200K) | $0.50 | $4.50 |
| Gemini 3 Flash | $0.50 | $0.05 | text/image/video | $1.00 |
| Gemini 3.1 Flash-Lite | $0.25 | $0.025 | text/image/video | $1.00 |
Read the Pro row first. A cached read costs one tenth of the standard input rate, which is where the 90% figure comes from. The cache write of $0.50 per million is a one-time charge when you create the entry, so it amortizes to almost nothing once the content is read more than a few times. Storage is the line that bites: $4.50 per million tokens per hour means a 1M-token cache held for a full day costs about $108 in storage alone, whether or not you read it. That is why TTL discipline matters, and why explicit caching is a poor fit for content you touch only occasionally.
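To see how quickly the storage line grows with hold time, here is a minimal sketch that prices storage alone. It assumes the Gemini 3.1 Pro storage rate quoted in the table; the function name and the sample hold times are illustrative, not part of any SDK.

```python
# Storage rate for Gemini 3.1 Pro as quoted in this article (USD per 1M tokens per hour).
# Assumption: re-check the current rate before relying on these numbers.
STORAGE_USD_PER_M_TOKENS_PER_HOUR = 4.50


def storage_cost_usd(cached_tokens: int, hours: float,
                     rate: float = STORAGE_USD_PER_M_TOKENS_PER_HOUR) -> float:
    """Storage fee for holding `cached_tokens` for `hours`, whether or not it is read."""
    return cached_tokens / 1_000_000 * rate * hours


# Price a 1M-token cache at a few illustrative hold times
for hours in (1, 6, 24):
    print(f"{hours:>2} h: ${storage_cost_usd(1_000_000, hours):.2f}")
```

The 24-hour row reproduces the roughly $108 figure above, which is the cost of forgetting to expire a large entry for a single day.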
The Flash rows tell a different story. Flash input is already cheap at $0.50 and $0.25 per million, and the cached reads drop to $0.05 and $0.025. The absolute savings per read are small, so the $1.00-per-hour storage fee eats into them quickly. On Flash, implicit caching usually captures the realistic upside without you paying to hold anything. Reserve explicit caching for the Pro tier and for genuinely high-reuse Flash workloads.
All figures here are Google-reported and verified June 9, 2026. Preview models can change pricing and carry tighter rate limits. Confirm current rates on the official pricing page before you commit a budget.
The Cost Math In Plain Dollars
Take a concrete workload: a 100K-token system prompt that your app re-sends on every request, run across 1,000 requests on Gemini 3.1 Pro. The naive cost is easy. That is 100,000 tokens times 1,000 calls, or 100 million input tokens, billed at $2.00 per million. Round number: about $200, just for the repeated prompt.
Now wire up explicit caching with a one-hour TTL. You write the 100K-token prompt to a cache once, which is a fraction of a million tokens at the $0.50 write rate, so the write costs roughly $0.05. Each of the 1,000 requests then reads that 100K-token context at the cached rate of $0.20 per million, which works out to about $20 in reads total. Holding 100K tokens in storage for the hour at $4.50 per million per hour costs around $0.45. Add it up and you land near $20.50 against the original $200. That is the 90% saving, made of real line items rather than a marketing round number.
Without caching (about $200):
- 100M input tokens: $2.00 / 1M
- Cache write: $0
- Storage: $0
With explicit caching, one-hour TTL (about $20.50):
- Cache write (once): ~$0.05
- 1,000 cached reads: ~$20.00
- Storage (1 hour): ~$0.45
The shape of this math is what you should internalize, not the exact total. Read cost scales with how many times you reuse the content, so high reuse is what makes caching pay. Storage cost scales with how long you hold the entry, so a stale cache you forgot to expire is pure waste. The write fee is trivial. Keep the TTL just long enough to cover your burst of reuse, and the savings hold.
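The arithmetic above can be reproduced with a small calculator, which makes it easy to plug in your own token counts and TTL. This is a sketch under stated assumptions: it uses the Gemini 3.1 Pro rates quoted in this article for contexts up to 200K tokens, it prices only the repeated context (per-request questions and output tokens are left out), and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per 1M tokens; storage is per 1M tokens per hour (article's Pro rates)."""
    input: float = 2.00
    cache_read: float = 0.20
    cache_write: float = 0.50
    storage_per_hour: float = 4.50


def uncached_cost(context_tokens: int, requests: int, rates: Rates = Rates()) -> float:
    """Cost of re-sending the full context on every request at the standard input rate."""
    return context_tokens * requests / 1_000_000 * rates.input


def explicit_cache_cost(context_tokens: int, requests: int, ttl_hours: float,
                        rates: Rates = Rates()) -> dict:
    """Line items for writing the context once, reading it per request, and storing it."""
    millions = context_tokens / 1_000_000
    write = millions * rates.cache_write
    reads = millions * requests * rates.cache_read
    storage = millions * rates.storage_per_hour * ttl_hours
    return {"write": write, "reads": reads, "storage": storage,
            "total": write + reads + storage}


baseline = uncached_cost(100_000, 1_000)
cached = explicit_cache_cost(100_000, 1_000, ttl_hours=1)
print(f"uncached: ${baseline:.2f}")
for item, usd in cached.items():
    print(f"{item:>8}: ${usd:.2f}")
```

Changing `ttl_hours` or `requests` shows the two levers directly: reads scale with reuse, storage scales with hold time.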
Wiring Up Explicit Caching
Explicit caching is three moves: create the cache, reference it on each request, then read the usage metadata to confirm what was billed at the cached rate. All of it runs through the unified google-genai SDK. Start by creating the entry with your shared context and a TTL.
    # pip install -U google-genai
    from google import genai
    from google.genai import types

    client = genai.Client()

    # The shared context you re-send on every request
    with open("company_handbook.txt", encoding="utf-8") as f:
        system_context = f.read()

    # Create the cache entry once; it is billed for storage until the TTL lapses
    cache = client.caches.create(
        model="gemini-3.1-pro-preview",
        config=types.CreateCachedContentConfig(
            display_name="handbook-cache",
            contents=[system_context],
            ttl="3600s",  # 1 hour, keep it tight
        ),
    )
    print(cache.name)  # reference this on later requests
With the entry created, every subsequent call points at the cache name instead of resending the full context. Only your per-request question is billed at the standard input rate. The cached block is billed at the cached read rate.
    # Reuse the cache across many requests
    response = client.models.generate_content(
        model="gemini-3.1-pro-preview",
        contents="What is the PTO carryover policy?",
        config=types.GenerateContentConfig(
            cached_content=cache.name,
        ),
    )
    print(response.text)

    # Confirm what got billed at the cached rate
    usage = response.usage_metadata
    print("cached tokens:", usage.cached_content_token_count)
    print("prompt tokens:", usage.prompt_token_count)
That last block is the one practitioners skip and regret. The cached_content_token_count field tells you exactly how many tokens were billed at the discounted rate on each call. If that number is zero when you expected a hit, your cache reference is wrong, the entry expired, or the content drifted. Log it in staging and reconcile against your bill before you trust the savings in production.
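One way to make that logging habit concrete is a small helper that turns the two usage fields into a hit ratio you can chart or alert on. This is a sketch, not SDK functionality: `cached_fraction` is a hypothetical name, it assumes `prompt_token_count` includes the cached tokens, and it treats a missing or `None` cached count as zero. The sample object stands in for `response.usage_metadata` with made-up counts.

```python
from types import SimpleNamespace


def cached_fraction(usage) -> float:
    """Share of prompt tokens billed at the cached rate (0.0 when nothing was cached)."""
    cached = getattr(usage, "cached_content_token_count", None) or 0
    prompt = getattr(usage, "prompt_token_count", None) or 0
    return cached / prompt if prompt else 0.0


# Stand-in for response.usage_metadata; the counts are invented for illustration
fake_usage = SimpleNamespace(cached_content_token_count=100_000,
                             prompt_token_count=100_020)
print(f"{cached_fraction(fake_usage):.1%} of prompt tokens were cached")
```

A ratio that drops to zero on a path you expected to be cached is the signal to check the cache name and TTL.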
For implicit caching there is no code at all. Send the same context the same way on consecutive requests and check cached_content_token_count on the response. If it is nonzero, the platform recognized your input and discounted it for free.
When Explicit Beats Implicit
Explicit caching is not a free upgrade over implicit. You are adding a storage meter that runs whether or not you read the entry. The decision is a straight comparison: does the value of your discounted reads over the cache lifetime exceed what you pay to hold it? Use this checklist to settle it before you write any code.
- Signal: One large context, many reuses. A fixed knowledge base, handbook, or codebase that dozens or hundreds of requests share inside a short window.
- Signal: Reuse is concentrated in time. The reads happen within an hour or two, so a short TTL covers them and storage stays cheap.
- Helps: You are on the Pro tier. The absolute per-read saving on 3.1 Pro is large enough to clear the storage fee. On Flash the gap is thin.
- Helps: You want predictable savings. You need a number you can put in a budget rather than whatever implicit caching happens to catch.
If your traffic is bursty and naturally repetitive on its own, implicit caching already grabs most of the upside for nothing, and adding explicit caching can actually cost more once storage is counted. The honest default is: let implicit caching run, instrument cached_content_token_count, and only introduce explicit caching for the specific high-reuse, Pro-tier paths where the ledger clearly favors it.
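The comparison can be reduced to a break-even read count. The sketch below assumes the Gemini 3.1 Pro rates quoted in this article and, importantly, compares explicit caching against paying full input price on every request. Any hits that implicit caching would have given you for free raise the real break-even, so treat the result as a lower bound. The function name is hypothetical.

```python
import math


def break_even_reads(ttl_hours: float,
                     input_rate: float = 2.00,
                     read_rate: float = 0.20,
                     write_rate: float = 0.50,
                     storage_rate: float = 4.50) -> int:
    """Minimum reads within the TTL for explicit caching to at least match uncached cost.

    All rates are USD per 1M tokens (storage is per hour). The cached context
    size multiplies both the overhead and the saving, so it cancels out.
    """
    overhead_per_m = write_rate + storage_rate * ttl_hours  # write once, hold for the TTL
    saving_per_read_per_m = input_rate - read_rate          # discount on each cached read
    return math.ceil(overhead_per_m / saving_per_read_per_m)


for ttl in (1, 6, 24):
    print(f"TTL {ttl:>2} h: at least {break_even_reads(ttl)} reads to break even")
```

Because the context size cancels, the question is purely how many reads land inside the TTL, which is why concentrated reuse matters more than context size.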
The Cross-Vendor Picture
Caching does not live in a vacuum. The reason it matters so much on Gemini is that the model already carries a very large context window, so the kind of workload where you stuff an entire codebase or document set into the prompt is realistic in the first place. Pair a long context window with a 90% read discount and the economics of whole-repository analysis change.
One independent, illustrative comparison makes the point. Analysts have estimated that a large-codebase analysis workload that might cost on the order of $90,000 on a top-tier competitor model can drop to roughly $3,500 on Gemini 3.1 Pro by leaning on its long context window plus caching. Treat that figure as directional rather than a quote you can hold either vendor to. It depends heavily on token volume, reuse pattern, and how aggressively caching is applied, and it comes from independent third-party analysis, not from Google or the competing vendor.
The $90,000 to $3,500 comparison is independent and illustrative, drawn from third-party analyses rather than vendor pricing sheets. Use it to understand the magnitude of the lever, not as a guaranteed quote. Always model your own workload against the current rates.
The takeaway is not that one vendor is always cheaper. It is that on input-heavy, repetitive workloads, caching turns input volume from your dominant cost into a rounding error, and that shifts which model is the rational choice for a given job. If you are weighing options, the Gemini versus ChatGPT comparison covers the broader trade-offs beyond price.
Gotchas And Troubleshooting
Symptom: Explicit caching costs more than it saves.
Cause: The TTL is too long for the reuse, so storage at $4.50 per 1M per hour outweighs the read savings, or the entry is barely read at all.
Fix: Shorten the TTL to cover only your burst of reuse, and delete the cache when you are done. If reuse is naturally low, drop explicit caching and rely on implicit caching instead. See the decision checklist.
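One way to make "delete the cache when you are done" hard to forget is to wrap creation and deletion in a context manager. This is a sketch, not part of the SDK: `temporary_cache` is a hypothetical helper, and it assumes a google-genai client whose `caches.create(model=..., config=...)` and `caches.delete(name=...)` calls behave as in the setup code earlier in this guide.

```python
from contextlib import contextmanager


@contextmanager
def temporary_cache(client, model, config):
    """Create an explicit cache and delete it on exit, even if the body raises."""
    cache = client.caches.create(model=model, config=config)
    try:
        yield cache
    finally:
        # Stop the storage meter now rather than waiting for the TTL to lapse
        client.caches.delete(name=cache.name)
```

Used as `with temporary_cache(client, "gemini-3.1-pro-preview", config) as cache:`, every request in the burst passes `cache.name`, and the entry is removed when the block ends. The TTL still acts as a backstop if the process dies before cleanup runs.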
Symptom: cached_content_token_count is zero when you expected a hit.
Cause: The cache reference is wrong, the entry expired, or the content you sent no longer matches what was cached.
Fix: Confirm you are passing the exact cache.name returned at creation, that the TTL has not lapsed, and that the cached content is byte-for-byte identical to what you uploaded. Read the caching docs for the matching rules.
Symptom: Implicit cache hits are inconsistent and you cannot control them.
Cause: By design. Implicit caching has no knob. Google decides what stays warm and for how long based on recent inputs.
Fix: If you need a guarantee, that is exactly what explicit caching is for. To improve implicit hit rates, keep your repeated context in a stable position at the start of the request so consecutive calls match.
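A rough local check for "stable position at the start" is to measure how much leading text two consecutive prompts share. This is only a proxy: the real matching happens on Google's side and on tokens, not characters, and its exact rules are not described here. The helper name and the sample prompts are made up for illustration.

```python
import os


def shared_prefix_chars(prompt_a: str, prompt_b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([prompt_a, prompt_b]))


context = "handbook text " * 100  # placeholder for the repeated context

# Repeated context first, varying question last: the long prefix is shared
stable = [context + "What is the PTO policy?",
          context + "Who approves travel?"]

# Varying question first: the prompts diverge almost immediately
unstable = ["What is the PTO policy?\n" + context,
            "Who approves travel?\n" + context]

print("context first :", shared_prefix_chars(*stable), "shared chars")
print("question first:", shared_prefix_chars(*unstable), "shared chars")
```

If the shared prefix between consecutive requests is much shorter than your repeated context, something variable (a timestamp, a user ID, the question itself) is probably sitting ahead of it.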
Symptom: Explicit caching on Flash barely moves the bill.
Cause: Flash input is already cheap ($0.50 on 3 Flash, $0.25 on 3.1 Flash-Lite), so the absolute saving per cached read is small while storage is a flat $1.00 per 1M per hour.
Fix: On Flash, lean on implicit caching and reserve explicit caching for very high reuse. The big explicit-cache wins are on the Pro tier where the dollar gap is wide.
Symptom: The rates on your bill differ from the ones listed here.
Cause: Gemini 3.1 Pro, 3 Flash, and 3.1 Flash-Lite are preview models, and preview pricing can move.
Fix: The rates here are verified June 9, 2026. Re-check the official pricing page before budgeting, and re-run the cost math with current numbers.
Google, Gemini, Google AI Studio, and Vertex AI are trademarks of Google LLC. This article is not affiliated with, sponsored by, or endorsed by Google LLC.