Cutting the bill: FinOps for LLMs
Every API call to a language model costs money — metered token by token. As an app scales from a demo to thousands of users a day, that meter can run away from you fast. This lesson shows how the bill is built, then walks the real levers — caching, batching, smaller models, compression, semantic caching — that pull it back down, with a calculator you can drive right here.
01 How the bill is built: tokens, in and out
LLM APIs don't charge per request or per minute; they charge per token, the small chunks of text a model reads and writes. Pricing is normally quoted per million tokens and split into two separate rates: one for input (the prompt you send) and one for output (the text the model generates). Almost universally, output tokens cost more than input tokens, often several times more, because the model produces output one token at a time while it can process the whole prompt in parallel. That single fact makes controlling how much the model writes one of your first cost levers. Your monthly bill is essentially: requests per day × (input tokens × input rate + output tokens × output rate) × days, where each rate is the per-token price (the quoted per-million price divided by 1,000,000).
- Per-token, per-million: cost scales with the text volume you push through, not the number of calls.
- Input vs. output asymmetry: generated tokens are usually the pricier side, so trimming output length pays off quickly.
- It's a value question, not just a cost one: mature teams measure cost per request, per user, per feature — unit economics — and weigh it against the value delivered, the way the FinOps Foundation frames AI spend.
Prices change constantly. Every per-token rate, discount percentage, and reserved-capacity rate quoted by any vendor is vendor-reported and updates frequently. Treat the numbers in the calculator below as illustrative — always confirm current pricing on the provider's live pricing page before you budget.
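The bill formula above can be sketched in a few lines of Python. This is a minimal illustration, not a vendor calculator: the `Rates` class, the function names, the workload, and the prices are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Per-million-token prices in USD. The values used below are placeholders."""
    input_per_million: float
    output_per_million: float


def cost_per_request(input_tokens: int, output_tokens: int, rates: Rates) -> float:
    """Cost of one call: each side is billed at its own per-token rate."""
    return (
        input_tokens * rates.input_per_million
        + output_tokens * rates.output_per_million
    ) / 1_000_000


def monthly_bill(requests_per_day: int, input_tokens: int, output_tokens: int,
                 rates: Rates, days: int = 30) -> float:
    """requests per day x cost per request x days."""
    return requests_per_day * cost_per_request(input_tokens, output_tokens, rates) * days


# Hypothetical workload and placeholder prices, not any vendor's live rates.
rates = Rates(input_per_million=1.00, output_per_million=4.00)
baseline = monthly_bill(10_000, input_tokens=1_500, output_tokens=400, rates=rates)
shorter = monthly_bill(10_000, input_tokens=1_500, output_tokens=200, rates=rates)
print(f"baseline:           ${baseline:,.2f}/month")
print(f"with output halved: ${shorter:,.2f}/month")
```

With these placeholder numbers the baseline comes to $930.00 a month and halving the output length brings it to $690.00, even though output is only about a fifth of the tokens. That is the input/output asymmetry at work.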
02 The five levers that move the bill
Most LLM cost savings come from a small set of repeatable moves. They're not all equal, and a couple come with trade-offs. Each lever below notes what it does and when it helps.
Right-size the model — the single largest lever
Vendor pricing tables show order-of-magnitude differences between tiers — a small/fast model (Haiku-, Flash-, or Lite-class) versus a frontier model (Opus- or Pro-class). Sending every request to the most capable model is the most common source of overspend. A cascade (the FrugalGPT idea) takes this further: route to the cheapest model first and only escalate to a pricier one if a quality check fails.
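The cascade idea can be sketched as a loop over models ordered from cheapest to most expensive. This is a simplified illustration in the spirit of FrugalGPT, not the paper's implementation: the model functions are toy stand-ins for real API clients, and `quality_check` is a hypothetical scorer you would have to design for your own workload.

```python
from typing import Callable, Sequence

# (tier name, function that answers a prompt). The callables stand in for
# real API clients; nothing here calls a vendor SDK.
Model = tuple[str, Callable[[str], str]]


def cascade(prompt: str, models: Sequence[Model],
            is_good_enough: Callable[[str, str], bool]) -> tuple[str, str]:
    """Try models cheapest-first; return (tier, answer) from the first that passes.

    If no answer passes the check, the last (most capable) model's answer is returned.
    """
    if not models:
        raise ValueError("cascade needs at least one model")
    tier, answer = "", ""
    for tier, ask in models:
        answer = ask(prompt)
        if is_good_enough(prompt, answer):
            break
    return tier, answer


# Toy stand-ins so the sketch runs without network access.
def small_model(prompt: str) -> str:
    return "Paris" if "capital of France" in prompt else "I'm not sure."


def large_model(prompt: str) -> str:
    return "A detailed answer from the larger model."


def quality_check(prompt: str, answer: str) -> bool:
    # Hypothetical check: reject obvious non-answers.
    return "not sure" not in answer.lower()


tiers = [("small", small_model), ("large", large_model)]
print(cascade("What is the capital of France?", tiers, quality_check))
print(cascade("Summarise this contract clause.", tiers, quality_check))
```

Note the trade-off: an escalated request pays for both the cheap call and the expensive one, so a cascade only saves money when the cheap tier handles a large share of traffic and the quality check is itself inexpensive.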
Prompt caching — stop paying for the same context twice
If many requests share a long, identical prefix (a big system prompt, a shared document), prompt caching lets the provider reuse it at a reduced input rate. OpenAI does this automatically for prompts at or above 1,024 tokens with no code change; Anthropic and Google bill cache reads at roughly 10% of the base input price (cache writes carry a premium or a storage/TTL fee). It cuts input cost and latency — but only for the repeated portion.
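To see why caching only helps the repeated portion, here is a rough cost sketch. The function names are invented for this example, the 0.10 read multiplier mirrors the "roughly 10%" figure above, and the 1.25 write multiplier is an illustrative assumption for a cache-write premium. Real multipliers, minimum prefix lengths, and expiry rules vary by provider.

```python
def input_cost_without_cache(prefix_tokens: int, unique_tokens: int, requests: int,
                             input_per_million: float) -> float:
    """Every request pays the full input rate for prefix and unique tokens."""
    return requests * (prefix_tokens + unique_tokens) * input_per_million / 1_000_000


def input_cost_with_cache(prefix_tokens: int, unique_tokens: int, requests: int,
                          input_per_million: float,
                          cache_read_multiplier: float = 0.10,
                          cache_write_multiplier: float = 1.25) -> float:
    """Input-side cost when all requests share one cached prefix.

    Assumes one cache write on the first call and a cache hit on every later
    call (no expiry). Both multipliers are illustrative assumptions.
    """
    if requests < 1:
        return 0.0
    per_token = input_per_million / 1_000_000
    write = prefix_tokens * per_token * cache_write_multiplier
    reads = (requests - 1) * prefix_tokens * per_token * cache_read_multiplier
    unique = requests * unique_tokens * per_token  # never discounted
    return write + reads + unique


# Hypothetical workload: a 4,000-token shared system prompt, 200 unique tokens per call.
plain = input_cost_without_cache(4_000, 200, requests=1_000, input_per_million=1.00)
cached = input_cost_with_cache(4_000, 200, requests=1_000, input_per_million=1.00)
print(f"input cost without caching: ${plain:.4f}")
print(f"input cost with caching:    ${cached:.4f}")
```

The unique tokens and all output tokens are billed as usual, so the saving shrinks as the shared prefix becomes a smaller part of each request.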
Batch API — trade speed for a discount
If results don't need to be instant, send work asynchronously. OpenAI's Batch API and Azure OpenAI Batch run at about 50% off input and output; AWS Bedrock batch inference is around 50% of on-demand for eligible models — in exchange for up to a ~24-hour turnaround. Great for offline jobs, terrible for anything a user is waiting on.
Prompt compression — send fewer tokens
Because you pay per token, the cheapest token is the one you never send. Tightening prompts, dropping redundant context, and capping output length all reduce billed tokens directly. Since output is usually the pricier side, limiting how much the model writes (concise instructions, output caps) is an especially effective version of this lever.
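One common way to drop redundant context is to trim conversation history to a token budget. The sketch below is an assumption-laden illustration: `estimate_tokens` uses a crude four-characters-per-token heuristic rather than a real tokenizer, and `trim_history` is a hypothetical helper, not part of any SDK.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token for English text.

    This is a heuristic assumption; use the provider's tokenizer for real billing.
    """
    return max(1, len(text) // 4)


def trim_history(system_prompt: str, turns: list[str], budget: int) -> list[str]:
    """Keep the system prompt plus the most recent turns that fit in `budget` tokens."""
    remaining = budget - estimate_tokens(system_prompt)
    kept: list[str] = []
    for turn in reversed(turns):            # walk from newest to oldest
        cost = estimate_tokens(turn)
        if cost > remaining:
            break                           # older turns are dropped
        kept.append(turn)
        remaining -= cost
    return [system_prompt] + kept[::-1]     # restore chronological order


# Synthetic strings sized to about 10, 100, 20, and 10 estimated tokens.
system = "s" * 40
history = ["a" * 400, "b" * 80, "c" * 40]
trimmed = trim_history(system, history, budget=50)
print([estimate_tokens(part) for part in trimmed])
```

Dropping old turns can remove information the model needs, so a budget like this is best tuned against answer quality rather than cost alone.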
Semantic cache — answer near-duplicates without calling the model
A semantic cache (like the open-source GPTCache) embeds each incoming query and checks whether a sufficiently similar one was already answered. If so, it serves the stored answer and skips the LLM call entirely. That's different from prompt caching, which still runs the model — here you avoid the request altogether for repetitive traffic.
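The lookup-then-call flow can be sketched without any external service. This is a toy illustration and not GPTCache's API: `embed` is a bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector index, and the 0.8 threshold is an arbitrary assumption.

```python
import math
import re
from collections import Counter
from typing import Callable, Optional


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def lookup(self, query: str) -> Optional[str]:
        """Return the stored answer for the most similar query, if similar enough."""
        vec = embed(query)
        best_score, best_answer = 0.0, None
        for stored_vec, stored_answer in self.entries:
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_answer = score, stored_answer
        return best_answer if best_score >= self.threshold else None

    def store(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))


def answer_with_cache(query: str, cache: SemanticCache,
                      call_model: Callable[[str], str]) -> tuple[str, bool]:
    """Return (answer, was_cache_hit); the model is called only on a miss."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, True
    fresh = call_model(query)
    cache.store(query, fresh)
    return fresh, False


cache = SemanticCache()
stub_model = lambda q: f"(model answer to: {q})"   # stand-in for a real LLM call
for q in ["How do I reset my password?",
          "How can I reset my password?",
          "What is your refund policy?"]:
    _, hit = answer_with_cache(q, cache, stub_model)
    print(f"{'hit ' if hit else 'miss'}  {q}")
```

The threshold is the key design choice: set it too low and users receive answers to questions they did not ask, set it too high and the cache rarely hits. Computing the embedding also has a cost, usually far smaller than a generation call.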
Two levers not in the calculator deserve a flag. Provisioned/reserved capacity (Azure PTU, Bedrock provisioned throughput) bills a flat hourly rate regardless of tokens — it only beats per-token pricing at high, steady volume, and idle reserved capacity is wasted money. Quantization and distillation shrink a model to run cheaper, but they're self-hosting levers with accuracy trade-offs, not API-side switches.
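The "only at high, steady volume" condition for reserved capacity is a break-even calculation. The sketch below uses made-up numbers and hypothetical function names; it also assumes the reserved unit can actually serve the stated throughput and ignores commitment terms, both of which matter in practice.

```python
def breakeven_tokens_per_hour(hourly_rate: float, blended_per_million: float) -> float:
    """Tokens per hour at which a flat hourly charge equals per-token billing.

    `blended_per_million` is an assumed average price per million tokens across
    input and output for your traffic mix.
    """
    return hourly_rate / blended_per_million * 1_000_000


def cheaper_option(tokens_per_hour: float, hourly_rate: float,
                   blended_per_million: float) -> str:
    """Compare one hour of on-demand spend with the flat hourly charge."""
    on_demand = tokens_per_hour * blended_per_million / 1_000_000
    return "provisioned" if hourly_rate < on_demand else "on-demand"


# Placeholder prices: $20/hour reserved versus $2 per million tokens on demand.
print(breakeven_tokens_per_hour(20.0, 2.0))      # tokens/hour needed to break even
print(cheaper_option(3_000_000, 20.0, 2.0))      # light traffic
print(cheaper_option(15_000_000, 20.0, 2.0))     # heavy, sustained traffic
```

Because the hourly charge applies around the clock, the comparison should use average sustained throughput, not peak throughput. Quiet hours pull the average down and push the answer back toward on-demand.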
03 Drive the meter: a token-cost & savings lab
Set your workload, pick a model tier, and watch the monthly bill. Then switch on the levers one at a time and see each chip away at the cost — and the running total saved. The per-token prices here are illustrative placeholders chosen to show realistic relationships (output costs more than input; tiers differ by an order of magnitude); they are not any vendor's live rates.
Illustrative model only. Lever effects use representative, vendor-reported assumptions (e.g. caching reduces the cached input portion; batch ~50% off; semantic cache skips a share of calls) — actual savings depend entirely on your prompt structure, cache-hit rate, and current provider pricing.
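For readers who want to reproduce this kind of lab offline, here is a minimal sketch of a lever model. It is not the calculator's actual code: every parameter name, share, and price is an illustrative assumption, and the model ignores cache-write premiums, embedding costs, and any interaction between levers beyond simple multiplication.

```python
def monthly_cost(requests_per_day: float, input_tokens: float, output_tokens: float,
                 input_rate: float, output_rate: float, days: int = 30,
                 cached_input_share: float = 0.0, cache_read_multiplier: float = 0.10,
                 batch_share: float = 0.0, batch_discount: float = 0.50,
                 semantic_hit_rate: float = 0.0) -> float:
    """Monthly bill in USD; rates are per million tokens.

    cached_input_share: fraction of input tokens served from a prompt cache.
    batch_share:        fraction of traffic sent through a discounted batch API.
    semantic_hit_rate:  fraction of requests answered without calling the model.
    """
    billed_requests = requests_per_day * days * (1 - semantic_hit_rate)
    effective_input = input_tokens * (
        (1 - cached_input_share) + cached_input_share * cache_read_multiplier
    )
    per_request = (effective_input * input_rate + output_tokens * output_rate) / 1_000_000
    batch_factor = 1 - batch_share * batch_discount
    return billed_requests * per_request * batch_factor


# Hypothetical workload with placeholder prices.
workload = dict(requests_per_day=10_000, input_tokens=2_000, output_tokens=500,
                input_rate=3.00, output_rate=15.00)
steps = [
    ("baseline", {}),
    ("+ prompt caching (60% of input)", {"cached_input_share": 0.6}),
    ("+ batch (30% of traffic)", {"batch_share": 0.3}),
    ("+ semantic cache (20% hit rate)", {"semantic_hit_rate": 0.2}),
]

settings = dict(workload)
baseline = monthly_cost(**settings)
for label, change in steps:
    settings.update(change)                 # levers accumulate step by step
    cost = monthly_cost(**settings)
    print(f"{label:34s} ${cost:>9,.2f}   saved ${baseline - cost:>9,.2f}")
```

Switching model tier is modelled by changing `input_rate` and `output_rate`, and prompt compression by lowering `input_tokens` or `output_tokens`. Because the levers multiply, each one saves less in absolute terms once the others are already on.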
04 You can't cut what you can't see
Every lever above assumes you actually know where the money goes — and most teams don't, until they instrument for it. FinOps for LLMs starts with visibility: log token usage and cost per request, then attribute it to a team, feature, or user. Open-source proxies and observability tools (Helicone is one) sit in front of the API and record latency, result, and cost on every call, which is what lets you compute unit economics — cost per request, per active user, per feature — instead of staring at one big monthly invoice. With that in hand you can spot the runaway feature, prove a lever worked, and decide whether a workload's value justifies its spend.
Cost decisions are also governance decisions. The NIST AI Risk Management Framework (its Govern, Map, Measure, Manage functions) gives a structure for accountable AI resource decisions, and the FinOps Foundation's cross-functional model answers the human question underneath it: who owns and answers for AI spend? Engineering, finance, and product looking at the same per-feature numbers is what turns one-off cost-cutting into a durable practice.
- Instrument first: per-request token + cost logging is the foundation every other lever stands on.
- Attribute the cost: tie spend to teams, features, and users to find where it actually concentrates.
- Govern the decision: a framework (NIST AI RMF) plus cross-functional ownership keeps cost-and-value accountable over time.
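The instrument-then-attribute steps can be sketched as a per-call record plus two small aggregations. This is an in-memory illustration with hypothetical names (`CallRecord`, `cost_by`, `unit_economics`) and placeholder prices; in a real system the token counts would come from the usage data your provider or proxy reports for each call, and the records would go to a log store.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """One logged LLM call, tagged with who and what it was for."""
    feature: str
    user_id: str
    input_tokens: int
    output_tokens: int
    input_rate: float    # USD per million input tokens (placeholder)
    output_rate: float   # USD per million output tokens (placeholder)

    @property
    def cost(self) -> float:
        return (self.input_tokens * self.input_rate
                + self.output_tokens * self.output_rate) / 1_000_000


def cost_by(records: list[CallRecord], key: str) -> dict[str, float]:
    """Total cost grouped by a record attribute such as 'feature' or 'user_id'."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.cost
    return dict(totals)


def unit_economics(records: list[CallRecord]) -> dict[str, float]:
    """Cost per request and per active user over the logged calls."""
    total = sum(record.cost for record in records)
    users = {record.user_id for record in records}
    return {
        "total": total,
        "cost_per_request": total / len(records) if records else 0.0,
        "cost_per_active_user": total / len(users) if users else 0.0,
    }


log = [
    CallRecord("search", "u1", 1_000, 100, 1.00, 4.00),
    CallRecord("search", "u2", 1_000, 100, 1.00, 4.00),
    CallRecord("summarise", "u1", 8_000, 500, 1.00, 4.00),
]
print(cost_by(log, "feature"))
print(unit_economics(log))
```

In this toy log a single summarise call costs more than both search calls combined, which is the kind of concentration that stays invisible on a single monthly invoice.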
06 Take it with you & go deeper
Concept map
The whole lesson at a glance: how the bill is built, the levers that move it, and what keeps cost accountable.
How the bill is built: tokens, in and out
- LLM APIs meter cost per token, normally quoted per million tokens — cost scales with text volume, not the number of calls.
- Input (prompt) and output (generated) tokens are billed at separate rates; output usually costs more, so trimming generation length pays off.
- Mature teams treat it as a value question — unit economics (cost per request, user, feature) weighed against value, as the FinOps Foundation frames it.
The five levers that move the bill
- Right-size the model — the single largest lever; a cascade (FrugalGPT) routes to the cheapest model first and escalates only on a failed quality check.
- Prompt caching reuses a repeated prefix at a reduced input rate; batch (async) trades speed for roughly 50% off non-urgent work.
- Prompt compression sends fewer tokens; a semantic cache (e.g. GPTCache) skips the model entirely on near-duplicate queries.
- Two off-calculator levers: provisioned/reserved capacity (only wins at high steady volume) and quantization/distillation (a self-hosting lever with accuracy trade-offs).
Drive the meter: a token-cost & savings lab
- Set requests per day, input and output tokens, and a model tier, then watch the baseline monthly bill build from those inputs.
- Toggle each lever to see it chip away at the cost and a running cumulative-saved total.
- The per-token prices are illustrative placeholders chosen to show realistic relationships (output > input; tiers differ by an order of magnitude), not any vendor's live rates.
You can't cut what you can't see
- Instrument first: log token usage and cost per request (proxies/observability like Helicone) to compute unit economics instead of one big invoice.
- Attribute the cost to a team, feature, or user to find where spend actually concentrates.
- Govern the decision: the NIST AI RMF (Govern/Map/Measure/Manage) plus cross-functional ownership keeps cost-and-value accountable over time.
Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established cost-optimization mechanisms and is grounded in the references below. All per-token prices, discount percentages, and capacity rates are vendor-reported and change frequently; figures in the calculator are illustrative and labelled as such. Confirm current pricing at the source before budgeting.
- FinOps for AI — Overview — FinOps Foundation
- Optimizing GenAI Usage: Cost, Performance, and Efficiency — FinOps Foundation
- How to Build a GenAI Cost and Usage Tracker — FinOps Foundation
- OpenAI API Pricing — OpenAI
- Prompt Caching guide — OpenAI
- Batch API guide — OpenAI
- Prompt Caching (Claude API) — Anthropic
- Context Caching (Gemini API) — Google
- Amazon Bedrock Pricing — AWS
- Batch inference (Amazon Bedrock) — AWS
- Provisioned throughput for Foundry Models — Microsoft Learn
- FrugalGPT: Reducing Cost & Improving Performance (arXiv:2305.05176) — Chen, Zaharia, Zou (Stanford)
- GPTCache — Semantic cache for LLMs — Zilliz
- Cost Tracking & Optimization — Helicone
- AI Risk Management Framework (AI RMF) — NIST
This is an educational explainer. The calculator is an illustrative model, not a quote — real costs depend on your traffic, prompt structure, cache-hit rate, and each provider's current pricing, which changes often. Cost decisions can also carry quality, privacy, and reliability trade-offs (e.g. a cheaper model or aggressive caching may reduce answer quality). Verify pricing and eligibility on official vendor pages, and weigh cost against the value and risk of each workload before acting on it.
AI cost optimization (FinOps for LLMs) — in 5 minutes
Tech Jacks Solutions · AI Knowledge Hub · educational summary
How the bill is built
LLM APIs charge per token, normally quoted per million, with separate input and output rates. Output usually costs more than input, so controlling generation length is a primary lever. Mature teams track cost per request/user/feature (unit economics), not just the total bill.
The five cost levers
Right-size the model (the biggest lever: tiers differ by roughly an order of magnitude; cascade from cheap to expensive). Prompt caching (reuse a repeated prefix at a reduced input rate). Batch API (~50% off for async work). Prompt compression (send fewer tokens; cap output). Semantic cache (skip the model on near-duplicate queries).
Trade-offs
Reserved/provisioned capacity (Azure PTU, Bedrock provisioned) is a flat hourly charge — cheaper only at high, steady volume. Quantization/distillation are self-hosting levers with accuracy trade-offs, not API switches.
Measure & govern
Instrument per-request token + cost logging and attribute it to teams/features. Use the NIST AI RMF (Govern/Map/Measure/Manage) plus cross-functional FinOps ownership to keep cost-and-value accountable.
Prices change
Every rate and discount is vendor-reported and updates often. Treat all figures as illustrative; verify on the live vendor pricing page before budgeting.