Agentic systems lesson

Track 03 · Agentic Intermediate ~8 min

Cutting the bill: FinOps for LLMs

Every API call to a language model costs money — metered token by token. As an app scales from a demo to thousands of users a day, that meter can run away from you fast. This lesson shows how the bill is built, then walks the real levers — caching, batching, smaller models, compression, semantic caching — that pull it back down, with a calculator you can drive right here.

Module progress

01How the bill is built: tokens, in and out

LLM APIs don't charge per request or per minute — they charge per token, the small chunks of text a model reads and writes. Pricing is normally quoted per million tokens, and it's split into two separate rates: one for input (the prompt you send) and one for output (the text the model generates). Almost universally, output tokens cost more than input tokens — often several times more — because generating text is the expensive part. That single fact makes controlling how much the model writes one of your first cost levers. Your monthly bill is essentially: requests per day × (input tokens × input rate + output tokens × output rate) × days.

Per-token, per-million: cost scales with the text volume you push through, not the number of calls.
Input vs. output asymmetry: generated tokens are usually the pricier side, so trimming output length pays off quickly.
It's a value question, not just a cost one: mature teams measure cost per request, per user, per feature — unit economics — and weigh it against the value delivered, the way the FinOps Foundation frames AI spend.

Prices change constantly. Every per-token rate, discount percentage, and reserved-capacity rate quoted by any vendor is vendor-reported and updates frequently. Treat the numbers in the calculator below as illustrative — always confirm current pricing on the provider's live pricing page before you budget.

02The five levers that move the bill

Most LLM cost savings come from a small set of repeatable moves. They're not all equal, and a couple come with trade-offs. Switch between them to see what each one does and when it helps.

InteractiveTap a lever

Right-size the model — the single largest lever

Vendor pricing tables show order-of-magnitude differences between tiers — a small/fast model (Haiku-, Flash-, or Lite-class) versus a frontier model (Opus- or Pro-class). Sending every request to the most capable model is the most common source of overspend. A cascade (the FrugalGPT idea) takes this further: route to the cheapest model first and only escalate to a pricier one if a quality check fails.

when it helps: tasks that don't need frontier capability — classification, extraction, routing, simple drafting

Prompt caching — stop paying for the same context twice

If many requests share a long, identical prefix (a big system prompt, a shared document), prompt caching lets the provider reuse it at a reduced input rate. OpenAI does this automatically for prompts at or above 1,024 tokens with no code change; Anthropic and Google bill cache reads at roughly 10% of the base input price (cache writes carry a premium or a storage/TTL fee). It cuts input cost and latency — but only for the repeated portion.

when it helps: a fixed system prompt or shared context reused across many calls

Batch API — trade speed for a discount

If results don't need to be instant, send work asynchronously. OpenAI's Batch API and Azure OpenAI Batch run at about 50% off input and output; AWS Bedrock batch inference is around 50% of on-demand for eligible models — in exchange for up to a ~24-hour turnaround. Great for offline jobs, terrible for anything a user is waiting on.

when it helps: nightly enrichment, bulk classification, evals, report generation

Prompt compression — send fewer tokens

Because you pay per token, the cheapest token is the one you never send. Tightening prompts, dropping redundant context, and capping output length all reduce billed tokens directly. Since output is usually the pricier side, limiting how much the model writes (concise instructions, output caps) is an especially effective version of this lever.

when it helps: verbose prompts, oversized context windows, unbounded generations

Semantic cache — answer near-duplicates without calling the model

A semantic cache (like the open-source GPTCache) embeds each incoming query and checks whether a sufficiently similar one was already answered. If so, it serves the stored answer and skips the LLM call entirely. That's different from prompt caching, which still runs the model — here you avoid the request altogether for repetitive traffic.

when it helps: FAQ-style or repetitive workloads where users ask the same things in different words

Two levers not in the calculator deserve a flag. Provisioned/reserved capacity (Azure PTU, Bedrock provisioned throughput) bills a flat hourly rate regardless of tokens — it only beats per-token pricing at high, steady volume, and idle reserved capacity is wasted money. Quantization and distillation shrink a model to run cheaper, but they're self-hosting levers with accuracy trade-offs, not API-side switches.

03Drive the meter: a token-cost & savings lab

Set your workload, pick a model tier, and watch the monthly bill. Then switch on the levers one at a time and see each chip away at the cost — and the running total saved. The per-token prices here are illustrative placeholders chosen to show realistic relationships (output costs more than input; tiers differ by an order of magnitude); they are not any vendor's live rates.

InteractiveDrag the sliders, toggle the levers

Your workload

Requests per day10,000

Input tokens / request1,500

Output tokens / request500

Model tier

Baseline monthly bill

Input (0 tok/req)$0

Output (0 tok/req)$0

Optimized monthly bill

cumulative saved
0%

Illustrative model only. Lever effects use representative, vendor-reported assumptions (e.g. caching reduces the cached input portion; batch ~50% off; semantic cache skips a share of calls) — actual savings depend entirely on your prompt structure, cache-hit rate, and current provider pricing.

04You can't cut what you can't see

Every lever above assumes you actually know where the money goes — and most teams don't, until they instrument for it. FinOps for LLMs starts with visibility: log token usage and cost per request, then attribute it to a team, feature, or user. Open-source proxies and observability tools (Helicone is one) sit in front of the API and record latency, result, and cost on every call, which is what lets you compute unit economics — cost per request, per active user, per feature — instead of staring at one big monthly invoice. With that in hand you can spot the runaway feature, prove a lever worked, and decide whether a workload's value justifies its spend.

Cost decisions are also governance decisions. The NIST AI Risk Management Framework (its Govern, Map, Measure, Manage functions) gives a structure for accountable AI resource decisions, and the FinOps Foundation's cross-functional model answers the human question underneath it: who owns and answers for AI spend? Engineering, finance, and product looking at the same per-feature numbers is what turns one-off cost-cutting into a durable practice.

Instrument first: per-request token + cost logging is the foundation every other lever stands on.
Attribute the cost: tie spend to teams, features, and users to find where it actually concentrates.
Govern the decision: a framework (NIST AI RMF) plus cross-functional ownership keeps cost-and-value accountable over time.

05Check your understanding

TJS Quiz

06Take it with you & go deeper

"FinOps for LLMs" — one-page cheat-sheet

The whole module distilled to a printable summary.

▸ Already on the site — go deeper

Live lesson

LLM Routing & Gateways

The infrastructure that makes model cascades and per-request cost control practical.

Read →

Live lesson

Inference Optimization

KV-cache, batching, and latency — the serving-side levers behind cheaper self-hosted inference.

Read →

▸ Coming next — deeper progression

Coming soon

Small Language Models & On-Device AI

When a smaller, cheaper model — even one running locally — is the right call.

Coming soon

LLMOps: Monitoring & Observability

The instrumentation layer that turns per-request cost logging into a durable practice.

Coming soon

⊕Concept map

The whole lesson at a glance — how the bill is built, the levers that move it, and what keeps cost accountable. Expand a branch to review.

How the bill is built: tokens, in and out

LLM APIs meter cost per token, normally quoted per million tokens — cost scales with text volume, not the number of calls.
Input (prompt) and output (generated) tokens are billed at separate rates; output usually costs more, so trimming generation length pays off.
Mature teams treat it as a value question — unit economics (cost per request, user, feature) weighed against value, as the FinOps Foundation frames it.

The five levers that move the bill

Right-size the model — the single largest lever; a cascade (FrugalGPT) routes to the cheapest model first and escalates only on a failed quality check.
Prompt caching reuses a repeated prefix at a reduced input rate; batch (async) trades speed for roughly 50% off non-urgent work.
Prompt compression sends fewer tokens; a semantic cache (e.g. GPTCache) skips the model entirely on near-duplicate queries.
Two off-calculator levers: provisioned/reserved capacity (only wins at high steady volume) and quantization/distillation (a self-hosting lever with accuracy trade-offs).

Drive the meter: a token-cost & savings lab

Set requests per day, input and output tokens, and a model tier, then watch the baseline monthly bill build from those inputs.
Toggle each lever to see it chip away at the cost and a running cumulative-saved total.
The per-token prices are illustrative placeholders chosen to show realistic relationships (output > input; tiers differ by an order of magnitude), not any vendor's live rates.

You can't cut what you can't see

Instrument first: log token usage and cost per request (proxies/observability like Helicone) to compute unit economics instead of one big invoice.
Attribute the cost to a team, feature, or user to find where spend actually concentrates.
Govern the decision: the NIST AI RMF (Govern/Map/Measure/Manage) plus cross-functional ownership keeps cost-and-value accountable over time.

Continue your path

Where to go next

You just finished AI Cost Optimization (FinOps for LLMs). Here’s a natural progression — from what builds directly on it to where to go deeper.

Foundations→Language & models→Agentic ✓→Governance

Recommended next

Small Language Models & On-Device AI

Continue with Small Language Models & On-Device AI.

Open lesson →

Build on this

Agentic~11 min

Model Serving & Deployment Patterns

+What you’ll learnHide

Continue with Model Serving & Deployment Patterns.

Open lesson →

Agentic~10 min

Model Context Protocol

+What you’ll learnHide

What MCP is, how hosts, clients and servers connect, and why it matters.

Open lesson →

Agentic~10 min

AI Agents

+What you’ll learnHide

How agents perceive, reason, use tools and act, and how they differ from chatbots.

Open lesson →

Agentic~8 min

RAG

+What you’ll learnHide

How retrieval grounds LLM answers, step by step.

Open lesson →

Go deeper

Agentic~13 min

LLM Routing & Gateways

+What you’ll learnHide

Continue with LLM Routing & Gateways.

Open lesson →

Agentic~12 min

Inference Optimization (KV-Cache, Batching, Latency)

+What you’ll learnHide

Continue with Inference Optimization (KV-Cache, Batching, Latency).

Open lesson →

Sources & further reading

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established cost-optimization mechanisms and is grounded in the references below. All per-token prices, discount percentages, and capacity rates are vendor-reported and change frequently — figures in the calculator are illustrative and labelled as such; confirm current pricing at the source before budgeting.

FinOps for AI — Overview — FinOps Foundation
Optimizing GenAI Usage: Cost, Performance, and Efficiency — FinOps Foundation
How to Build a GenAI Cost and Usage Tracker — FinOps Foundation
OpenAI API Pricing — OpenAI
Prompt Caching guide — OpenAI
Batch API guide — OpenAI
Prompt Caching (Claude API) — Anthropic
Context Caching (Gemini API) — Google
Amazon Bedrock Pricing — AWS
Batch inference (Amazon Bedrock) — AWS
Provisioned throughput for Foundry Models — Microsoft Learn
FrugalGPT: Reducing Cost & Improving Performance (arXiv:2305.05176) — Chen, Zaharia, Zou (Stanford)
GPTCache — Semantic cache for LLMs — Zilliz
Cost Tracking & Optimization — Helicone
AI Risk Management Framework (AI RMF) — NIST

Responsible use

This is an educational explainer. The calculator is an illustrative model, not a quote — real costs depend on your traffic, prompt structure, cache-hit rate, and each provider's current pricing, which changes often. Cost decisions can also carry quality, privacy, and reliability trade-offs (e.g. a cheaper model or aggressive caching may reduce answer quality). Verify pricing and eligibility on official vendor pages, and weigh cost against the value and risk of each workload before acting on it.

AI cost optimization (FinOps for LLMs) — in 5 minutes

Tech Jacks Solutions · AI Knowledge Hub · educational summary

How the bill is built

LLM APIs charge per token, normally quoted per million, with separate input and output rates. Output usually costs more than input, so controlling generation length is a primary lever. Mature teams track cost per request/user/feature (unit economics), not just the total bill.

The five cost levers

Right-size the model (biggest lever — tiers differ by ~order of magnitude; cascade cheap→expensive). Prompt caching (reuse a repeated prefix at a reduced input rate). Batch API (~50% off for async work). Prompt compression (send fewer tokens; cap output). Semantic cache (skip the model on near-duplicate queries).

Trade-offs

Reserved/provisioned capacity (Azure PTU, Bedrock provisioned) is a flat hourly charge — cheaper only at high, steady volume. Quantization/distillation are self-hosting levers with accuracy trade-offs, not API switches.

Measure & govern

Instrument per-request token + cost logging and attribute it to teams/features. Use the NIST AI RMF (Govern/Map/Measure/Manage) plus cross-functional FinOps ownership to keep cost-and-value accountable.

Prices change

Every rate and discount is vendor-reported and updates often. Treat all figures as illustrative; verify on the live vendor pricing page before budgeting.

Gallery

Contacts

Cutting the bill: FinOps for LLMs

01How the bill is built: tokens, in and out

02The five levers that move the bill

Right-size the model — the single largest lever

Prompt caching — stop paying for the same context twice

Batch API — trade speed for a discount

Prompt compression — send fewer tokens

Semantic cache — answer near-duplicates without calling the model

03Drive the meter: a token-cost & savings lab

04You can't cut what you can't see

05Check your understanding

06Take it with you & go deeper

LLM Routing & Gateways

Inference Optimization

Small Language Models & On-Device AI

LLMOps: Monitoring & Observability

⊕Concept map

Where to go next

AI cost optimization (FinOps for LLMs) — in 5 minutes

How the bill is built

The five cost levers

Trade-offs

Measure & govern

Prices change

Services

Learn

Company