Making models serve faster
A trained model is only half the battle. Serving it to thousands of users quickly and cheaply is its own engineering discipline. This lesson covers the ideas behind fast inference — the KV cache, the prefill-vs-decode split, continuous batching, and speculative decoding — and lets you toggle each optimization on a serving-throughput sandbox right here on the page.
01 One request, two very different phases
Generating an answer from a language model happens in two phases, and they behave so differently that almost every optimization targets one or the other. Prefill reads your entire prompt at once: every input token is processed in parallel in a single pass, which keeps the GPU's math units busy — it is compute-bound. Decode then generates the reply one token at a time, feeding each new token back in to produce the next. Each decode step does very little math but has to re-read the model's weights from memory, so it is starved for memory bandwidth rather than compute. That single split — a fast, compute-heavy prefill followed by a slow, bandwidth-limited decode — is the key to understanding everything else here.
Prefill (compute-bound)
Processes the whole prompt in parallel in one forward pass. Saturates GPU compute even at small batch sizes.
Decode (memory-bound)
Generates output tokens one at a time, autoregressively. Low compute use; limited by reading weights from memory.
- Prefill is parallel and compute-bound; it sets how long you wait for the first token (TTFT).
- Decode is sequential and memory-bandwidth-bound; it sets the speed of every following token (TPOT).
- Most serving optimizations make sense only once you know which phase they help.
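The compute-bound versus memory-bound split can be checked with a rough arithmetic-intensity estimate. The sketch below assumes about 2 FLOPs per parameter per token and that every weight is read from memory once per forward pass; it ignores attention and KV-cache traffic. The accelerator figures (`PEAK_FLOPS`, `MEM_BANDWIDTH`) and the function names are hypothetical, chosen only to illustrate the shape of the argument.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token (a multiply and an add) and that
    each weight is loaded once per pass. Attention and KV-cache reads are ignored.
    """
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param


def bound_by(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a pass as compute- or memory-bound against the hardware ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte it can load
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute-bound"
    return "memory-bound"


# Hypothetical accelerator: 300 TFLOP/s of fp16 math, 2 TB/s of memory bandwidth.
PEAK_FLOPS = 300e12
MEM_BANDWIDTH = 2e12

# Prefill pushes a 2048-token prompt through the weights in one pass.
prefill = bound_by(2048, PEAK_FLOPS, MEM_BANDWIDTH)
# A single-request decode step pushes just one token through the same weights.
decode = bound_by(1, PEAK_FLOPS, MEM_BANDWIDTH)
print(prefill, decode)
```

Under these assumptions the same weights are read either way, so the only thing that changes is how much math each byte supports: thousands of tokens per pass in prefill, one token per pass in single-request decode.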
02 The KV cache: don't redo work you've already done
During decode, the model's attention step needs the key and value vectors for every token seen so far. Re-computing those from scratch at every step would be enormously wasteful, so they are stored in a key-value (KV) cache and reused. The catch: the cache grows linearly with the sequence length and the number of requests in the batch, and it is usually the largest non-weight consumer of GPU memory during serving. Whoever controls KV-cache memory controls how many requests you can serve at once.
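To make the linear growth concrete, here is a back-of-the-envelope size estimate. The model shape used (32 layers, 32 key/value heads of dimension 128, 16-bit values) is a hypothetical configuration picked for round numbers, not a measurement of any specific model, and `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for every token of every request."""
    # The leading 2 counts one key vector plus one value vector
    # per token, per layer, per KV head.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 1024 ** 3

# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128, fp16.
one_request = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
batch_of_16 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16)

print(one_request / GIB)   # 2.0  (GiB for one 4096-token request)
print(batch_of_16 / GIB)   # 32.0 (GiB for sixteen such requests)
```

Doubling either the sequence length or the batch size doubles the footprint, which is why cache memory, rather than compute, usually caps how many requests fit on one GPU.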
Two families of techniques help. The first changes how the cache is laid out: PagedAttention (the core of the vLLM engine) splits the cache into fixed-size blocks that can live non-contiguously in memory, the same trick operating systems use for virtual-memory paging, which cuts fragmentation and lets requests share blocks. The second shrinks the cache itself: multi-query and grouped-query attention (MQA/GQA) let many query heads share a smaller set of key/value heads, while eviction policies (H2O) and low-precision quantization of the cached keys and values (KIVI) trade a little quality for a much smaller footprint. A smaller cache leaves room for larger batches.
- The KV cache stores past key/value vectors so the model never recomputes them — but it grows with sequence length and batch size.
- PagedAttention manages the cache like OS memory pages: less fragmentation, flexible sharing across requests.
- MQA/GQA shrink the cache by sharing K/V heads; eviction and quantization shrink it further, at some quality cost.
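The paging idea can be sketched as a toy block table. This is a minimal illustration of fixed-size, non-contiguous allocation only; it is not vLLM's implementation, it does not model block sharing between requests, and the class and method names are hypothetical.

```python
class PagedKVCache:
    """Toy block table: maps each request to a list of physical block ids."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size              # tokens stored per block
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                    # request id -> list of block ids
        self.num_tokens = {}                      # request id -> tokens cached so far

    def append_token(self, request_id):
        """Reserve cache space for one more token of a request."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())   # any free block will do
        self.num_tokens[request_id] = count + 1

    def release(self, request_id):
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):             # two requests grow in lockstep, 6 tokens each
    cache.append_token("a")
    cache.append_token("b")
print(cache.block_tables)      # each request holds 2 blocks, interleaved in memory
cache.release("a")             # both of a's blocks are reusable immediately
```

Because a request only ever holds whole blocks it has started to fill, the wasted space per request is at most one partly filled block, instead of a large contiguous region reserved for the maximum possible length.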
03 Sandbox: toggle the optimizations, watch the numbers move
Below is a single request being served, drawn as a GPU timeline: a short compute-bound prefill block, then a long string of memory-bound decode steps. Flip each optimization on and off and watch the latency and throughput estimates respond. The numbers are illustrative — they show the direction and rough shape of each technique's effect, not a benchmark of any specific model or GPU.
Illustrative estimates — relative effects only, not a benchmark. Real numbers depend on model, hardware, and workload.
- KV cache mainly speeds up per-token decode (TPOT) by avoiding recomputation.
- Continuous batching packs many requests onto the GPU, lifting throughput far more than single-request latency.
- Speculative decoding verifies several drafted tokens per target pass, lowering TPOT without changing the output.
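The two latency metrics combine into a simple end-to-end estimate: the first token arrives after TTFT, and each later token adds one TPOT. The sketch below is a toy model with hypothetical timings; it is not the formula behind the sandbox and not a benchmark.

```python
def request_latency(ttft_s, tpot_s, output_tokens):
    """End-to-end latency: first token after prefill, then one TPOT per later token."""
    return ttft_s + max(output_tokens - 1, 0) * tpot_s


# Hypothetical timings: 200 ms to first token, 30 ms per output token.
baseline = request_latency(ttft_s=0.20, tpot_s=0.030, output_tokens=201)
# Halving TPOT (for example through faster decode) leaves TTFT untouched.
faster_decode = request_latency(ttft_s=0.20, tpot_s=0.015, output_tokens=201)

print(round(baseline, 3), round(faster_decode, 3))
```

For long replies the TPOT term dominates, so decode-side optimizations move total latency far more than prefill-side ones; for short replies to long prompts the reverse holds.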
04 Continuous batching: keep the GPU full
Because decode is memory-bandwidth-bound, running a single request leaves most of the GPU idle. The fix is to serve many requests at once. Naive (static) batching waits for a whole group of requests to finish before starting the next group — so the entire batch is held hostage by its longest sequence, wasting GPU time as shorter ones sit done-but-waiting. Continuous batching (also called iteration-level scheduling or in-flight batching, introduced by the Orca system) instead schedules work one decode step at a time: as soon as any sequence finishes, a new request takes its slot. The batch is continuously refilled, GPU idle time drops, and throughput rises sharply — which is why production engines such as vLLM, TensorRT-LLM, and TGI build on this idea.
There is one more wrinkle. Prefill is compute-bound and decode is memory-bound, so mixing them well matters. Chunked prefill (SARATHI) splits a long prompt into chunks and piggybacks ongoing decode steps onto them to keep both resources busy, while disaggregation (DistServe) goes further and runs prefill and decode on different GPUs so they stop interfering with each other and each phase can be tuned independently.
- Static batching wastes time: the batch waits for its longest sequence to finish.
- Continuous batching refills slots per decode step, cutting idle time and lifting throughput.
- Chunked prefill and prefill/decode disaggregation stop the two phases from starving each other.
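The gap between static and continuous batching can be illustrated by counting decode iterations in a toy scheduler. The sketch assumes every decode step costs the same regardless of how full the batch is, ignores prefill entirely, and assumes each request produces at least one token; the function names and the workload are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Decode steps when a finished sequence's slot is refilled at the next iteration."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate for each in-flight sequence
    steps = 0
    while waiting or running:
        # Refill any free slots before the next iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode iteration: every running sequence emits one token
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


# Hypothetical workload: two long replies mixed in with six short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, batch_size=4))      # 200
print(continuous_batching_steps(lengths, batch_size=4))  # 105
```

In the static case the second long request cannot start until the first batch drains at step 100. With per-iteration scheduling it takes over a freed slot at step 6, so the two long requests overlap almost completely.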
05 Speculative decoding, and exact vs. approximate
Decode is slow because each step produces just one token yet still pays to read the whole model. Speculative decoding exploits this: a small, fast draft model proposes several candidate tokens, and the large target model verifies them all at once in a single forward pass, which in the bandwidth-bound decode regime costs roughly the same as generating one token. A modified rejection-sampling step accepts or replaces each drafted token so that the final tokens follow the same distribution the target model would have sampled from on its own; the output distribution is preserved exactly. A related approach, Medusa, drops the separate draft model and instead adds extra decoding heads that predict several future tokens in parallel, verified together with tree-based attention.
One distinction is worth holding onto. Some optimizations are exact — speculative decoding, PagedAttention, and FlashAttention (an IO-aware attention kernel that tiles work to minimize slow memory reads) change speed and memory only, never the output. Others are approximate — KV-cache eviction (H2O) and low-bit quantization (KIVI) shrink memory by discarding or compressing information, which can shift quality. Knowing which bucket a technique falls into tells you whether to expect identical answers or a quality trade-off.
- Speculative decoding drafts several tokens cheaply and verifies them in parallel; the output distribution is exactly the target model's.
- Medusa reaches the same goal with extra decoding heads instead of a separate draft model.
- Exact methods (speculative decoding, PagedAttention, FlashAttention) never change outputs; approximate methods (eviction, quantization) can.
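The exactness claim for speculative decoding can be checked numerically for a single token. The sketch below follows the acceptance rule described in the speculative decoding papers: a drafted token x is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the normalized residual max(0, p − q). It computes the resulting distribution analytically rather than by sampling. The second helper uses the expected-tokens formula from Leviathan et al., which assumes each drafted token is accepted independently at the same rate. The probabilities and function names are hypothetical.

```python
def speculative_output_distribution(target_p, draft_q):
    """Exact distribution of one token emitted by draft-then-verify."""
    # Probability that x is drafted AND accepted: q(x) * min(1, p(x) / q(x)).
    accept = [q * min(1.0, p / q) if q > 0 else 0.0
              for p, q in zip(target_p, draft_q)]
    reject_mass = 1.0 - sum(accept)
    # On rejection, resample from the part of p that q under-covers.
    residual = [max(0.0, p - q) for p, q in zip(target_p, draft_q)]
    residual_total = sum(residual)
    if residual_total == 0.0:  # draft equals target, so nothing is ever rejected
        return accept
    return [a + reject_mass * r / residual_total
            for a, r in zip(accept, residual)]


def expected_tokens_per_pass(acceptance_rate, drafted_tokens):
    """Expected tokens per target pass under an i.i.d. acceptance assumption."""
    if acceptance_rate == 1.0:
        return drafted_tokens + 1
    return (1 - acceptance_rate ** (drafted_tokens + 1)) / (1 - acceptance_rate)


# Hypothetical next-token distributions over a three-token vocabulary.
target_p = [0.6, 0.3, 0.1]
draft_q = [0.3, 0.5, 0.2]

result = speculative_output_distribution(target_p, draft_q)
print([round(x, 6) for x in result])             # matches target_p
print(round(expected_tokens_per_pass(0.8, 4), 4))
```

Even though the draft distribution here is a poor match for the target, the combined accept-or-resample step reproduces the target distribution; a worse draft model lowers the acceptance rate and therefore the speedup, not the quality.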
07 Continue learning
Concept map
A bird's-eye view of inference optimization: the key ideas from this lesson, grouped by topic.
Two phases: prefill vs. decode
- Prefill processes the whole prompt in parallel and is compute-bound, saturating the GPU even at small batch sizes.
- Decode generates tokens one at a time and is memory-bandwidth-bound with low compute utilization.
- Each phase has its own latency metric: time to first token (TTFT) for prefill, time per output token (TPOT) for decode.
The KV cache
- Key/value tensors from previous tokens are cached so they aren't recomputed each decoding step.
- It grows linearly with sequence length and batch size and is usually the dominant non-weight consumer of GPU memory.
- Footprint shrinks via MQA/GQA (shared K/V heads), eviction of all but recent + heavy-hitter tokens (H2O), and quantization (KIVI 2-bit) — smaller caches allow larger batches.
Memory & attention tricks
- PagedAttention partitions the KV cache into fixed-size blocks (OS-paging analogy), cutting fragmentation and enabling block sharing — the core of vLLM.
- FlashAttention is an IO-aware, exact attention algorithm that tiles to minimize HBM↔SRAM transfers; FlashAttention-2 improves parallelism.
Continuous batching
- Also called iteration-level or in-flight batching: schedules work per decode iteration, not per whole request.
- New requests join the running batch as sequences finish, instead of waiting for the longest one (the waste in static batching).
- Origin: Orca (OSDI'22); used in vLLM, TensorRT-LLM, and TGI. SARATHI chunks prefill onto decode batches; DistServe disaggregates the two phases onto different GPUs.
Speculative decoding & exact vs. approximate
- A small draft model proposes several tokens; the large model verifies them in parallel, and rejection sampling preserves the exact output distribution.
- Medusa adds extra decoding heads to predict future tokens without a separate draft model, verifying candidates with tree attention.
- Speculative decoding, FlashAttention, and PagedAttention are exact; KV eviction (H2O) and low-bit quantization are approximations that can affect quality — don't conflate them.
Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts grounded in the primary references below. Speedup and throughput figures in the papers are measured on specific models, hardware, and workloads; treat any single number as illustrative and read it in the context of its source. The serving-engine landscape moves quickly, so feature claims are tied to the cited docs and may change between releases.
- Efficient Memory Management for LLM Serving with PagedAttention — Kwon et al., SOSP 2023
- FlashAttention: Fast and Memory-Efficient Exact Attention — Dao et al., NeurIPS 2022
- Orca: A Distributed Serving System for Transformer-Based Generative Models — Yu et al., OSDI 2022
- Fast Inference from Transformers via Speculative Decoding — Leviathan et al., ICML 2023
- Accelerating LLM Decoding with Speculative Sampling — Chen et al., DeepMind
- Medusa: Simple LLM Inference Acceleration with Multiple Decoding Heads — Cai et al., 2024
- GQA: Training Generalized Multi-Query Transformer Models — Ainslie et al., EMNLP 2023
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference — Zhang et al., NeurIPS 2023
- SARATHI: Efficient LLM Inference with Chunked Prefills — Agrawal et al., Microsoft Research
- DistServe: Disaggregating Prefill and Decoding — Zhong et al., OSDI 2024
- Paged Attention (vLLM Design Documentation) — vLLM project
- TensorRT-LLM Overview — NVIDIA
- Text Generation Inference (TGI) Documentation — Hugging Face
- How Continuous Batching Enables 23x Throughput in LLM Inference — Anyscale (secondary)
Inference optimization — in one page
Tech Jacks Solutions · AI Knowledge Hub · educational summary
Two phases
Inference splits into prefill (reads the whole prompt in parallel; compute-bound; sets TTFT — time to first token) and decode (generates one token at a time; memory-bandwidth-bound; sets TPOT — time per output token).
The KV cache
Stores past tokens' key/value vectors so they aren't recomputed each step. It grows linearly with sequence length and batch size and is usually the largest non-weight user of GPU memory. PagedAttention pages it like OS memory to cut fragmentation; MQA/GQA, eviction (H2O), and quantization (KIVI) shrink it.
Continuous batching
Static batching wastes GPU time waiting on the longest sequence. Continuous batching (iteration-level / in-flight scheduling, from Orca) refills batch slots per decode step, raising throughput. Chunked prefill (SARATHI) and disaggregation (DistServe) stop prefill and decode from starving each other.
Speculative decoding
A small draft model proposes several tokens; the large target model verifies them in parallel (cheap because decode is bandwidth-bound). The output distribution is preserved exactly. Medusa uses extra decoding heads instead of a draft model.
Exact vs. approximate
Exact (speculative decoding, PagedAttention, FlashAttention) change speed/memory only — outputs unchanged. Approximate (KV eviction, low-bit quantization) shrink memory but can affect quality.