Applied & Agentic lesson

Track · Agentic · Advanced · ~10 min

Making models serve faster

A trained model is only half the battle. Serving it to thousands of users quickly and cheaply is its own engineering discipline. This lesson covers the ideas behind fast inference — the KV cache, the prefill-vs-decode split, continuous batching, and speculative decoding — and lets you toggle each optimization on a serving-throughput sandbox right here on the page.

01 One request, two very different phases

Generating an answer from a language model happens in two phases, and they behave so differently that almost every optimization targets one or the other. Prefill reads your entire prompt at once: every input token is processed in parallel in a single pass, which keeps the GPU's math units busy — it is compute-bound. Decode then generates the reply one token at a time, feeding each new token back in to produce the next. Each decode step does very little math but has to re-read the model's weights from memory, so it is starved for memory bandwidth rather than compute. That single split — a fast, compute-heavy prefill followed by a slow, bandwidth-limited decode — is the key to understanding everything else here.
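The compute-bound versus memory-bound split can be made concrete with a back-of-the-envelope estimate. The sketch below is only a rough model: it assumes about 2 FLOPs per parameter per token, assumes the weights are read from memory once per forward pass, and ignores attention cost and KV-cache reads. The `Hardware` numbers and the 7-billion-parameter model are hypothetical values chosen for illustration, not a real device or benchmark.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    flops_per_s: float  # peak math throughput (hypothetical value below)
    bytes_per_s: float  # peak memory bandwidth (hypothetical value below)


def pass_time(n_params, tokens_in_pass, hw, bytes_per_param=2):
    """Rough time for one forward pass that processes `tokens_in_pass` tokens."""
    # Assumption: roughly 2 FLOPs per parameter for every token in the pass.
    compute_s = 2 * n_params * tokens_in_pass / hw.flops_per_s
    # Assumption: every weight is read from memory once per pass.
    memory_s = n_params * bytes_per_param / hw.bytes_per_s
    bound = "compute" if compute_s > memory_s else "memory"
    return {"compute_s": compute_s, "memory_s": memory_s, "bound": bound}


hw = Hardware(flops_per_s=300e12, bytes_per_s=2e12)
prefill = pass_time(7e9, tokens_in_pass=2048, hw=hw)  # whole prompt in one pass
decode = pass_time(7e9, tokens_in_pass=1, hw=hw)      # one new token per pass

print("prefill is", prefill["bound"], "bound")
print("decode is", decode["bound"], "bound")
```

Under these assumptions the prefill pass spends far longer on math than on reading weights, while a single decode step spends almost all of its time reading weights, which is the asymmetry the rest of the lesson builds on.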

Prefill (compute-bound)

Processes the whole prompt in parallel in one forward pass. Saturates GPU compute even at small batch sizes.

Governs TTFT — time to first token.

Decode (memory-bound)

Generates output tokens one at a time, autoregressively. Low compute use; limited by reading weights from memory.

Governs TPOT — time per output token.

Prefill is parallel and compute-bound; it sets how long you wait for the first token (TTFT).
Decode is sequential and memory-bandwidth-bound; it sets the speed of every following token (TPOT).
Most serving optimizations make sense only once you know which phase they help.
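A simple way to see how the two metrics combine is the usual first-order latency model: the first token costs TTFT and every later token costs TPOT. The function name and the example timings below are illustrative assumptions, and the model treats TPOT as constant, which real systems only approximate.

```python
def request_latency(ttft_s, tpot_s, output_tokens):
    """End-to-end latency: the first token costs TTFT, each later token costs TPOT."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + (output_tokens - 1) * tpot_s


# Hypothetical timings: 250 ms to first token, 20 ms per following token.
short_reply = request_latency(ttft_s=0.25, tpot_s=0.02, output_tokens=11)
long_reply = request_latency(ttft_s=0.25, tpot_s=0.02, output_tokens=501)

print(f"short reply: {short_reply:.2f} s, long reply: {long_reply:.2f} s")
```

For short replies TTFT is a large share of the wait, while for long replies TPOT dominates, so which phase to optimize depends on the workload.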

02 The KV cache: don't redo work you've already done

During decode, the model's attention step needs the key and value vectors for every token seen so far. Re-computing those from scratch at every step would be enormously wasteful, so they are stored in a key-value (KV) cache and reused. The catch: the cache grows linearly with the sequence length and the number of requests in the batch, and it is usually the largest non-weight consumer of GPU memory during serving. Whoever controls KV-cache memory controls how many requests you can serve at once.
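The linear growth is easy to check with arithmetic. The sketch below uses the standard size estimate for a decoder-only transformer: one key vector and one value vector per layer, per key/value head, per token. The model shape (32 layers, 32 heads, head dimension 128, 16-bit values) and the 40 GiB budget are hypothetical, and real engines add overhead that this estimate ignores.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Approximate KV-cache size: keys and values for every layer, head, and token."""
    # The leading 2 accounts for storing both a key and a value vector.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


def max_batch_size(memory_budget_bytes, n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """How many requests of length `seq_len` fit in a given cache budget."""
    per_request = kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, 1, bytes_per_value)
    return memory_budget_bytes // per_request


GIB = 2**30
base = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)
longer = kv_cache_bytes(32, 32, 128, seq_len=8192, batch_size=8)
# Fewer key/value heads shrink the cache proportionally.
fewer_kv_heads = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=8)

print(base / GIB, longer / GIB, fewer_kv_heads / GIB)
print(max_batch_size(40 * GIB, 32, 32, 128, seq_len=4096))
```

Doubling the sequence length doubles the cache, and the number of requests that fit in a fixed budget falls accordingly, which is why cache size translates so directly into serving capacity.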

Two families of techniques help. The first changes how the cache is laid out: PagedAttention (the core of the vLLM engine) splits the cache into fixed-size blocks that can live non-contiguously in memory, the same trick operating systems use for virtual-memory paging, which cuts fragmentation and lets requests share blocks. The second shrinks the cache itself: multi-query and grouped-query attention (MQA/GQA) let many query heads share a smaller set of key/value heads, while eviction policies (H2O) and low-precision quantization of the cached keys and values (KIVI) trade a little quality for a much smaller footprint. A smaller cache leaves room for larger batches.

The KV cache stores past key/value vectors so the model never recomputes them — but it grows with sequence length and batch size.
PagedAttention manages the cache like OS memory pages: less fragmentation, flexible sharing across requests.
MQA/GQA shrink the cache by sharing K/V heads; eviction and quantization shrink it further, at some quality cost.
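The paging idea can be sketched as a block table that maps each request to a list of physical blocks. This is a toy illustration of the concept only: the class name `PagedKVAllocator` and its methods are hypothetical, it is not vLLM's API, and it leaves out block sharing and the attention kernel that reads from the blocks.

```python
class PagedKVAllocator:
    """Toy block table: each request owns a list of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # request id -> list of block ids
        self.lengths = {}                           # request id -> tokens stored

    def append_token(self, request_id):
        """Reserve room for one more token and return the block that holds it."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # all owned blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.block_size]

    def release(self, request_id):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=4)
alloc.append_token("a")
alloc.append_token("b")      # request b grabs a block in between a's blocks
for _ in range(4):
    alloc.append_token("a")  # a's fifth token spills into a second block

print(alloc.block_tables)
```

Request `a` ends up with two blocks that are not adjacent, and nothing is reserved up front for tokens that may never be generated. That on-demand, non-contiguous allocation is what reduces fragmentation.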

03 Sandbox: toggle the optimizations, watch the numbers move

Below is a single request being served, drawn as a GPU timeline: a short compute-bound prefill block, then a long string of memory-bound decode steps. Flip each optimization on and off and watch the latency and throughput estimates respond. The numbers are illustrative — they show the direction and rough shape of each technique's effect, not a benchmark of any specific model or GPU.

Illustrative estimates — relative effects only, not a benchmark. Real numbers depend on model, hardware, and workload.

KV cache mainly speeds up per-token decode (TPOT) by avoiding recomputation.
Continuous batching packs many requests onto the GPU, lifting throughput far more than single-request latency.
Speculative decoding verifies several drafted tokens per target pass, lowering TPOT without changing the output distribution.

04 Continuous batching: keep the GPU full

Because decode is memory-bandwidth-bound, running a single request leaves most of the GPU idle. The fix is to serve many requests at once. Naive (static) batching waits for a whole group of requests to finish before starting the next group — so the entire batch is held hostage by its longest sequence, wasting GPU time as shorter ones sit done-but-waiting. Continuous batching (also called iteration-level scheduling or in-flight batching, introduced by the Orca system) instead schedules work one decode step at a time: as soon as any sequence finishes, a new request takes its slot. The batch is continuously refilled, GPU idle time drops, and throughput rises sharply — which is why production engines such as vLLM, TensorRT-LLM, and TGI build on this idea.
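The difference between the two scheduling policies shows up even in a tiny simulation. The sketch below counts decode steps only, ignores prefill entirely, and treats each request as a number of output tokens. The function names and the workload are made up for illustration and do not model any particular engine.

```python
def static_batching_steps(lengths, batch_size):
    """Each group of `batch_size` requests runs until its longest member finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """After every decode step, finished sequences leave and waiting requests fill the slots."""
    waiting = list(lengths)
    running = []  # remaining output tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1  # one decode iteration for the whole batch
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: output lengths in tokens, two long requests among short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)

print(static_steps, continuous_steps)
```

In this toy workload the static schedule needs 200 decode steps because each group waits on a 100-token request, while the continuous schedule finishes in 110 because the second long request starts as soon as a slot frees up.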

There is one more wrinkle. Prefill is compute-bound and decode is memory-bound, so mixing them well matters. Chunked prefill (SARATHI) splits a long prompt into chunks and piggybacks ongoing decode steps onto them to keep both resources busy, while disaggregation (DistServe) goes further and runs prefill and decode on different GPUs so they stop interfering with each other and each phase can be tuned independently.
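One way to picture chunked prefill is as a per-iteration token budget: ongoing decodes take one token each, and the rest of the budget is filled with the next slice of a pending prompt. The sketch below is a simplified assumption-laden model, not SARATHI's scheduler. The function name and budget are hypothetical, and the number of decoding requests is held constant for the whole prefill.

```python
def chunked_prefill_schedule(prompt_tokens, decoding_requests, token_budget):
    """Return (prefill_chunk, decode_tokens) per iteration until the prompt is consumed."""
    if token_budget <= decoding_requests:
        raise ValueError("token_budget must leave room for at least one prefill token")
    schedule = []
    remaining = prompt_tokens
    while remaining > 0:
        # Decodes take one token each; the leftover budget goes to the prompt.
        chunk = min(remaining, token_budget - decoding_requests)
        schedule.append((chunk, decoding_requests))
        remaining -= chunk
    return schedule


plan = chunked_prefill_schedule(prompt_tokens=1000, decoding_requests=16, token_budget=256)
for chunk, decodes in plan:
    print(f"prefill chunk: {chunk:4d} tokens, piggybacked decode tokens: {decodes}")
```

The long prompt is spread over several iterations instead of stalling the decoding requests for one large prefill pass, and each iteration carries enough prefill tokens to keep the compute units busy.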

Static batching wastes time: the batch waits for its longest sequence to finish.
Continuous batching refills slots per decode step, cutting idle time and lifting throughput.
Chunked prefill and prefill/decode disaggregation stop the two phases from starving each other.

05 Speculative decoding and exact vs. approximate

Decode is slow because each step produces just one token yet still pays to read the whole model. Speculative decoding exploits this: a small, fast draft model proposes several candidate tokens, and the large target model verifies them all at once in a single forward pass, which in the bandwidth-bound decode regime costs roughly the same as generating one token. A modified rejection-sampling step accepts or replaces each drafted token so that the tokens emitted follow the same distribution the target model would have sampled from on its own; the output distribution is preserved exactly. A related approach, Medusa, drops the separate draft model and instead adds extra decoding heads that predict several future tokens in parallel, verified together with tree-based attention.
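The rejection-sampling step can be sketched for a single token position. The rule shown here is the one described in the speculative decoding and speculative sampling papers listed under Sources: accept the drafted token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalized positive part of p - q. The toy distributions and function names are assumptions for illustration, and the full algorithm (several drafted tokens per pass plus a bonus token) is not shown.

```python
import random


def speculative_step(p, q, rng=random):
    """One position of speculative sampling.

    p: target-model probabilities, q: draft-model probabilities (same vocabulary).
    """
    vocab = range(len(p))
    draft = rng.choices(vocab, weights=q)[0]          # draft model proposes a token
    if rng.random() < min(1.0, p[draft] / q[draft]):  # accept with prob min(1, p/q)
        return draft
    # On rejection, resample from the part of p that q under-covers.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0]


def output_distribution(p, q):
    """Exact distribution of the token returned by speculative_step."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]    # q(x) * min(1, p(x)/q(x))
    reject_mass = 1.0 - sum(accept)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0:                                    # p == q, nothing is ever rejected
        return accept
    return [a + reject_mass * r / total for a, r in zip(accept, residual)]


p = [0.6, 0.3, 0.1]  # toy target distribution
q = [0.2, 0.5, 0.3]  # toy draft distribution
result = output_distribution(p, q)
print(result)
```

Working the probabilities through shows why the method is exact: the accepted mass plus the resampled mass adds back up to p for every token, however poor the draft distribution is. A worse draft only lowers the acceptance rate, not the quality.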

One distinction is worth holding onto. Some optimizations are exact: speculative decoding, PagedAttention, and FlashAttention (an IO-aware attention kernel that tiles work to minimize slow memory reads) change speed and memory use only, and compute the same result as the unoptimized model up to floating-point rounding. Others are approximate: KV-cache eviction (H2O) and low-bit quantization (KIVI) shrink memory by discarding or compressing information, which can shift quality. Knowing which bucket a technique falls into tells you whether to expect identical answers or a quality trade-off.

Speculative decoding drafts several tokens cheaply and verifies them in parallel; the output distribution is exactly the target model's.
Medusa reaches the same goal with extra decoding heads instead of a separate draft model.
Exact methods (speculative decoding, PagedAttention, FlashAttention) never change outputs; approximate methods (eviction, quantization) can.
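How much speculative decoding helps depends on how often drafted tokens are accepted. The sketch below uses the expected-tokens formula from the speculative decoding paper (Leviathan et al.), which assumes each drafted token is accepted independently with the same probability. The function name is hypothetical, and the estimate ignores the cost of running the draft model.

```python
def expected_tokens_per_target_pass(acceptance_rate, draft_len):
    """Expected tokens emitted per target-model pass.

    Assumes each of the `draft_len` drafted tokens is accepted independently
    with probability `acceptance_rate`; one extra token always comes from the
    target model itself.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if acceptance_rate == 1.0:
        return draft_len + 1.0
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


for rate in (0.0, 0.5, 0.8, 1.0):
    tokens = expected_tokens_per_target_pass(rate, draft_len=4)
    print(f"acceptance {rate:.1f}: {tokens:.2f} tokens per target pass")
```

With no accepted drafts the target model still emits one token per pass, so output is never slower in tokens per target pass. The gain grows with the acceptance rate, which is why a draft model that agrees with the target matters more than a long draft.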

07 Continue learning

Concept map

A bird's-eye view of inference optimization, with the key ideas from this lesson grouped by branch.

Two phases: prefill vs. decode

Prefill processes the whole prompt in parallel and is compute-bound, saturating the GPU even at small batch sizes.
Decode generates tokens one at a time and is memory-bandwidth-bound with low compute utilization.
Each phase has its own latency metric: time to first token (TTFT) for prefill, time per output token (TPOT) for decode.

The KV cache

Key/value tensors from previous tokens are cached so they aren't recomputed each decoding step.
It grows linearly with sequence length and batch size and is usually the dominant non-weight consumer of GPU memory.
Footprint shrinks via MQA/GQA (shared K/V heads), eviction of all but recent + heavy-hitter tokens (H2O), and quantization (KIVI 2-bit) — smaller caches allow larger batches.

Memory & attention tricks

PagedAttention partitions the KV cache into fixed-size blocks (OS-paging analogy), cutting fragmentation and enabling block sharing — the core of vLLM.
FlashAttention is an IO-aware, exact attention algorithm that tiles to minimize HBM↔SRAM transfers; FlashAttention-2 improves parallelism.

Continuous batching

Also called iteration-level or in-flight batching: schedules work per decode iteration, not per whole request.
New requests join the running batch as sequences finish, instead of waiting for the longest one (the waste in static batching).
Origin: Orca (OSDI'22); used in vLLM, TensorRT-LLM, and TGI. SARATHI chunks prefill onto decode batches; DistServe disaggregates the two phases onto different GPUs.

Speculative decoding & exact vs. approximate

A small draft model proposes several tokens; the large model verifies them in parallel, and rejection sampling preserves the exact output distribution.
Medusa adds extra decoding heads to predict future tokens without a separate draft model, verifying candidates with tree attention.
Speculative decoding, FlashAttention, and PagedAttention are exact; KV eviction (H2O) and low-bit quantization are approximations that can affect quality — don't conflate them.

Sources & further reading

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts grounded in the primary references below. Speedup and throughput figures in the papers are measured on specific models, hardware, and workloads — treat any single number as illustrative and read it in the context of its source. The serving-engine landscape moves quickly, so feature claims are tied to the cited docs and may change between releases.

Efficient Memory Management for LLM Serving with PagedAttention — Kwon et al., SOSP 2023
FlashAttention: Fast and Memory-Efficient Exact Attention — Dao et al., NeurIPS 2022
Orca: A Distributed Serving System for Transformer-Based Generative Models — Yu et al., OSDI 2022
Fast Inference from Transformers via Speculative Decoding — Leviathan et al., ICML 2023
Accelerating LLM Decoding with Speculative Sampling — Chen et al., DeepMind
Medusa: Simple LLM Inference Acceleration with Multiple Decoding Heads — Cai et al., 2024
GQA: Training Generalized Multi-Query Transformer Models — Ainslie et al., EMNLP 2023
H2O: Heavy-Hitter Oracle for Efficient Generative Inference — Zhang et al., NeurIPS 2023
SARATHI: Efficient LLM Inference with Chunked Prefills — Agrawal et al., Microsoft Research
DistServe: Disaggregating Prefill and Decoding — Zhong et al., OSDI 2024
Paged Attention (vLLM Design Documentation) — vLLM project
TensorRT-LLM Overview — NVIDIA
Text Generation Inference (TGI) Documentation — Hugging Face
How Continuous Batching Enables 23x Throughput in LLM Inference — Anyscale (secondary)

Educational use only. This lesson is a conceptual introduction to LLM inference optimization. The latency and throughput figures in the interactive sandbox are illustrative estimates that show the direction and rough shape of each technique's effect — they are not benchmarks of any specific model, GPU, or serving engine. Real numbers depend heavily on hardware and workload. Serving-engine features change frequently; always verify implementation details against the official documentation linked above before building on them. Nothing here is professional engineering or performance advice.

Inference optimization — in one page

Tech Jacks Solutions · AI Knowledge Hub · educational summary

Two phases

Inference splits into prefill (reads the whole prompt in parallel; compute-bound; sets TTFT — time to first token) and decode (generates one token at a time; memory-bandwidth-bound; sets TPOT — time per output token).

The KV cache

Stores past tokens' key/value vectors so they aren't recomputed each step. It grows linearly with sequence length and batch size and is usually the largest non-weight user of GPU memory. PagedAttention pages it like OS memory to cut fragmentation; MQA/GQA, eviction (H2O), and quantization (KIVI) shrink it.

Continuous batching

Static batching wastes GPU time waiting on the longest sequence. Continuous batching (iteration-level / in-flight scheduling, from Orca) refills batch slots per decode step, raising throughput. Chunked prefill (SARATHI) and disaggregation (DistServe) stop prefill and decode from starving each other.

Speculative decoding

A small draft model proposes several tokens; the large target model verifies them in parallel (cheap because decode is bandwidth-bound). The output distribution is preserved exactly. Medusa uses extra decoding heads instead of a draft model.

Exact vs. approximate

Exact (speculative decoding, PagedAttention, FlashAttention) change speed/memory only — outputs unchanged. Approximate (KV eviction, low-bit quantization) shrink memory but can affect quality.
