Model serving: getting a trained model to actually answer requests

Inference optimization — KV-cache, batching & latency

Goes one layer down: the techniques inside each replica that make a single served model faster.

LLMOps: monitoring & observability

Once a model is served, how do you watch it in production — latency, errors, drift, and cost?

▸ Goes well with

LLM routing & gateways

The layer in front of your served models: routing requests across providers and versions.

AI cost optimization (FinOps for LLMs)

Serving mode and scaling choices drive cost — this lesson turns those choices into dollars.

⊕Concept map

The moving parts of model serving at a glance — expand each branch to see how the pieces connect.

What serving is

Exposing a trained model behind an API — typically HTTP/REST or gRPC — so applications send inputs and get predictions back.
It decouples the model from the calling application, so each can change independently.
LLM-focused servers (vLLM, TGI) commonly expose OpenAI-compatible Chat/Completions endpoints for drop-in client compatibility.

Serving modes (real-time / async / batch)

Real-time: synchronous, low-latency requests against a persistent endpoint, for interactive workloads.
Async/streaming: queued asynchronous inference for large payloads or long processing, plus token streaming for generative models.
Batch: asynchronous, offline scoring of large accumulated datasets with no persistent endpoint — the cheapest mode, with input/output via data stores.

Scaling & throughput

Dynamic / continuous request batching groups concurrent requests (dynamic batching in Triton; continuous batching in TGI and vLLM).
Replicas plus autoscaling scale capacity up and down with load (Ray Serve, KServe), including scale-to-zero.
GPU-aware execution — paged and flash attention — raises throughput per accelerator.
Trade-off: bigger batches and more replicas raise throughput but add latency and cost.

Deployment patterns

Canary: route a small percentage of traffic to the new version, then increase if it stays healthy.
Blue/green: provision a parallel green fleet with the update, shift traffic over, then retire blue after a baking period.
Shadow: mirror a copy of live traffic to a candidate while only production responses are returned to callers.
A/B: split traffic across versions to compare outcomes.

Continue your path

Where to go next

You just finished Model Serving & Deployment Patterns. Here’s a natural progression — from what builds directly on it to where to go deeper.

Foundations→Language & models→Agentic ✓→Governance

Recommended next

Model Context Protocol

What MCP is, how hosts, clients and servers connect, and why it matters.

Open lesson →

Build on this

Agentic~10 min

AI Agents

+What you’ll learnHide

How agents perceive, reason, use tools and act, and how they differ from chatbots.

Agentic~8 min

RAG

+What you’ll learnHide

How retrieval grounds LLM answers, step by step.

Agentic~7 min

Chatbots

+What you’ll learnHide

How they understand and respond, their limits, and how they differ from agents.

Agentic~8 min

Model cards

+What you’ll learnHide

What they document, why they matter for transparency, and how to read one.

Go deeper

Agentic~13 min

LLM Routing & Gateways

+What you’ll learnHide

Continue with LLM Routing & Gateways.

Agentic~12 min

Inference Optimization (KV-Cache, Batching, Latency)

+What you’ll learnHide

Continue with Inference Optimization (KV-Cache, Batching, Latency).

Agentic~12 min

AI Cost Optimization (FinOps for LLMs)

+What you’ll learnHide

Continue with AI Cost Optimization (FinOps for LLMs).