Model serving: getting a trained model to actually answer requests
Training a model is only half the job. The other half is serving it: putting it behind an API so real traffic can reach it, fast and reliably, without breaking when load spikes or when you ship a new version. Learn the serving modes, the scaling tricks, and the safe-rollout patterns — and drive a live serving stack yourself, right here on the page.
01What "serving" actually means
The AI Governance Charter — establish ownership, scope, and accountability for AI.
Get the charter Browse all templatesYour purchase helps keep our hubs free to read.
Model serving is the practice of exposing a trained model behind an API — usually HTTP/REST or gRPC — so that applications can send it inputs and get predictions back. The key idea is decoupling: the model runs as its own service, and the apps that use it never need to load the model, manage a GPU, or know how it works internally. They just make a network call. A serving framework like TensorFlow Serving wraps a saved model; TorchServe does the same for PyTorch; and LLM-focused servers such as vLLM and Hugging Face's Text Generation Inference (TGI) commonly expose an OpenAI-compatible endpoint, so existing chat/completions clients work as a drop-in.
- A served model is a service, not a file you import — apps reach it over the network.
- The contract is an API: send inputs, get predictions; common shapes are REST and gRPC.
- LLM servers often speak the OpenAI Chat/Completions shape so clients are interchangeable (per vLLM and TGI docs).
02Three ways to serve: real-time, async, batch
Not every prediction needs to come back in 200 milliseconds. Picking the right serving mode for the workload is the single biggest cost-and-latency decision you'll make. The managed platforms each name these slightly differently, but the same three shapes recur.
Synchronous, low-latency requests against a model on a persistent endpoint. The caller waits for the answer. Best for interactive workloads — chat, search, recommendations.
Queued requests for large payloads or long processing, plus token streaming for generative models so words appear as they're produced rather than all at once.
stream:true · vLLM / Ray Serve response streamingOffline scoring of large accumulated datasets — no persistent endpoint at all. Input and output flow through object or data stores. Cheapest per prediction when latency doesn't matter.
If a human is waiting, lean real-time. If a payload is huge or the job is slow, lean async. If you can wait minutes-to-hours for a whole dataset, lean batch — and pay far less.
autoscale to zero between jobs, cutting idle cost.03How serving stacks stay fast under load
A single GPU can only do so much at once. Production serving frameworks squeeze far more throughput out of the same hardware with a handful of recurring techniques — and the next section lets you turn these knobs yourself.
- Request batching. Instead of running one request at a time, the server groups several arriving requests and runs them together, so the GPU does useful work on every cycle. Triton calls this dynamic batching; TGI and vLLM use continuous batching, which keeps slots full as individual requests finish.
- Replicas + autoscaling. Run multiple copies of the model and add or remove them as load changes. Ray Serve and KServe autoscale replicas up and down; some platforms can scale to zero when idle.
- GPU-aware execution. vLLM and TGI use paged / flash attention to use GPU memory efficiently; Triton can run multiple models concurrently on one GPU.
- The trade-off: bigger batches lift throughput but can add queue wait for each request; more replicas cut wait but cost more and, when started cold, add cold-start latency.
04Drive the serving stack
Requests arrive at a gateway / load balancer, which spreads them across model replicas; each replica groups requests into batches before running them on its GPU. Turn the knobs and watch throughput, queue wait, and cold-start respond. Then flip a canary or blue/green rollout to ship a new version safely. The numbers are illustrative — they model the directions real serving stacks move, not any one product's measured figures.
- Throughput is roughly replicas × batch efficiency — until demand outstrips capacity, when requests queue.
- Big batches raise throughput but add queue wait; tiny batches feel snappy but waste the GPU.
- Autoscaling protects you from spikes, but newly-added replicas pay a cold-start cost before they help.
- Canary, blue/green, and shadow change which version sees traffic — covered next.
05Shipping a new version without breaking prod
Replacing a live model all at once is risky — if the new version is worse, every user feels it instantly. These deployment patterns let you roll out a new version gradually and back out fast. One caution: each platform implements these differently (KServe uses traffic splitting; SageMaker uses fleet-based guardrails), so treat the names as concepts, not a single universal mechanism.
Send a small percentage of traffic to the new version, watch its health, then increase if it holds up. Promote by routing all traffic to it.
canaryTrafficPercent · SageMaker canary traffic-shiftingStand up a parallel "green" fleet running the update, shift traffic over from "blue", let it bake, then terminate blue. Instant rollback if green misbehaves.
Send a copy of live traffic to a candidate version for offline comparison, but only the production version's responses go back to callers. Zero user risk.
Split traffic across versions to compare real outcomes (not just health) — which model gets better engagement or accuracy on live users.
06Check your understanding
07Take it with you & go deeper
Inference optimization — KV-cache, batching & latency
Goes one layer down: the techniques inside each replica that make a single served model faster.
Read →LLMOps: monitoring & observability
Once a model is served, how do you watch it in production — latency, errors, drift, and cost?
Read →⊕Concept map
The moving parts of model serving at a glance — expand each branch to see how the pieces connect.
What serving is
- Exposing a trained model behind an API — typically HTTP/REST or gRPC — so applications send inputs and get predictions back.
- It decouples the model from the calling application, so each can change independently.
- LLM-focused servers (vLLM, TGI) commonly expose OpenAI-compatible Chat/Completions endpoints for drop-in client compatibility.
Serving modes (real-time / async / batch)
- Real-time: synchronous, low-latency requests against a persistent endpoint, for interactive workloads.
- Async/streaming: queued asynchronous inference for large payloads or long processing, plus token streaming for generative models.
- Batch: asynchronous, offline scoring of large accumulated datasets with no persistent endpoint — the cheapest mode, with input/output via data stores.
Scaling & throughput
- Dynamic / continuous request batching groups concurrent requests (dynamic batching in Triton; continuous batching in TGI and vLLM).
- Replicas plus autoscaling scale capacity up and down with load (Ray Serve, KServe), including scale-to-zero.
- GPU-aware execution — paged and flash attention — raises throughput per accelerator.
- Trade-off: bigger batches and more replicas raise throughput but add latency and cost.
Deployment patterns
- Canary: route a small percentage of traffic to the new version, then increase if it stays healthy.
- Blue/green: provision a parallel green fleet with the update, shift traffic over, then retire blue after a baking period.
- Shadow: mirror a copy of live traffic to a candidate while only production responses are returned to callers.
- A/B: split traffic across versions to compare outcomes.
Continue your path
Where to go next
You just finished Model Serving & Deployment Patterns. Here’s a natural progression — from what builds directly on it to where to go deeper.
What MCP is, how hosts, clients and servers connect, and why it matters.
Agentic~10 min
AI Agents
+What you’ll learnHide
How agents perceive, reason, use tools and act, and how they differ from chatbots.
Open lesson →
Agentic~8 min
RAG
+What you’ll learnHide
How retrieval grounds LLM answers, step by step.
Open lesson →
Agentic~7 min
Chatbots
+What you’ll learnHide
How they understand and respond, their limits, and how they differ from agents.
Open lesson →
Agentic~8 min
Model cards
+What you’ll learnHide
What they document, why they matter for transparency, and how to read one.
Open lesson →
Agentic~13 min
LLM Routing & Gateways
+What you’ll learnHide
Continue with LLM Routing & Gateways.
Open lesson →
Agentic~12 min
Inference Optimization (KV-Cache, Batching, Latency)
+What you’ll learnHide
Continue with Inference Optimization (KV-Cache, Batching, Latency).
Open lesson →
Agentic~12 min
AI Cost Optimization (FinOps for LLMs)
+What you’ll learnHide
Continue with AI Cost Optimization (FinOps for LLMs).
Open lesson →Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established serving and deployment concepts grounded in the official vendor documentation below; the numbers in the simulator are illustrative and labelled as such. Deployment-pattern names (canary, blue/green, shadow, A/B) are implemented differently across platforms — treat them as concepts, not a single universal mechanism.
- Text Generation Inference (TGI) — Documentation — Hugging Face
- Triton Inference Server — User Guide — NVIDIA
- vLLM — OpenAI-Compatible Server — vLLM Project
- Canary Rollout Example — KServe (CNCF)
- Ray Serve — Autoscaling Guide — Ray / Anyscale
- Serving Models (TFX) — TensorFlow
- TorchServe — Documentation — PyTorch
- Real-time inference — Amazon SageMaker AI — AWS
- Blue/Green Deployments — Amazon SageMaker AI — AWS
- Testing models with shadow variants — SageMaker AI — AWS
- Get inferences beginner's guide — Vertex AI — Google Cloud
- What are batch endpoints? — Azure Machine Learning — Microsoft
- SeldonIO/seldon-core — Seldon
This is an educational lesson, not a vendor recommendation or an engineering specification. No single serving framework is "best" — the right choice depends on your model's framework lineage, your hardware, your latency needs, and your existing infrastructure. The simulator's throughput, latency, and cold-start figures are illustrative: they model the direction real serving stacks move, not measured numbers from any product.
Before relying on specific limits or behaviours (payload caps, autoscaling rules, traffic-shifting modes), verify them against the current official documentation linked above — open-source serving frameworks evolve rapidly and managed-platform features change frequently. Treat any production deployment decision as something to validate in your own environment.
Model serving & deployment patterns — in one page
Tech Jacks Solutions · AI Knowledge Hub · educational summary
What serving is
Exposing a trained model behind an API (HTTP/REST or gRPC) so apps can send inputs and get predictions. The model runs as its own service; callers just make a network call. LLM servers (vLLM, TGI) often expose an OpenAI-compatible endpoint so existing clients work as a drop-in.
Three serving modes
Real-time — synchronous, low-latency, persistent endpoint; for interactive work. Async / streaming — queued for large payloads, plus token streaming for generative output. Batch — offline scoring of large datasets via data stores, no live endpoint; cheapest when latency doesn't matter.
Staying fast under load
Batching groups requests so the GPU works on many at once (Triton dynamic batching; TGI/vLLM continuous batching). Replicas + autoscaling add/remove copies with load (Ray Serve, KServe). Bigger batches raise throughput but add queue wait; new replicas pay a cold-start cost.
Deployment patterns
Canary — small traffic share to the new version, then ramp up. Blue/green — parallel updated fleet, shift traffic, bake, retire old. Shadow — mirror live traffic to a candidate, only production replies reach users. A/B — split traffic to compare real outcomes. Each platform implements these differently.