Agentic learning lesson

Track · Agentic Intermediate ~8 min

LLM routing & gateways

One model is rarely the right answer for every request. A router decides which model should answer each query; a gateway is the layer that actually runs it and handles auth, caching, and failover. Learn how the two fit together, and see how a router trades cost against quality.

01 Two different jobs: the gateway and the router

These two words get used interchangeably in marketing, but they do different jobs. An LLM gateway (also called an AI gateway or LLM proxy) is infrastructure that sits between your application and one or more model providers. It exposes a single API — usually OpenAI-compatible — and adds the cross-cutting plumbing every app needs: key management, load balancing, retries and fallbacks, caching, rate limiting, logging, and guardrails. Tools like LiteLLM, Portkey, Cloudflare AI Gateway, and Kong are gateways. A router is narrower: it is the decision component that chooses which model should handle a given request, weighing cost, quality, and latency. Research routers like RouteLLM and managed ones like Amazon Bedrock Intelligent Prompt Routing are about that decision.

The cleanest way to hold the distinction: the gateway is the execution and observability layer; the router is the model-selection policy. A gateway often includes simple routing (ordering, weighted load balancing, fallbacks), and many products do both — but the responsibilities are separate, and they compose. A common pattern is a smart router choosing the model and a gateway running it, logging it, and falling back if it fails.

Gateway = the control plane

One endpoint in front of many providers. Handles auth, caching, rate limits, fallbacks, and logging so every team isn't re-solving the same plumbing. Per the LiteLLM, OpenRouter, Cloudflare, and Kong docs.

Router = the decision policy

For each request, the router predicts which model gives the best cost / quality / latency outcome and sends the request there. Per the RouteLLM, Not Diamond, and Bedrock prompt routing docs.

A gateway standardises access and reliability across providers behind one API.
A router picks the model per request to optimise cost, quality, or latency.
They are complementary — a router is the policy, a gateway is the layer that executes and observes it.
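The split between policy and execution can be sketched in a few lines. This is a minimal illustration, not any product's API: the `Gateway` class, the `length_router` rule, and the two lambda backends are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Gateway:
    """Execution and observability layer: runs the chosen model and logs the call."""
    backends: dict[str, Callable[[str], str]]  # model name -> callable (stand-ins)
    log: list[tuple[str, str]] = field(default_factory=list)

    def run(self, model: str, prompt: str) -> str:
        reply = self.backends[model](prompt)
        self.log.append((model, prompt))  # record which model served which prompt
        return reply


def length_router(prompt: str) -> str:
    """Model-selection policy: a toy rule that treats long prompts as hard."""
    return "strong" if len(prompt.split()) > 12 else "cheap"


gateway = Gateway(backends={
    "cheap": lambda p: f"[cheap] {p[:20]}",
    "strong": lambda p: f"[strong] {p[:20]}",
})


def handle(prompt: str) -> str:
    model = length_router(prompt)      # the router decides
    return gateway.run(model, prompt)  # the gateway executes and logs
```

Swapping `length_router` for a smarter policy changes nothing in `Gateway`, which is the point of keeping the two responsibilities separate.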

02 Why route at all? The cost–quality trade-off

Routing exists because no single model is best for everything. Stronger models are more capable but cost more and are often slower; smaller models are cheaper and faster but weaker on hard problems. A lot of real traffic is easy — and sending an easy question to a top-tier model wastes money. The core idea behind routing is to predict how hard each query is (or how big the quality gap between models would be) and send easy queries to a cheap model and hard queries to a strong one. That moves you along a tunable cost–quality frontier: you choose how much quality you're willing to trade for savings.

How much can this save? Studies and vendors report meaningful numbers on specific datasets — RouteLLM reports more than a 2× cost reduction in some cases without measurable quality loss; the Hybrid LLM paper reports up to roughly 40% fewer large-model calls with no quality drop; FrugalGPT reports up to around 98% cost reduction while matching the best single model on certain tasks; and Amazon claims up to about 30% cost reduction for Bedrock prompt routing. Treat all of these as best-case, reported results on particular workloads — not guarantees for your traffic.

Capability, cost, and latency pull in different directions — routing balances them per request.
Most savings come from sending the easy majority of queries to a cheaper model.
Reported savings (2×, ~40%, ~98%, ~30%) are dataset-specific best cases — verify on your own traffic.
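To make the trade-off tangible, here is a small sketch of threshold routing. It assumes each request already has a predicted difficulty between 0 and 1; the difficulty values and the per-call credit costs are invented for illustration and are not real prices.

```python
CHEAP_COST, STRONG_COST = 1.0, 10.0  # illustrative credits per call, not real prices


def route(difficulty: float, threshold: float) -> str:
    """Send a query to the strong model only if its predicted difficulty exceeds the threshold."""
    return "strong" if difficulty > threshold else "cheap"


def total_cost(difficulties: list[float], threshold: float) -> float:
    """Sum the per-call cost of a stream under one threshold."""
    return sum(
        STRONG_COST if route(d, threshold) == "strong" else CHEAP_COST
        for d in difficulties
    )


# Hypothetical predicted difficulties for ten requests; most are easy.
stream = [0.1, 0.2, 0.15, 0.3, 0.25, 0.1, 0.4, 0.8, 0.9, 0.6]

always_strong = total_cost(stream, threshold=-1.0)  # everything escalates
routed = total_cost(stream, threshold=0.5)          # only the hard tail escalates
```

With these made-up numbers, always using the strong model costs 100 credits and routing at a threshold of 0.5 costs 37, because only three of the ten requests escalate. Raising or lowering the threshold is the "tunable" part of the frontier: a higher threshold saves more and risks more weak answers.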

03 See it work: route a stream of requests

Here is the idea made concrete. Requests of varying difficulty arrive at the gateway, and a routing policy decides where each one goes. Each policy makes a different trade: cost-optimized sends easy and medium queries to a cheap small model and escalates only the hard ones; quality-optimized escalates anything non-trivial to the strong model; latency-optimized favours the fastest path (the small model and cache hits) and accepts lower quality. Run the same stream under each policy and cost, average quality, and average latency land in different places. Cache hits skip the model entirely, and the gateway fails over to a backup provider when a call errors. The numbers here are illustrative: chosen to show the mechanism, not to quote any provider's prices.

Cost, quality, and latency trade off against each other — the same request stream lands in a different place under each policy, and the policy decides which axis you optimise.
Caching answers repeat questions without calling any model, cutting both cost and latency (a gateway feature).
Failover to a backup provider on an error is a reliability feature, separate from cost/quality routing.
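The three policies and the cache can be approximated in a short simulation. This is a sketch under stated assumptions: the model table, the 5 ms cache lookup, and the request stream are invented, and failover is left out to keep it small.

```python
# Illustrative (cost, quality, latency_ms) per model; invented values, not provider prices.
MODELS = {"cheap": (1, 0.70, 200), "strong": (10, 0.95, 900)}

# Each policy maps a difficulty level ("easy", "medium", "hard") to a model.
POLICIES = {
    "cost": lambda level: "strong" if level == "hard" else "cheap",
    "quality": lambda level: "cheap" if level == "easy" else "strong",
    "latency": lambda level: "cheap",
}


def simulate(policy: str, requests: list[tuple[str, str]]) -> dict:
    """Route (prompt, difficulty) pairs under one policy; repeats are served from cache."""
    cache: dict[str, float] = {}
    cost, latency, hits = 0, 0, 0
    qualities: list[float] = []
    for prompt, level in requests:
        if prompt in cache:  # cache hit: no model call, so no model cost
            hits += 1
            latency += 5     # assumed cache lookup time in ms
            qualities.append(cache[prompt])
            continue
        model_cost, quality, ms = MODELS[POLICIES[policy](level)]
        cost += model_cost
        latency += ms
        qualities.append(quality)
        cache[prompt] = quality
    n = len(requests)
    return {
        "cost": cost,
        "avg_quality": sum(qualities) / n,
        "avg_latency": latency / n,
        "cache_hits": hits,
    }


stream = [("hi", "easy"), ("summarise", "medium"), ("prove it", "hard"), ("hi", "easy")]
```

On this four-request stream the cost policy spends 12 credits, the quality policy 21, and the latency policy 3, and each run gets one cache hit from the repeated prompt.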

04 Five ways a router decides

"Routing" isn't one technique. Real systems use several, and they're often combined. Each strategy below chooses a model in a different way and shows up in different products.

Cascade — try a cheap model first, escalate if needed

Send the query to a cheap model, then run a confidence or quality check. If it passes, you're done cheaply; if it fails, escalate to a stronger model. This is the LLM cascade described in FrugalGPT.

idea: cheap model answers → check quality → only escalate the ones that fail
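A cascade might look like the sketch below. The confidence score returned by each model is an assumption made for illustration; FrugalGPT uses its own learned scoring function, which this does not reproduce. The `cheap` and `strong` functions are hypothetical stand-ins for real model calls.

```python
from typing import Callable

Model = Callable[[str], tuple[str, float]]  # returns (answer, confidence in [0, 1])


def cascade(prompt: str, models: list[Model], min_confidence: float = 0.8) -> tuple[str, int]:
    """Try models from cheapest to strongest; stop at the first answer that passes the check.

    Returns the answer and the number of model calls made.
    """
    answer = ""
    for calls, model in enumerate(models, start=1):
        answer, confidence = model(prompt)
        if confidence >= min_confidence:
            return answer, calls
    return answer, len(models)  # nothing passed: keep the last (strongest) answer


# Stand-in models: the cheap one is only confident on short prompts.
def cheap(prompt: str) -> tuple[str, float]:
    return "cheap answer", (0.9 if len(prompt) < 30 else 0.4)


def strong(prompt: str) -> tuple[str, float]:
    return "strong answer", 0.95
```

Note the cost shape: an escalated query pays for both calls, so a cascade only saves money when most queries pass the check at the cheap stage.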

Predictive / learned — decide before any call

A trained classifier predicts the best model per query before making a single model call, using patterns learned from data (e.g., preference data). RouteLLM, Hybrid LLM, and Not Diamond work this way.

idea: a small model reads the query → predicts strong-vs-weak → routes once
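As a toy version of a learned router, the sketch below fits a tiny logistic-regression classifier and routes on its predicted probability. Everything here is an assumption for illustration: the three hand-made features, the keyword list, and the four labelled prompts. It is not the method of RouteLLM, Hybrid LLM, or Not Diamond, which train on far richer data.

```python
import math

HARD_WORDS = {"prove", "derive", "refactor", "optimise"}  # invented keyword list


def features(prompt: str) -> list[float]:
    """Bias term, scaled word count, and a hard-keyword flag."""
    words = prompt.lower().split()
    return [1.0, len(words) / 20.0, float(any(w in HARD_WORDS for w in words))]


def predict(prompt: str, w: list[float]) -> float:
    """Probability that the prompt needs the strong model."""
    z = sum(wi * xi for wi, xi in zip(w, features(prompt)))
    return 1 / (1 + math.exp(-z))


def train(data: list[tuple[str, int]], epochs: int = 2000, lr: float = 0.5) -> list[float]:
    """Fit weights by stochastic gradient descent (label 1 = needs the strong model)."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for prompt, label in data:
            error = label - predict(prompt, w)
            w = [wi + lr * error * xi for wi, xi in zip(w, features(prompt))]
    return w


def route(prompt: str, w: list[float], threshold: float = 0.5) -> str:
    """One decision, made before any model is called."""
    return "strong" if predict(prompt, w) >= threshold else "weak"


# Tiny hand-labelled set, invented for illustration.
DATA = [
    ("what is the capital of France", 0),
    ("translate hello to Spanish", 0),
    ("prove that the square root of two is irrational", 1),
    ("refactor this module to remove the circular import and keep the public API stable", 1),
]
weights = train(DATA)
```

The `threshold` argument plays the same role as the cost–quality dial described earlier: lowering it sends more traffic to the strong model.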

Semantic — route by what the prompt is about

Turn the prompt into an embedding, compare it to descriptions of each target (or model) stored in a vector database, and route by similarity above a threshold. This is how Kong's AI Proxy Advanced does semantic routing.

idea: embed prompt → nearest target by meaning → route there
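The embed, compare, threshold flow can be shown without any external service. In this sketch a bag-of-words count vector stands in for a real embedding model and a plain dict stands in for the vector database; the target names and descriptions are hypothetical, and this is not Kong's configuration format.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts. Real systems call an embedding model here."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical targets, each described in words and embedded once up front.
TARGETS = {
    "code-model": "write debug refactor python code function bug",
    "math-model": "solve equation integral proof algebra calculate",
}
INDEX = {name: embed(description) for name, description in TARGETS.items()}


def semantic_route(prompt: str, threshold: float = 0.2, default: str = "general-model") -> str:
    """Route to the most similar target, or to the default if nothing clears the threshold."""
    query = embed(prompt)
    best, score = max(
        ((name, cosine(query, vec)) for name, vec in INDEX.items()),
        key=lambda pair: pair[1],
    )
    return best if score >= threshold else default
```

The threshold matters: without it, a prompt that resembles no target would still be sent to whichever target happened to score highest.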

Conditional — route on request metadata

Rules decide based on request parameters or metadata: user tier, region, model version, or a guardrail outcome. Deterministic and easy to reason about; Portkey calls this conditional routing, and it composes with fallbacks and load balancing.

idea: if user is "enterprise" → premium model; else → standard model
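Conditional routing is the easiest strategy to write down, since it is an ordered list of rules. The metadata fields and model names below are invented for illustration, and the rule format is a plain Python sketch rather than Portkey's actual configuration syntax.

```python
# Ordered rules: the first condition that matches wins.
RULES = [
    (lambda m: m.get("guardrail") == "flagged", "safe-model"),
    (lambda m: m.get("tier") == "enterprise", "premium-model"),
    (lambda m: m.get("region") == "eu", "eu-hosted-model"),
]
DEFAULT_MODEL = "standard-model"


def conditional_route(metadata: dict) -> str:
    """Return the model for the first matching rule, or the default."""
    for condition, model in RULES:
        if condition(metadata):
            return model
    return DEFAULT_MODEL
```

Because the rules are evaluated in order, an enterprise request whose guardrail check was flagged still goes to `safe-model`; rule order is part of the policy.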

Load balancing — spread traffic for availability

Distribute requests across multiple model deployments by weight (weights typically normalised to 100%) so no single backend is a bottleneck or single point of failure. Supported by LiteLLM, Portkey, and OpenRouter.

idea: 70% to deployment A, 30% to B → higher availability
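Weighted load balancing can be sketched with the standard library's `random.choices`. The deployment names and the 7:3 weights are illustrative; real gateways add health checks and other signals that this omits.

```python
import random
from collections import Counter


def normalise(weights: dict[str, float]) -> dict[str, float]:
    """Scale weights so they sum to 1.0 (that is, 100%)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: w / total for name, w in weights.items()}


def pick(weights: dict[str, float], rng: random.Random) -> str:
    """Choose one deployment at random, in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


shares = normalise({"deployment-a": 7, "deployment-b": 3})
rng = random.Random(0)  # seeded so the run is repeatable
counts = Counter(pick(shares, rng) for _ in range(10_000))
```

Over 10,000 picks, `deployment-a` receives close to 70% of the traffic. Individual requests are still random, so short windows can deviate noticeably from the configured split.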

05 Failover, evaluation, and managed options

Two more pieces complete the picture. First, fallbacks: when a request fails — a connection error, a 404, a 429 rate-limit, or a timeout — the gateway routes to an alternate model or provider. Implementations use ordered lists with per-order retries before escalating to the next option (LiteLLM order-based fallbacks, OpenRouter model fallbacks, Portkey fallbacks). This is about reliability, and it's worth keeping mentally separate from cost/quality routing: one keeps you up when a provider has a bad day, the other saves money on normal days.
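An ordered fallback list with retries can be sketched as follows. `ProviderError` and the target callables are hypothetical; this shows the general control flow, not the exact behaviour of LiteLLM, OpenRouter, or Portkey.

```python
from typing import Callable, Optional


class ProviderError(Exception):
    """Hypothetical error carrying an HTTP-like status such as 404 or 429."""

    def __init__(self, status: Optional[int] = None):
        super().__init__(f"provider failed (status={status})")
        self.status = status


def call_with_fallbacks(
    prompt: str,
    targets: list[tuple[str, Callable[[str], str]]],
    retries_per_target: int = 2,
) -> tuple[str, str]:
    """Walk the ordered targets, retrying each one before moving to the next.

    Returns (target_name, reply). Raises RuntimeError if every target fails.
    """
    last_error: Optional[Exception] = None
    for name, call in targets:
        for _attempt in range(1 + retries_per_target):
            try:
                return name, call(prompt)
            except (ProviderError, TimeoutError, ConnectionError) as err:
                last_error = err  # a real gateway would usually back off before retrying
    raise RuntimeError("all targets failed") from last_error
```

With the default of two retries, a failing primary is attempted three times before the backup is tried once. Nothing in this loop looks at cost or quality, which is why it is worth treating as a separate concern from routing.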

Second, how do you know a router is any good? Because there was no standard way to compare routers, RouterBench introduced a benchmark, a theoretical framework, and a dataset of over 405,000 inference outcomes so routers can be compared on the cost–quality Pareto frontier. If you'd rather not build routing yourself, there are managed and commercial options: Amazon Bedrock Intelligent Prompt Routing (serverless; predicts per-request quality and routes within a single model family; optimised for English; GA in 2025), and commercial routers like Not Diamond (pre-trained and custom routers) and Martian (an OpenAI-compatible router/gateway with max-cost controls and failover).

Fallbacks ≠ cost routing

Failover triggers on errors (connection, 404, 429, timeout) and walks an ordered list with retries. It's a reliability mechanism, not a savings one.

Watch the constraints

Bedrock routes only within one model family and is tuned for English. Model lists and prices change constantly — treat any specific counts or names as point-in-time.

Fallbacks = ordered alternates + retries on errors; a reliability feature distinct from cost/quality routing.
RouterBench gives a standard way to compare routers on the cost–quality frontier (405k+ outcomes).
Managed routers (Bedrock, Not Diamond, Martian) trade flexibility for less to build — mind their constraints.
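Comparing routers on a cost–quality frontier comes down to finding the options that no other option beats on both axes. The sketch below computes that set for a handful of invented results; the router names and numbers are hypothetical, and this is a generic Pareto filter rather than RouterBench's own metric or code.

```python
def pareto_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    """Return names of entries not dominated on (cost, quality), sorted by cost.

    An entry is dominated if another is at least as cheap and at least as good,
    and strictly better on one of the two.
    """
    def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
        return a[0] <= b[0] and a[1] >= b[1] and (a[0] < b[0] or a[1] > b[1])

    keep = [
        name for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]
    return sorted(keep, key=lambda name: points[name][0])


# Invented (cost, quality) results for four options on one workload.
results = {
    "always-cheap": (1.0, 0.70),
    "always-strong": (10.0, 0.95),
    "router-a": (4.0, 0.93),
    "router-b": (6.0, 0.90),  # costs more than router-a and scores lower
}
frontier = pareto_frontier(results)
```

Here `router-b` drops out because `router-a` is both cheaper and better, leaving `always-cheap`, `router-a`, and `always-strong` on the frontier. Which of those to deploy is then a question of how much quality you need.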

Concept map

A bird's-eye view of LLM routing and gateways, with the key ideas from this lesson grouped by branch.

Gateway vs. router

A gateway (LiteLLM, Portkey, Cloudflare, Kong) is middleware exposing one OpenAI-compatible API and adding auth, caching, rate limiting, and observability.
A router is the decision component that chooses which model or provider handles a given request.
The two are complementary and often composed: router as the policy, gateway as the execution/observability layer.

Why route: the cost-quality trade-off

No single model is optimal for all tasks: stronger models are more capable but pricier/slower; weaker ones are cheaper/faster.
A router predicts query difficulty and sends easy queries to a cheap model, hard queries to a strong one.
This trades quality for cost along a tunable Pareto frontier (RouteLLM, Hybrid LLM, FrugalGPT, RouterBench).

Five ways a router decides

Cascade: try a cheap model first, escalate if a quality check fails (FrugalGPT).
Predictive/learned: a trained classifier picks the best model per query before any call (RouteLLM, Hybrid LLM, Not Diamond).
Semantic: embed the prompt and route by similarity to target descriptions (Kong).
Conditional: rules route on metadata like user tier or region (Portkey).
Load balancing: distribute requests across deployments by weight.

Failover & evaluation

Fallbacks route to an alternate model/provider on errors (connection, 404, 429, timeouts) using ordered lists with per-order retries — a reliability feature distinct from cost/quality routing.
RouterBench (arXiv 2403.12031) is a standard benchmark with a 405k+ inference-outcome dataset for comparing routers on the cost-quality frontier.

Managed & commercial routers

Amazon Bedrock Intelligent Prompt Routing: serverless, predicts per-request quality and routes within one model family (GA 2025, English-optimized).
Not Diamond offers pre-trained and custom routers with Pareto optimization; Martian is an OpenAI-compatible router/gateway with max-cost controls and failover.
Vendor cost-savings figures are best-case or marketing claims — treat as reported, not guaranteed.

Sources & further reading

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below. Cost-savings figures (RouteLLM >2×, FrugalGPT up to ~98%, Hybrid LLM ~40%, Bedrock ~30%) are best-case results on specific datasets or vendor claims and are attributed as such; the simulator's numbers are illustrative.

RouteLLM: Learning to Route LLMs with Preference Data — Ong et al. (LMSYS / UC Berkeley)
FrugalGPT: How to Use LLMs While Reducing Cost — Chen, Zaharia & Zou (Stanford)
Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing — Ding, Mallick et al. (Microsoft)
RouterBench: A Benchmark for Multi-LLM Routing System — Hu, Bieker et al. (Martian)
Router & Load Balancing — LiteLLM
Fallbacks (Proxy Reliability) — LiteLLM
Conditional Routing — Portkey AI
AI Gateway Overview — Cloudflare
Intelligent Prompt Routing — Amazon Bedrock
What is Model Routing? — Not Diamond

Responsible use

This is an educational explainer. Routers, gateways, and the cost/quality figures cited can change as models and pricing evolve — verify against the linked primary sources before making architecture or spend decisions. The interactive simulator uses invented credit values to illustrate the mechanism and does not reflect any provider's actual pricing or model behaviour. AI systems can produce plausible-sounding but incorrect output; for decisions with real consequences, validate results and apply appropriate expertise. See the NIST AI Risk Management Framework for governance guidance.

LLM routing & gateways — in 8 minutes

Tech Jacks Solutions · AI Knowledge Hub · educational summary

Gateway vs. router

A gateway (LiteLLM, Portkey, Cloudflare, Kong) is one API in front of many providers, adding auth, caching, fallbacks, rate limits, and logging. A router (RouteLLM, Not Diamond, Bedrock) decides which model handles each request. Gateway = execution layer; router = decision policy. They compose.

Why route

No model is best for everything: strong models cost more, small models are cheaper. Route easy queries to a cheap model and hard ones to a strong model, trading quality for cost along a tunable frontier. Reported savings (RouteLLM >2×, FrugalGPT up to ~98%, Hybrid LLM ~40%, Bedrock ~30%) are best-case, dataset-specific claims.

How routers decide

Cascade: try cheap first, escalate on a failed quality check. Predictive: a trained classifier picks the model before any call. Semantic: embed the prompt, route by similarity. Conditional: route on metadata (tier, region). Load balancing: spread by weight for availability.

Reliability & evaluation

Fallbacks retry an ordered list of alternates on errors (connection, 404, 429, timeout) — a reliability feature, not cost routing. RouterBench (405k+ outcomes) compares routers on the cost-quality frontier. Managed routers (Bedrock, Not Diamond, Martian) trade flexibility for less to build — mind constraints like Bedrock's single-family, English-optimized routing.
