How to Run Open-Source LLMs: Prerequisites, Serving and Token Management
Running an open-source model is not one decision. It is a stack: the GPU you put it on, the engine that serves it, the precision you quantize it to, the tokens you push through it, and the lifecycle you keep it healthy in. Get those right and a fine-tuned 8B model can do real work for a fraction of frontier-API cost. Get them wrong and you pay for idle GPUs while a managed API would have been cheaper and calmer.
The honest version: self-hosting open weights gives you data control and low per-token cost at steady volume. It also hands you an operations job. This guide names both sides, and points out where a managed or frontier API is simply the better call.
Prerequisites: Hardware, GPU and MLOps
The first question is whether to self-host at all. If you have steady, high-volume traffic and a reason to keep data in your own environment, self-hosting an open-weight model (one whose trained weights you can download, though its license may still restrict how you use them) pays off. If your volume is spiky or small, a metered API is usually cheaper and far less work. Decide that before you buy a GPU. If you are still building the business case, our breakdown of why teams choose open-source AI models covers the control, cost, and data-residency reasons that justify the operational effort.
Put governance around how your team uses AI. The AI Acceptable Use Policy: a deploy-ready template that sets the rules for AI use.
Your purchase helps keep our hubs free to read.
When you do self-host, model weights load into GPU memory (VRAM) at startup, and the size of the model sets the floor for what hardware you need. The grounded sizing is qualitative: small models in the 1B to 7B range run on consumer GPUs or newer CPUs, medium models from 7B to 70B run on mid-tier cards such as an NVIDIA A10 or L4, and large models of 70B and up need high-end A100 or H200 class hardware.
For a back-of-envelope number, a common rule of thumb (a planning approximation, not a precise figure) is that weight memory in gigabytes is roughly the parameter count in billions times the bytes per parameter: about 2 bytes at 16-bit and about half a byte at 4-bit. Add 20 to 30 percent on top for the KV cache (the cached attention keys and values that grow with context length and can consume 40 to 60 percent of GPU memory) and runtime overhead. So a 7B model wants roughly 14GB at 16-bit or about 4GB at 4-bit, and a 70B model wants roughly 140GB at 16-bit or about 35GB at 4-bit. Treat those as starting points and measure your own workload.
The cost picture is concrete. A single NVIDIA H100 rents for about $2 to $3 per hour on the major clouds, which is roughly $1,500 to $2,200 per month if you keep it running around the clock. A production cluster for a 70B model often needs 4 to 8 GPUs, pushing infrastructure cost to $6,000 to $17,000 or more per month. That is before you account for the people who keep it running. These are living-data figures: GPU rental and per-token rates move with provider, region, and contract, so verify the current numbers on the provider's pricing page before committing budget.
The MLOps side is the part people underestimate. You need someone who can provision GPUs, manage drivers and CUDA, run a serving engine, watch memory, batch requests, roll out model updates, and respond when a node falls over. That is the real prerequisite. Use the checklist below to gauge readiness honestly.
Ways to Serve: vLLM, TGI, Ollama, LLM Gateways
Once the model fits in memory, something has to serve it over an API. The vanilla path of plain PyTorch or Hugging Face Transformers wastes hardware: serving one request at a time leaves 70 to 90 percent of GPU cores idle. Production serving engines fix that with smarter batching and memory management. Here is how the main options compare.
| Option | Best for | Key trait | OpenAI-compatible |
|---|---|---|---|
| vLLM | Production throughput | PagedAttention plus continuous batching, near drop-in for most Hugging Face models | Yes |
| TensorRT-LLM | Max throughput on NVIDIA | About 2 to 3 times the throughput, more setup | Via server wrapper |
| Hugging Face TGI | HF-native serving | Supported for local and deployed self-hosting | Yes |
| Ollama | Local, single node | One model on one machine with almost no setup | Yes |
| LLM gateway | Routing across engines and APIs | One interface in front of self-hosted plus frontier models | Yes |
The engines: vLLM, TGI and Ollama
vLLM is the default choice for serving at scale. It uses PagedAttention to manage the KV cache in non-contiguous pages, the way an operating system pages memory, which cuts fragmentation and lets you run much larger batches. It is the easiest drop-in replacement for standard inference and exposes an OpenAI-compatible server, so existing clients keep working. TensorRT-LLM can push 2 to 3 times the throughput on NVIDIA hardware, at the cost of more setup. Hugging Face TGI is a solid HF-native server for both local and deployed setups. Ollama is the simple end of the spectrum: a single model on a single machine, ideal for local development or one box, not for high-concurrency production.
The fastest way to actually get a model running is Ollama. Install it, pull a model, and chat with it in three commands. No API keys, no gated downloads, and it is the recommended starting point if this is your first local model.
On Windows, download the installer from ollama.com instead of the shell script. The pulled model name and tag may differ; check the Ollama model library.
When you outgrow a single box and need production throughput, move to vLLM, which exposes an OpenAI-compatible server you can scale and put a gateway in front of.
Commands are illustrative. Check the current vLLM docs for exact flags and your chosen model identifier.
Two failures hit almost everyone on the first vLLM run. First, gated weights: a repo like meta-llama/Llama-3-8B-Instruct sits behind a Hugging Face access wall, so the download returns a 401 until you request access on the model card and pass a Hugging Face token, or pull an open mirror that needs no approval. Second, out-of-memory: an 8B model in 16-bit wants roughly 16GB of VRAM, so a smaller card will OOM. Drop to a quantized build, lower --max-model-len, or move to a larger GPU.
The gateway layer
A gateway is a reverse proxy that sits between your application and a mix of self-hosted engines and frontier APIs, exposing one OpenAI-compatible interface. Tools such as LiteLLM, OpenRouter, Portkey and Kong translate every upstream model into a standard response shape, retry failed calls with exponential backoff, fall back to a secondary model or provider, enforce per-key rate limits and spend budgets, centralize observability into stacks like Langfuse, MLflow, Helicone or OpenTelemetry, redact sensitive data, and cache repeated requests. The practical payoff is routing: send lightweight high-volume work to a cheap self-hosted model and route complex multi-step reasoning to a frontier API, all behind the same endpoint. See our LLM gateways hub for the routing tools in depth.
Quantization and Memory Footprint
Quantization shrinks a model by storing its weights, and often its KV cache, in lower precision. Moving from FP32 to INT8 or FP16 cuts memory needs by 2 to 4 times with minimal quality impact, which is why most production deployments quantize by default. At scale, quantization has cut inference cost (the expense of running a trained model to get outputs) by up to 40 percent, and one production case dropped from $0.10 to $0.06 per request after combining quantization with a serverless runtime.
Lower precision is what turns a model you cannot afford into one you can. A 70B model that needs roughly 140GB at 16-bit drops to roughly 35GB at 4-bit by the rule of thumb above, which is the difference between a multi-GPU cluster and a single high-memory card. Runtimes lean on formats and frameworks such as GGUF, ONNX and TensorRT to run those quantized weights efficiently across different hardware.
| Precision | Bytes / param | 7B (approx) | 70B (approx) | Note |
|---|---|---|---|---|
| FP32 | 4 | ~28GB | ~280GB | Rarely needed for inference |
| FP16 / BF16 | 2 | ~14GB | ~140GB | Common full-quality baseline |
| INT8 | 1 | ~7GB | ~70GB | 2 to 4x memory cut, minimal impact |
| 4-bit | ~0.5 | ~4GB | ~35GB | Biggest savings, test quality on task |
An honest caveat: the gigabyte figures above are a planning rule of thumb (parameters times bytes per parameter, plus overhead), not measured results, and they exclude the KV cache that grows with context length and batch size. We also do not claim exact quality differences between formats like GPTQ, AWQ and GGUF, since those format-level comparisons are beyond the scope here. Benchmark your own task before committing to a precision.
Token Management and Throughput
A token is roughly 4 characters of English text, about three quarters of a word, so a 100-word paragraph is around 133 tokens. Tokens are the unit of both cost and speed, which makes managing them the core of running a model well. Inference runs in two phases. Prefill processes all input tokens in parallel and is fast. Decode generates output tokens one at a time and is slow, because the bottleneck is memory bandwidth: the GPU re-reads the weights and the KV cache for every single token it produces. The practical consequence is that trimming output length helps latency more than trimming input.
Throughput, measured in tokens per second or queries per second, comes from batching. Without it, 70 to 90 percent of the GPU sits idle. Two patterns matter. Dynamic batching collects requests over a short window, for example 100ms, or until a target batch size, then processes them together. Continuous batching goes further and schedules at the iteration level, so a slow request does not block faster ones behind it. Larger batches raise throughput but add roughly 20 percent latency, so there is a dial to tune against your latency target.
The KV cache is where token management meets memory. It can consume 40 to 60 percent of GPU memory on long-context work, and when it overflows, throughput collapses. The fixes are storing the cache in lower precision (INT8 or FP8) and using PagedAttention to manage it without fragmentation. The single most effective token technique, though, is prompt caching.
Prompt caching reuses the computed KV tensors of a static prefix, so a repeated system prompt or tool definition skips the expensive prefill phase. The rule that makes it work: order content most stable to least stable, because changing any block invalidates that block and everything after it. Put tool definitions and the system prompt first, and the live user query last. For multi-step agent loops, this is not optional. It is the difference between a workable cost and a runaway one.
Closing the Frontier Gap (Fine-Tuning, RAG, Routing)
A general open model will not match a frontier model out of the box. The point of self-hosting is not to win on raw intelligence, it is to win on your specific task at your specific cost. Three techniques close most of the gap, and they stack.
The pattern that works in practice is a fine-tuned small model, grounded with RAG over your own data, served behind a gateway that routes the genuinely hard queries elsewhere. For a concrete example, a self-hosted Llama 3-8B runs around $0.20 per 1M tokens, versus $15 to $30 per 1M input tokens for a frontier API. If you can do the job with the 8B on most traffic, the savings are the whole reason to self-host. To pick a base model, start with our guide to the best open-source AI models and the open-source vs frontier total cost of ownership breakdown.
Lifecycle: Updates, Eval and Monitoring
Shipping a served model is the start, not the finish. Open models update often, your data drifts, and a quantization that looked fine in a demo can underperform under real traffic. Three lifecycle habits keep the system honest.
Updates. Providers change weights and pricing continuously. Gateways offer stable aliases, for example a name like chat-latest, that point to the newest release so your application code does not change when you upgrade. Pin a version when you need reproducibility, and use the alias when you want the latest.
Evaluation. Hold a representative test set and score every candidate model and quantization against it before promotion. Gateways add a network-layer check: Portkey ships eval templates and Kong has a model-as-judge plugin that rates completions inline. This is a gateway-level check, not an endorsement of a specific open-source eval framework, so treat that choice as your own decision.
Monitoring. Centralize traces, error rates, latencies and token counts. Audit logs and OpenTelemetry span attributes give you the per-request view you need to catch a regression before users do, and to attribute cost back to the workload that caused it.
The hardest lifecycle decision is admitting when self-hosting is the wrong call. These are the cases where a managed or frontier API wins, and pretending otherwise just buys you an outage.