Gallery

Contacts

405 W. Greenlawn Ave Lansing, Michigan 48910

contact@techjacksolutions.com

+1-616-320-4064

Open Source

How to Run Open-Source LLMs: Prerequisites, Serving and Token Management

Guide. Figures verified June 30, 2026. GPU and cost inputs are living-data and re-verified quarterly or on a major open-model release.

Running an open-source model is not one decision. It is a stack: the GPU you put it on, the engine that serves it, the precision you quantize it to, the tokens you push through it, and the lifecycle you keep it healthy in. Get those right and a fine-tuned 8B model can do real work for a fraction of frontier-API cost. Get them wrong and you pay for idle GPUs while a managed API would have been cheaper and calmer.

The honest version: self-hosting open weights gives you data control and low per-token cost at steady volume. It also hands you an operations job. This guide names both sides, and points out where a managed or frontier API is simply the better call.


Prerequisites: Hardware, GPU and MLOps

The first question is whether to self-host at all. If you have steady, high-volume traffic and a reason to keep data in your own environment, self-hosting an open-weight model (one whose trained weights you can download, though its license may still restrict how you use them) pays off. If your volume is spiky or small, a metered API is usually cheaper and far less work. Decide that before you buy a GPU. If you are still building the business case, our breakdown of why teams choose open-source AI models covers the control, cost, and data-residency reasons that justify the operational effort.

When you do self-host, model weights load into GPU memory (VRAM) at startup, and the size of the model sets the floor for what hardware you need. The grounded sizing is qualitative: small models in the 1B to 7B range run on consumer GPUs or newer CPUs, medium models from 7B to 70B run on mid-tier cards such as an NVIDIA A10 or L4, and large models of 70B and up need high-end A100 or H200 class hardware.

For a back-of-envelope number, a common rule of thumb (a planning approximation, not a precise figure) is that weight memory in gigabytes is roughly the parameter count in billions times the bytes per parameter: about 2 bytes at 16-bit and about half a byte at 4-bit. Add 20 to 30 percent on top for the KV cache (the cached attention keys and values that grow with context length and can consume 40 to 60 percent of GPU memory) and runtime overhead. So a 7B model wants roughly 14GB at 16-bit or about 4GB at 4-bit, and a 70B model wants roughly 140GB at 16-bit or about 35GB at 4-bit. Treat those as starting points and measure your own workload.

40-60%
GPU memory the KV cache can use on long context
70-90%
GPU cores idle without batching
$1.5K-2.2K
Per H100, per month, running 24/7
2-4x
Memory cut from INT8 or FP16 vs FP32

The cost picture is concrete. A single NVIDIA H100 rents for about $2 to $3 per hour on the major clouds, which is roughly $1,500 to $2,200 per month if you keep it running around the clock. A production cluster for a 70B model often needs 4 to 8 GPUs, pushing infrastructure cost to $6,000 to $17,000 or more per month. That is before you account for the people who keep it running. These are living-data figures: GPU rental and per-token rates move with provider, region, and contract, so verify the current numbers on the provider's pricing page before committing budget.

The MLOps side is the part people underestimate. You need someone who can provision GPUs, manage drivers and CUDA, run a serving engine, watch memory, batch requests, roll out model updates, and respond when a node falls over. That is the real prerequisite. Use the checklist below to gauge readiness honestly.

Self-Hosting Readiness Checklist
Steady, high-volume traffic. Self-hosting only beats a metered API when your GPUs stay busy. Spiky or low volume favors a managed API.
GPU access sized to the model. Consumer cards for 1B to 7B, mid-tier A10 or L4 for 7B to 70B, A100 or H200 class for 70B and up.
Someone to own MLOps. Drivers, CUDA, the serving engine, autoscaling, monitoring, and incident response are an ongoing job, not a one-time setup.
License clarity. Confirm whether your model is OSI open-source, open-weight under a restrictive community license, or source-available. Read the actual model card before you build on it.
A real data reason. Private VPC, data residency, or workload isolation. If you do not have one, the operational case for self-hosting weakens.

Ways to Serve: vLLM, TGI, Ollama, LLM Gateways

Once the model fits in memory, something has to serve it over an API. The vanilla path of plain PyTorch or Hugging Face Transformers wastes hardware: serving one request at a time leaves 70 to 90 percent of GPU cores idle. Production serving engines fix that with smarter batching and memory management. Here is how the main options compare.

OptionBest forKey traitOpenAI-compatible
vLLMProduction throughputPagedAttention plus continuous batching, near drop-in for most Hugging Face modelsYes
TensorRT-LLMMax throughput on NVIDIAAbout 2 to 3 times the throughput, more setupVia server wrapper
Hugging Face TGIHF-native servingSupported for local and deployed self-hostingYes
OllamaLocal, single nodeOne model on one machine with almost no setupYes
LLM gatewayRouting across engines and APIsOne interface in front of self-hosted plus frontier modelsYes

The engines: vLLM, TGI and Ollama

vLLM is the default choice for serving at scale. It uses PagedAttention to manage the KV cache in non-contiguous pages, the way an operating system pages memory, which cuts fragmentation and lets you run much larger batches. It is the easiest drop-in replacement for standard inference and exposes an OpenAI-compatible server, so existing clients keep working. TensorRT-LLM can push 2 to 3 times the throughput on NVIDIA hardware, at the cost of more setup. Hugging Face TGI is a solid HF-native server for both local and deployed setups. Ollama is the simple end of the spectrum: a single model on a single machine, ideal for local development or one box, not for high-concurrency production.

The fastest way to actually get a model running is Ollama. Install it, pull a model, and chat with it in three commands. No API keys, no gated downloads, and it is the recommended starting point if this is your first local model.

Quick start: Ollama (beginner path)# Install Ollama (macOS / Linux), then pull and run a model curl -fsSL https://ollama.com/install.sh | sh ollama pull llama3.1:8b ollama run llama3.1:8b # Ollama also exposes an OpenAI-compatible API at http://localhost:11434/v1

On Windows, download the installer from ollama.com instead of the shell script. The pulled model name and tag may differ; check the Ollama model library.

When you outgrow a single box and need production throughput, move to vLLM, which exposes an OpenAI-compatible server you can scale and put a gateway in front of.

Production path: vLLM OpenAI-compatible server# Start an OpenAI-compatible endpoint for an open-weight model python -m vllm.entrypoints.openai.api_server \ --model meta-llama/Llama-3-8B-Instruct \ --max-model-len 8192 # Call it with the standard OpenAI client by pointing base_url # at http://localhost:8000/v1

Commands are illustrative. Check the current vLLM docs for exact flags and your chosen model identifier.

Two failures hit almost everyone on the first vLLM run. First, gated weights: a repo like meta-llama/Llama-3-8B-Instruct sits behind a Hugging Face access wall, so the download returns a 401 until you request access on the model card and pass a Hugging Face token, or pull an open mirror that needs no approval. Second, out-of-memory: an 8B model in 16-bit wants roughly 16GB of VRAM, so a smaller card will OOM. Drop to a quantized build, lower --max-model-len, or move to a larger GPU.

The gateway layer

A gateway is a reverse proxy that sits between your application and a mix of self-hosted engines and frontier APIs, exposing one OpenAI-compatible interface. Tools such as LiteLLM, OpenRouter, Portkey and Kong translate every upstream model into a standard response shape, retry failed calls with exponential backoff, fall back to a secondary model or provider, enforce per-key rate limits and spend budgets, centralize observability into stacks like Langfuse, MLflow, Helicone or OpenTelemetry, redact sensitive data, and cache repeated requests. The practical payoff is routing: send lightweight high-volume work to a cheap self-hosted model and route complex multi-step reasoning to a frontier API, all behind the same endpoint. See our LLM gateways hub for the routing tools in depth.

If you want
A model on your laptop
Start with Ollama. Single node, minimal setup, OpenAI-compatible so you can graduate later without rewriting clients.
If you want
Throughput in production
Run vLLM (or TGI) behind an autoscaler. Continuous batching and PagedAttention are what keep your GPUs busy.
If you want
Self-hosted plus frontier together
Put a gateway in front. Route cheap traffic to your own model, hard traffic to a frontier API, with fallback and one bill.

Quantization and Memory Footprint

Quantization shrinks a model by storing its weights, and often its KV cache, in lower precision. Moving from FP32 to INT8 or FP16 cuts memory needs by 2 to 4 times with minimal quality impact, which is why most production deployments quantize by default. At scale, quantization has cut inference cost (the expense of running a trained model to get outputs) by up to 40 percent, and one production case dropped from $0.10 to $0.06 per request after combining quantization with a serverless runtime.

Lower precision is what turns a model you cannot afford into one you can. A 70B model that needs roughly 140GB at 16-bit drops to roughly 35GB at 4-bit by the rule of thumb above, which is the difference between a multi-GPU cluster and a single high-memory card. Runtimes lean on formats and frameworks such as GGUF, ONNX and TensorRT to run those quantized weights efficiently across different hardware.

PrecisionBytes / param7B (approx)70B (approx)Note
FP324~28GB~280GBRarely needed for inference
FP16 / BF162~14GB~140GBCommon full-quality baseline
INT81~7GB~70GB2 to 4x memory cut, minimal impact
4-bit~0.5~4GB~35GBBiggest savings, test quality on task

An honest caveat: the gigabyte figures above are a planning rule of thumb (parameters times bytes per parameter, plus overhead), not measured results, and they exclude the KV cache that grows with context length and batch size. We also do not claim exact quality differences between formats like GPTQ, AWQ and GGUF, since those format-level comparisons are beyond the scope here. Benchmark your own task before committing to a precision.


Token Management and Throughput

A token is roughly 4 characters of English text, about three quarters of a word, so a 100-word paragraph is around 133 tokens. Tokens are the unit of both cost and speed, which makes managing them the core of running a model well. Inference runs in two phases. Prefill processes all input tokens in parallel and is fast. Decode generates output tokens one at a time and is slow, because the bottleneck is memory bandwidth: the GPU re-reads the weights and the KV cache for every single token it produces. The practical consequence is that trimming output length helps latency more than trimming input.

Throughput, measured in tokens per second or queries per second, comes from batching. Without it, 70 to 90 percent of the GPU sits idle. Two patterns matter. Dynamic batching collects requests over a short window, for example 100ms, or until a target batch size, then processes them together. Continuous batching goes further and schedules at the iteration level, so a slow request does not block faster ones behind it. Larger batches raise throughput but add roughly 20 percent latency, so there is a dial to tune against your latency target.

The KV cache is where token management meets memory. It can consume 40 to 60 percent of GPU memory on long-context work, and when it overflows, throughput collapses. The fixes are storing the cache in lower precision (INT8 or FP8) and using PagedAttention to manage it without fragmentation. The single most effective token technique, though, is prompt caching.

Prompt caching reuses the computed KV tensors of a static prefix, so a repeated system prompt or tool definition skips the expensive prefill phase. The rule that makes it work: order content most stable to least stable, because changing any block invalidates that block and everything after it. Put tool definitions and the system prompt first, and the live user query last. For multi-step agent loops, this is not optional. It is the difference between a workable cost and a runaway one.


Closing the Frontier Gap (Fine-Tuning, RAG, Routing)

A general open model will not match a frontier model out of the box. The point of self-hosting is not to win on raw intelligence, it is to win on your specific task at your specific cost. Three techniques close most of the gap, and they stack.

Fine-tuning and LoRA
Retrain a small model on your data and it can beat a generic 70B on the narrow task you care about. LoRA updates only 1 to 10 percent of parameters, cutting training compute by 90 to 99 percent while keeping most of the quality.
Inference cost down 10 to 20x
RAG
Retrieve the few relevant documents and feed them as context, so a small model reasons over your data instead of memorizing it. This lets you deploy models 5 to 10 times smaller for comparable quality.
Per-request cost down 30 to 50%
Routing
Not every query needs your biggest model. A gateway can send simple requests to a cheap self-hosted model and reserve a frontier API for the hard ones, so you only pay top rates when the work demands it.
Right model per request
Prompt caching
Reuse the KV tensors of a stable prefix to skip prefill on repeated context. Pairs with all of the above and is the biggest single lever for agentic, multi-step workloads.
Input cost down up to 90%

The pattern that works in practice is a fine-tuned small model, grounded with RAG over your own data, served behind a gateway that routes the genuinely hard queries elsewhere. For a concrete example, a self-hosted Llama 3-8B runs around $0.20 per 1M tokens, versus $15 to $30 per 1M input tokens for a frontier API. If you can do the job with the 8B on most traffic, the savings are the whole reason to self-host. To pick a base model, start with our guide to the best open-source AI models and the open-source vs frontier total cost of ownership breakdown.


Lifecycle: Updates, Eval and Monitoring

Shipping a served model is the start, not the finish. Open models update often, your data drifts, and a quantization that looked fine in a demo can underperform under real traffic. Three lifecycle habits keep the system honest.

Updates. Providers change weights and pricing continuously. Gateways offer stable aliases, for example a name like chat-latest, that point to the newest release so your application code does not change when you upgrade. Pin a version when you need reproducibility, and use the alias when you want the latest.

Evaluation. Hold a representative test set and score every candidate model and quantization against it before promotion. Gateways add a network-layer check: Portkey ships eval templates and Kong has a model-as-judge plugin that rates completions inline. This is a gateway-level check, not an endorsement of a specific open-source eval framework, so treat that choice as your own decision.

Monitoring. Centralize traces, error rates, latencies and token counts. Audit logs and OpenTelemetry span attributes give you the per-request view you need to catch a regression before users do, and to attribute cost back to the workload that caused it.

The hardest lifecycle decision is admitting when self-hosting is the wrong call. These are the cases where a managed or frontier API wins, and pretending otherwise just buys you an outage.

No team to run it
If you do not have someone to own GPUs, drivers, scaling and incidents, a managed API removes the entire operations burden. Zero-ops is a feature, not a compromise.
Global scale and low latency
A managed edge gateway adds only about 20 to 40ms and handles spikes you would otherwise overprovision for. Matching that with self-hosted regional clusters is expensive and slow to build.
Attack surface
A self-hosted proxy that holds every provider key widens your threat profile. Self-hosted LLM proxies have shipped critical CVEs, including remote code execution (CVE-2024-6825 in LiteLLM). If you self-host one, patch on a schedule and watch advisories.

Guide Progress
0 of 7 sections complete

Frequently Asked Questions

What hardware do I need to run an open-source LLM?
It depends on model size and precision. A common rule of thumb is that weight memory in gigabytes is roughly the parameter count in billions times the bytes per parameter (about 2 bytes at 16-bit, about half a byte at 4-bit), plus 20 to 30 percent for the KV cache and overhead. In practice, sources put small models (1B to 7B) on consumer GPUs or newer CPUs, medium models (7B to 70B) on mid-tier cards like an A10 or L4, and large models (70B and up) on high-end A100 or H200 class hardware.
vLLM vs Ollama: which one should I use?
Use Ollama when you want a single model on one machine with almost no setup, for local development or a single box. Use vLLM when you need production throughput: it uses PagedAttention and continuous batching, is a near drop-in for most Hugging Face models, and exposes an OpenAI-compatible server you can put a gateway in front of.
How much does self-hosting cost compared with a frontier API?
A single NVIDIA H100 rents for about $2 to $3 per hour, roughly $1,500 to $2,200 per month running 24/7. A 70B production cluster often needs 4 to 8 GPUs, so $6,000 to $17,000 or more per month. On a per-token basis, self-hosted Llama 3-8B runs around $0.20 per 1M tokens versus $15 to $30 per 1M input tokens for a frontier API. Self-hosting only pays off at steady, high utilization.
Does quantization hurt model quality?
Moving from FP32 to INT8 or FP16 cuts memory needs by 2 to 4 times with minimal quality impact, and at scale quantization has cut inference cost by up to 40 percent. Lower precisions trade some accuracy for memory, so test on your own task. We do not assert exact quality deltas between GPTQ, AWQ and GGUF, since those format-level comparisons are beyond the scope here.
When should I not self-host an open-source model?
Reach for a managed frontier API when you have no team to run GPUs and Kubernetes, when you need global low-latency scale (a managed edge gateway adds only about 20 to 40ms), or when you want a smaller attack surface. A self-hosted proxy that holds every provider key widens your threat profile, and self-hosted LLM proxies have shipped critical CVEs, including remote code execution. Top-end multimodal or maximum reasoning at any cost can also still favor a frontier model.

Troubleshooting common issues

The weights fit but the KV cache does not. On long context it can take 40 to 60 percent of GPU memory. Lower the max context length, reduce batch size, enable KV cache quantization (INT8 or FP8), or use a PagedAttention engine like vLLM that manages the cache without fragmentation.
You are probably serving requests one at a time, which leaves 70 to 90 percent of cores idle. Switch to an engine with continuous batching (vLLM, TGI, TensorRT-LLM). Expect larger batches to add roughly 20 percent latency, so tune batch size against your latency target.
Turn on prompt caching and order your prompt most stable to least stable. Put tool definitions and the system prompt first and dynamic working memory last. One agent went from a 7 to an 84 percent cache hit rate this way, cutting cost 59 percent. Changing any block invalidates everything after it, so keep the volatile parts at the tail.
Lower precision trades accuracy for memory. Move up a precision (4-bit to INT8, or INT8 to FP16) and re-test on your own held-out set, not a generic benchmark. INT8 or FP16 usually keeps quality within a small margin while still cutting memory 2 to 4 times versus FP32.
Fact-checked against vendor documentation and official sources, June 2026. GPU and per-token costs vary by provider and region. Verify current pricing before committing budget.
NVIDIA, H100, A100, H200, A10 and L4 are trademarks of NVIDIA Corporation. Llama is a trademark of Meta Platforms. Mistral is a trademark of Mistral AI. vLLM, Hugging Face, TGI, Ollama, LiteLLM, OpenRouter, Portkey and Kong are the property of their respective owners. All product names are used for identification only and do not imply endorsement.
Before You Use AI
Your Privacy

Self-hosting open-weight models on your own GPUs means no prompt or document leaves your environment, which is the main privacy reason to run them. Using a managed API or a gateway that forwards to a third party means data is processed on that provider's infrastructure, so read its data-processing terms.

For strict data-residency or regulated workloads, on-premise or private-VPC deployment of an open model is the alternative to a hosted API.

Mental Health & AI Use

Open-source models are tools for building products, not substitutes for professional support. If you are experiencing distress, please reach out to trained professionals:

  • 988 Suicide & Crisis Lifeline: Call or text 988
  • SAMHSA Helpline: 1-800-662-4357
  • Crisis Text Line: Text HOME to 741741

AI systems can produce plausible-sounding but incorrect guidance. For mental health, medical, legal, or financial decisions, always consult a qualified professional.

Your Rights & Our Transparency

Under GDPR and CCPA, you have the right to access, correct, and delete personal data. Tech Jacks Solutions does not sell personal data. This article is independently produced. We have no affiliate relationship with the vendors or open-source projects named here.

AI systems referenced in this article are subject to the EU AI Act and applicable national regulations. Cost and performance figures are sourced from provider documentation and research current as of June 2026 and are version-stamped because they change.