How to Run Llama Locally: Complete Self-Hosting Guide (2026)
Last verified: May 7, 2026 · Format: Guide · Est. time: 20-25 min
Running Meta's Llama models on your own hardware gives you full control over data privacy, eliminates per-token API costs, and lets you fine-tune on proprietary datasets without sending anything to a third party. The tradeoff is hardware: you need to match the right GPU, RAM, and storage to the model size you want to run.
This guide walks through the complete process, from checking whether your hardware qualifies to running your first local inference. We cover three deployment paths: Ollama for the fastest setup, llama.cpp for maximum control over quantization, and vLLM for production serving. By the end, you will have a working local Llama instance and a clear understanding of when self-hosting makes financial and operational sense versus using a managed API.
What You Need Before Starting
Before installing any software, verify your system meets the minimum hardware requirements for the Llama model size you plan to run. The single biggest factor is GPU VRAM: without sufficient video memory, you will need to rely on CPU inference (much slower) or use aggressive quantization that sacrifices output quality.
VRAM Requirements by Model
| Model | Parameters | VRAM (INT4) | VRAM (FP16) | Context |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | 5 GB | 16 GB | 128K |
| Llama 3.3 70B | 70B | 38 GB | 140 GB | 128K |
| Llama 4 Scout | 109B (17B active) | 58 GB | 218 GB | 10M |
| Llama 4 Maverick | 400B (17B active) | 206 GB | 800 GB | 1M |
VRAM data: Onyx AI Self-Hosted LLM Leaderboard, March 2026. FP16 = 2 bytes/param, INT4 = 0.5 bytes/param. Actual usage 10-20% higher due to KV cache and overhead.
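The table's arithmetic can be sanity-checked for any model size with a short Python sketch. This is a rough weights-only estimate using the bytes-per-parameter figures above plus a 15% allowance for KV cache and overhead, not a measurement of real allocator behavior:
# Rough VRAM estimate: parameters x bytes-per-parameter, plus ~15% overhead
# for KV cache and framework internals (per the note above).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billions: float, precision: str = "int4",
                     overhead: float = 0.15) -> float:
    """Approximate VRAM needed to hold the model weights, in GB."""
    weight_bytes = params_billions * 1e9 * BYTES_PER_PARAM[precision]
    return weight_bytes * (1 + overhead) / 1e9

for name, size_b in [("Llama 3.1 8B", 8), ("Llama 3.3 70B", 70)]:
    print(f"{name}: ~{estimate_vram_gb(size_b, 'int4'):.0f} GB INT4, "
          f"~{estimate_vram_gb(size_b, 'fp16'):.0f} GB FP16")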
System Requirements
- GPU: NVIDIA RTX 3060+ (8 GB VRAM) for 8B models; RTX 4090 or A100 for 70B; H100 cluster for Llama 4
- RAM: 16 GB minimum for 8B; 64-128 GB for 70B; 200 GB+ for 405B/Llama 4
- Storage: 8-16 GB for 8B; 32-64 GB for 70B; 200 GB+ for enterprise models
- OS: Linux (Ubuntu 22.04+), macOS 13+, or Windows 10/11 with WSL2
- CUDA: NVIDIA drivers + CUDA toolkit for GPU acceleration (verify with nvidia-smi)
What this guide covers:
- Step 1: Choose Your Model
- Step 2: Install Ollama
- Step 3: Build with llama.cpp
- Step 4: Deploy with vLLM
- Step 5: Docker Deployment
- Step 6: Tune Performance
- Step 7: Self-Host vs API Decision
License note: The Llama Community License prohibits using model outputs to train competing LLMs and requires a separate commercial license for products exceeding 700 million monthly active users, according to the license agreement. The Open Source Initiative does not recognize Llama's license as meeting the open-source definition due to these restrictions.
Step 1: Choose Your Model
Select the Llama variant that matches your hardware and use case. Meta currently maintains four active model generations.
For Learning and Prototyping
Llama 3.1 8B or Llama 3.2 3B run on consumer hardware with 8 GB+ VRAM and deliver solid general-purpose performance. The 8B variant scores 48.3 on MMLU-Pro and 80.4 on IFEval, according to the Onyx AI leaderboard. These are the models to start with if you are new to local inference.
For Production Workloads
Llama 3.3 70B is an instruct-tuned model with a 128K context window that scores 68.9 on MMLU-Pro and 92.1 on IFEval, according to the Onyx AI leaderboard. It requires workstation-class hardware but delivers strong reasoning and instruction-following capabilities.
For Enterprise and Research
Llama 4 Scout (109B total, 17B active per token) uses a Mixture-of-Experts architecture with 16 experts and a 10-million token context window, according to Meta. It fits on a single NVIDIA H100 GPU using INT4 quantization. Llama 4 Maverick (400B total, 17B active per token) uses 128 experts with a 1-million token context window and requires a multi-GPU cluster, according to Meta.
Recommendation: Start with Llama 3.1 8B via Ollama. It runs on most modern GPUs, downloads in minutes, and gives you a working mental model of local inference before scaling up.
Step 2: Install Ollama (Fastest Path)
Ollama is the easiest way to get Llama running locally. It handles model downloading, quantization, and serving in a single tool with a built-in RESTful API. Ollama provides optimized model loading, memory management, and cross-platform compatibility, according to Local AI Zone.
Installation
macOS and Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com and run the setup wizard.
Download and Run a Model
# Download and start Llama 3.1 8B (defaults to 4-bit quantization)
ollama run llama3.1
# List all downloaded models
ollama list
# Pull a specific model without starting chat
ollama pull llama3.1
# Remove a model from disk
ollama rm llama3.1
Ollama defaults to 4-bit quantization, so actual VRAM usage is closer to the INT4 figures in the table above rather than FP16.
Verify the Setup
After running ollama run llama3.1, type a test query at the prompt:
>>> What is GGUF quantization?
If you receive a coherent response within a few seconds, your setup is working correctly.
Use the Ollama API
Ollama serves an API on http://localhost:11434 by default. Test it with:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Explain the difference between INT4 and FP16 quantization in one paragraph."
}'
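The same endpoint is easy to call from Python. Here is a minimal sketch using the requests library (assumes pip install requests; setting "stream": false makes Ollama return a single JSON object instead of streaming newline-delimited chunks):
import requests

# Ask the local Ollama server for one non-streamed completion.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Explain INT4 vs FP16 quantization in one paragraph.",
        "stream": False,  # one JSON object instead of NDJSON chunks
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])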
Verification: Run ollama list to confirm the model downloaded correctly. Check GPU usage with nvidia-smi to verify the model loaded onto your GPU rather than falling back to CPU.
Step 3: Build with llama.cpp (Advanced Users)
For maximum control over quantization, context length, and inference parameters, build and run llama.cpp directly. Created by Georgi Gerganov and released on March 10, 2023, llama.cpp is a C++ re-implementation that enables efficient inference on consumer CPUs and GPUs. The project introduced the GGUF file format, a binary format that stores quantized tensors and metadata for cross-platform efficiency.
Compile from Source
# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Recent llama.cpp versions build with CMake (the old Makefile path is deprecated)
# CPU-only build
cmake -B build
cmake --build build --config Release
# NVIDIA GPU build (requires the CUDA toolkit)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
Download GGUF Models
GGUF-format models are available on Hugging Face. Key quantization options:
- Q8_0: ~1 byte/param, 95%+ quality retention. Best for research and professional applications.
- Q4_K_M: ~0.5 bytes/param, 85-90% quality retention. Most popular for general use (recommended).
- Q2_K: ~0.25 bytes/param, 70-80% quality retention. For extremely constrained hardware.
Quality retention percentages: Local AI Zone (October 2025)
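One convenient way to fetch a GGUF file is the huggingface_hub Python package (pip install huggingface_hub). The sketch below is illustrative only; the repository and filename are example values, so substitute the exact repo and quantization you pick on Hugging Face:
from huggingface_hub import hf_hub_download

# Download one quantization variant into ./models.
# Repo and filename are examples -- browse Hugging Face for the exact names.
path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",   # example repo
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",     # example file
    local_dir="models",
)
print(f"Saved to {path}")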
Run Inference
./build/bin/llama-cli -m models/llama-3.1-8b-instruct-q4_k_m.gguf \
-p "Explain the difference between INT4 and FP16 quantization" \
-n 512 \
--ctx-size 4096
Key parameters:
- -m: Path to the GGUF model file
- -p: Your prompt text
- -n: Maximum tokens to generate
- --ctx-size: Context window size (higher values use more VRAM)
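If you would rather drive llama.cpp from Python, the community llama-cpp-python bindings expose the same knobs. A minimal sketch, assuming pip install llama-cpp-python and the Q4_K_M file downloaded earlier:
from llama_cpp import Llama

# n_ctx mirrors --ctx-size, n_gpu_layers mirrors -ngl / --n-gpu-layers.
llm = Llama(
    model_path="models/llama-3.1-8b-instruct-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # -1 offloads every layer to the GPU if it fits
)

out = llm(
    "Explain the difference between INT4 and FP16 quantization.",
    max_tokens=512,
)
print(out["choices"][0]["text"])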
Step 4: Deploy with vLLM (Production Serving)
For production environments requiring high throughput and an OpenAI-compatible API, vLLM provides optimized serving. According to Local AI Zone, vLLM includes PagedAttention for efficient memory management, continuous batching for improved throughput, and tensor parallelism for large model deployment. According to Meta, vLLM was one of the key community projects they partnered with to ensure production deployment readiness for Llama 3.1.
Installation
pip install vllm
Start the API Server
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--tensor-parallel-size 1
This spins up an OpenAI-compatible endpoint on http://localhost:8000. You can query it with standard OpenAI SDK calls, making vLLM a drop-in replacement for API-based workflows.
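Because the endpoint speaks the OpenAI protocol, the standard OpenAI Python SDK works against it unchanged. A minimal sketch (assumes pip install openai; the api_key value is a placeholder, since the local server does not check it unless you start it with --api-key):
from openai import OpenAI

# Point the SDK at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    max_tokens=200,
)
print(completion.choices[0].message.content)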
Multi-GPU for Larger Models
# 70B model across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2
Step 5: Docker Deployment
Docker provides consistent deployment across environments and simplifies GPU passthrough. According to Local AI Zone, containerization is recommended for production deployments.
Ollama via Docker
# CPU-only deployment
docker run -d -v ollama:/root/.ollama -p 11434:11434 \
--name ollama ollama/ollama
# With NVIDIA GPU support
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 \
--name ollama ollama/ollama
vLLM via Docker
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3.1-8B-Instruct
Tip: For Docker with NVIDIA GPUs, ensure the nvidia-container-toolkit package is installed and configured. Verify with docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi.
Step 6: Tune Performance
Once your model is running, adjust these parameters to balance speed, quality, and resource usage.
Quantization Tradeoffs
| Quantization | Size/Param | Quality | Best For |
|---|---|---|---|
| FP16 (none) | 2 bytes | 100% | Research, maximum quality |
| Q8_0 | ~1 byte | 95%+ | Professional applications |
| Q4_K_M | ~0.5 bytes | 85-90% | General use (recommended) |
| Q2_K | ~0.25 bytes | 70-80% | Extremely limited hardware |
Source: Local AI Zone (October 2025)
Context Length and Memory
Increasing context length (--ctx-size in llama.cpp) consumes additional VRAM for the KV cache. Start with 4096 tokens and increase incrementally while monitoring GPU memory with nvidia-smi.
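To get a feel for how much memory the KV cache adds, you can work the numbers directly: the cache stores one key and one value vector per layer, per KV head, per token. The sketch below assumes Llama 3.1 8B's architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an FP16 cache; treat the result as an approximation, since runtimes add their own overhead:
def kv_cache_gb(ctx_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB for one sequence at full context."""
    # 2x for keys and values, per layer, per KV head, per token.
    total = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    return total / 1e9

for ctx in (4096, 32768, 131072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.1f} GB KV cache")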
CPU Offloading
If your GPU VRAM is insufficient for the full model, both llama.cpp and Ollama support partial CPU offloading. In llama.cpp, use the --n-gpu-layers flag (-ngl for short):
# Offload 20 transformer layers to GPU, rest on CPU
./llama-cli -m model.gguf -ngl 20 -p "Your prompt here"
More layers on GPU means faster inference but higher VRAM usage. Find the maximum number of layers your GPU can handle, then set -ngl accordingly.
Step 7: When to Self-Host vs Use Cloud API
Self-hosting is not always the right call. Use this framework to decide.
Self-Host When
- Regulated industries: Healthcare, finance, or defense environments requiring data sovereignty where data cannot leave your infrastructure
- High volume: Your usage exceeds approximately 50,000 requests per day at 1,000 tokens each, according to BenchLM estimates; beyond that point, API costs begin to exceed fixed hardware costs (see the breakeven sketch after this list)
- Custom fine-tuning: You need to train on proprietary data that cannot be shared with API providers
- Airgapped operation: Environments with no internet access or strict network isolation requirements
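The breakeven claim above is simple arithmetic you can rerun with your own numbers: monthly token volume times the per-token API price versus amortized hardware plus power and operations. Every price in the sketch below is an assumed illustration, not a quote:
def monthly_api_cost(requests_per_day: int, tokens_per_request: int,
                     usd_per_million_tokens: float) -> float:
    """Approximate monthly spend on a managed API at a flat per-token rate."""
    tokens_per_month = requests_per_day * tokens_per_request * 30
    return tokens_per_month / 1e6 * usd_per_million_tokens

def monthly_selfhost_cost(hardware_usd: float, amortize_months: int = 36,
                          power_and_ops_usd: float = 300.0) -> float:
    """Hardware amortized over its useful life plus power/ops."""
    return hardware_usd / amortize_months + power_and_ops_usd

api = monthly_api_cost(50_000, 1_000, usd_per_million_tokens=0.50)  # assumed rate
local = monthly_selfhost_cost(hardware_usd=15_000)                  # assumed GPU server
print(f"API: ${api:,.0f}/mo vs self-host: ${local:,.0f}/mo")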
Use Managed APIs When
- Low to moderate volume: Under approximately 1 million tokens per day
- Rapid prototyping: You need to iterate quickly without infrastructure setup
- No ML team: Your organization lacks dedicated machine learning infrastructure expertise
- Maximum quality: Proprietary models like GPT-4o or Claude still lead on some reasoning benchmarks
Cloud providers including AWS Bedrock, Google Vertex AI, and Azure AI offer managed Llama deployments that eliminate the hardware investment while maintaining access to the Llama model family.
Troubleshooting
- Out of memory (CUDA OOM): Reduce --n-gpu-layers so fewer transformer layers sit on the GPU and the rest run on CPU. Verify current GPU memory usage with nvidia-smi. If running Ollama, try a smaller model variant first.
- Model not found in Ollama: Run ollama list to check available models. If the model is not listed, pull it with ollama pull llama3.1. Verify your internet connection and that the Ollama service is running in the background.
- vLLM fails to load the model: Check your GPU in nvidia-smi output and compare against vLLM's compatibility matrix. Ensure sufficient VRAM for the model plus 10-20% overhead for KV cache and framework internals.

Running Llama locally is now accessible to anyone with a modern GPU. Ollama gets you from zero to inference in under five minutes. llama.cpp gives you granular control over quantization and memory allocation. vLLM provides the production-grade serving layer for team and enterprise deployments. The key decision is matching your model size to available hardware, using quantization to bridge the gap when full-precision weights exceed your VRAM budget.
Start with Llama 3.1 8B on Ollama to learn the workflow, then scale up to 70B or Llama 4 Scout as your hardware and requirements grow. The self-hosting ecosystem around Llama is mature, and switching between tools and quantization levels is straightforward, giving you the flexibility to optimize for cost, quality, or latency depending on your workload.
Self-hosted Llama models keep data on your infrastructure, while cloud API deployments send data to third-party servers. Free-tier API access may use your inputs for model improvement. Review Meta's Responsible Use Guide and your chosen provider's data retention policies before processing sensitive information.
Enterprise deployments should evaluate data isolation, encryption at rest, and audit logging against compliance requirements (HIPAA, SOC 2, GDPR).
See the NIST AI Risk Management Framework for organizational AI governance guidance.
Under GDPR and CCPA, you have rights to access, correct, and delete your data. Check your deployment provider's data portability options.
TechJacks Solutions maintains editorial independence. This article was not sponsored or reviewed by Meta. TechJacks Solutions may earn referral fees from links to vendor products. These fees never influence editorial recommendations. For AI regulation context, see our EU AI Act overview.