How to Run Llama Locally: Complete Self-Hosting Guide (2026)

Last verified: May 7, 2026  ·  Format: Guide  ·  Est. time: 20-25 min

Running Meta's Llama models on your own hardware gives you full control over data privacy, eliminates per-token API costs, and lets you fine-tune on proprietary datasets without sending anything to a third party. The tradeoff is hardware: you need to match the right GPU, RAM, and storage to the model size you want to run.

This guide walks through the complete process, from checking whether your hardware qualifies to running your first local inference. We cover three deployment paths: Ollama for the fastest setup, llama.cpp for maximum control over quantization, and vLLM for production serving. By the end, you will have a working local Llama instance and a clear understanding of when self-hosting makes financial and operational sense versus using a managed API.

  • 5 GB: VRAM to run Llama 3.1 8B at INT4 quantization (Onyx AI Self-Hosted LLM Leaderboard, March 2026)
  • 38 GB: VRAM for Llama 3.3 70B at INT4 quantization (Onyx AI Self-Hosted LLM Leaderboard, March 2026)
  • 85-90%: quality retention at Q4 quantization vs full precision (Local AI Zone, October 2025)
  • ~50K: requests/day break-even for self-hosting vs API (BenchLM, 2026)

What You Need Before Starting

Before installing any software, verify your system meets the minimum hardware requirements for the Llama model size you plan to run. The single biggest factor is GPU VRAM: without sufficient video memory, you will need to rely on CPU inference (much slower) or use aggressive quantization that sacrifices output quality.

VRAM Requirements by Model

Model              Parameters         VRAM (INT4)   VRAM (FP16)   Context
Llama 3.1 8B       8B                 5 GB          16 GB         128K
Llama 3.3 70B      70B                38 GB         140 GB        128K
Llama 4 Scout      109B (17B active)  58 GB         218 GB        10M
Llama 4 Maverick   400B (17B active)  206 GB        800 GB        1M

VRAM data: Onyx AI Self-Hosted LLM Leaderboard, March 2026. FP16 = 2 bytes/param, INT4 = 0.5 bytes/param. Actual usage 10-20% higher due to KV cache and overhead.
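The table figures follow directly from bytes-per-parameter arithmetic. A quick back-of-envelope check in Python, treating the note's 10-20% KV cache and overhead allowance as a flat 15% (an assumption for illustration):

```python
def estimate_vram_gb(params_billion, bytes_per_param, overhead=0.15):
    """Weights-only size plus a flat 15% allowance for KV cache and runtime overhead."""
    return params_billion * bytes_per_param * (1 + overhead)

# Llama 3.1 8B at INT4 (0.5 bytes/param): ~4.6 GB, in line with the ~5 GB above
print(round(estimate_vram_gb(8, 0.5), 1))
# Llama 3.3 70B at FP16 (2 bytes/param): the 140 GB of weights plus overhead, ~161 GB
print(round(estimate_vram_gb(70, 2.0)))
```

Real quantization formats such as Q4_K_M store some tensors at higher precision, so actual files run slightly larger than the pure 0.5 bytes/param figure.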

System Requirements

  • GPU: NVIDIA RTX 3060+ (8 GB VRAM) for 8B models; RTX 4090 or A100 for 70B; H100 cluster for Llama 4
  • RAM: 16 GB minimum for 8B; 64-128 GB for 70B; 200 GB+ for 405B/Llama 4
  • Storage: 8-16 GB for 8B; 32-64 GB for 70B; 200 GB+ for enterprise models
  • OS: Linux (Ubuntu 22.04+), macOS 13+, or Windows 10/11 with WSL2
  • CUDA: NVIDIA drivers + CUDA toolkit for GPU acceleration

Prerequisites Checklist

  • GPU with sufficient VRAM for your target model (8 GB+ for 8B, 24 GB+ per GPU for 70B)
  • System RAM meets the minimum (16 GB for 8B, 64 GB for 70B, 200 GB+ for enterprise)
  • Free disk space for model weights (8 GB minimum for the smallest quantized models)
  • NVIDIA drivers and CUDA toolkit installed (verify with nvidia-smi)
  • Meta Llama Community License accepted at llama.meta.com
  • Python 3.10+ installed (required for vLLM and Hugging Face workflows)

Guide Overview

  • Step 1: Choose Your Model
  • Step 2: Install Ollama
  • Step 3: Build with llama.cpp
  • Step 4: Deploy with vLLM
  • Step 5: Docker Deployment
  • Step 6: Tune Performance
  • Step 7: Self-Host vs API Decision

License note: The Llama Community License prohibits using model outputs to train competing LLMs and requires a separate commercial license for products exceeding 700 million monthly active users, according to the license agreement. The Open Source Initiative does not recognize Llama's license as meeting the open-source definition due to these restrictions.

Step 1: Choose Your Model

Select the Llama variant that matches your hardware and use case. Meta currently maintains four active model generations.

For Learning and Prototyping

Llama 3.1 8B or Llama 3.2 3B run on consumer hardware with 8 GB+ VRAM and deliver solid general-purpose performance. The 8B variant scores 48.3 on MMLU-Pro and 80.4 on IFEval, according to the Onyx AI leaderboard. These are the models to start with if you are new to local inference.

For Production Workloads

Llama 3.3 70B is an instruct-tuned model with a 128K context window that scores 68.9 on MMLU-Pro and 92.1 on IFEval, according to the Onyx AI leaderboard. It requires workstation-class hardware but delivers strong reasoning and instruction-following capabilities.

For Enterprise and Research

Llama 4 Scout (109B total, 17B active per token) uses a Mixture-of-Experts architecture with 16 experts and a 10-million token context window, according to Meta. It fits on a single NVIDIA H100 GPU using INT4 quantization. Llama 4 Maverick (400B total, 17B active per token) uses 128 experts with a 1-million token context window and requires a multi-GPU cluster, according to Meta.

Recommendation: Start with Llama 3.1 8B via Ollama. It runs on most modern GPUs, downloads in minutes, and gives you a working mental model of local inference before scaling up.

Step 2: Install Ollama (Fastest Path)

Ollama is the easiest way to get Llama running locally. It handles model downloading, quantization, and serving in a single tool with a built-in RESTful API. Ollama provides optimized model loading, memory management, and cross-platform compatibility, according to Local AI Zone.

Installation

macOS and Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com and run the setup wizard.

Download and Run a Model

# Download and start Llama 3.1 8B (defaults to 4-bit quantization)
ollama run llama3.1

# List all downloaded models
ollama list

# Pull a specific model without starting chat
ollama pull llama3.1

# Remove a model from disk
ollama rm llama3.1

Ollama defaults to 4-bit quantization, so actual VRAM usage is closer to the INT4 figures in the table above rather than FP16.

Verify the Setup

After running ollama run llama3.1, type a test query at the prompt:

>>> What is GGUF quantization?

If you receive a coherent response within a few seconds, your setup is working correctly.

Use the Ollama API

Ollama serves an API on http://localhost:11434 by default. Test it with:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain the difference between INT4 and FP16 quantization in one paragraph."
}'
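
The same request can be issued from Python using only the standard library. A minimal sketch against the default /api/generate endpoint shown above; setting "stream": false asks Ollama for a single JSON object instead of its default newline-delimited stream:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model, prompt):
    # "stream": False returns one JSON object instead of newline-delimited chunks
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model, prompt, url=OLLAMA_URL):
    req = request.Request(
        url,
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With the Ollama server running:
# print(generate("llama3.1", "What is GGUF quantization?"))
```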

Verification: Run ollama list to confirm the model downloaded correctly. Check GPU usage with nvidia-smi to verify the model loaded onto your GPU rather than falling back to CPU.

Step 3: Build with llama.cpp (Advanced Users)

For maximum control over quantization, context length, and inference parameters, build and run llama.cpp directly. Created by Georgi Gerganov and released on March 10, 2023, llama.cpp is a C++ re-implementation that enables efficient inference on consumer CPUs and GPUs. The project introduced the GGUF file format, a binary format that stores quantized tensors and metadata for cross-platform efficiency.

Compile from Source

# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# CPU-only build
make

# NVIDIA GPU build (requires CUDA toolkit)
make LLAMA_CUDA=1

Download GGUF Models

GGUF-format models are available on Hugging Face. Key quantization options:

  • Q8_0: ~1 byte/param, 95%+ quality retention. Best for research and professional applications.
  • Q4_K_M: ~0.5 bytes/param, 85-90% quality retention. Most popular for general use (recommended).
  • Q2_K: ~0.25 bytes/param, 70-80% quality retention. For extremely constrained hardware.

Quality retention percentages: Local AI Zone (October 2025)

Run Inference

./llama-cli -m models/llama-3.1-8b-instruct-q4_k_m.gguf \
  -p "Explain the difference between INT4 and FP16 quantization" \
  -n 512 \
  --ctx-size 4096

Key parameters:

  • -m: Path to the GGUF model file
  • -p: Your prompt text
  • -n: Maximum tokens to generate
  • --ctx-size: Context window size (higher values use more VRAM)

Step 4: Deploy with vLLM (Production Serving)

For production environments requiring high throughput and an OpenAI-compatible API, vLLM provides optimized serving. According to Local AI Zone, vLLM includes PagedAttention for efficient memory management, continuous batching for improved throughput, and tensor parallelism for large model deployment. According to Meta, vLLM was one of the key community projects they partnered with to ensure production deployment readiness for Llama 3.1.

Installation

pip install vllm

Start the API Server

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1

This spins up an OpenAI-compatible endpoint on http://localhost:8000. You can query it with standard OpenAI SDK calls, making vLLM a drop-in replacement for API-based workflows.
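Because the endpoint is OpenAI-compatible, any HTTP client works, not just the OpenAI SDK. A standard-library sketch against the /v1/chat/completions route (the model name and port assume the server command above); the request shape mirrors the OpenAI chat format, which is exactly why existing clients work unchanged:

```python
import json
from urllib import request

VLLM_URL = "http://localhost:8000/v1/chat/completions"

def chat_payload(model, user_message, max_tokens=256):
    # Identical shape to an OpenAI chat completion request, so OpenAI SDK
    # clients can simply point their base URL at the vLLM server instead
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

def chat(model, user_message, url=VLLM_URL):
    req = request.Request(
        url,
        data=json.dumps(chat_payload(model, user_message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

# With the server from above running:
# print(chat("meta-llama/Meta-Llama-3.1-8B-Instruct", "Summarize PagedAttention."))
```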

Multi-GPU for Larger Models

# 70B model across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2

Step 5: Docker Deployment

Docker provides consistent deployment across environments and simplifies GPU passthrough. According to Local AI Zone, containerization is recommended for production deployments.

Ollama via Docker

# CPU-only deployment
docker run -d -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama

# With NVIDIA GPU support
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama

vLLM via Docker

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct

Tip: For Docker with NVIDIA GPUs, ensure the nvidia-container-toolkit package is installed and configured. Verify with docker run --gpus all nvidia/cuda:12.0-base nvidia-smi.

Step 6: Tune Performance

Once your model is running, adjust these parameters to balance speed, quality, and resource usage.

Quantization Tradeoffs

Quantization   Size/Param    Quality   Best For
FP16 (none)    2 bytes       100%      Research, maximum quality
Q8_0           ~1 byte       95%+      Professional applications
Q4_K_M         ~0.5 bytes    85-90%    General use (recommended)
Q2_K           ~0.25 bytes   70-80%    Extremely limited hardware

Source: Local AI Zone (October 2025)

Context Length and Memory

Increasing context length (--ctx-size in llama.cpp) consumes additional VRAM for the KV cache. Start with 4096 tokens and increase incrementally while monitoring GPU memory with nvidia-smi.
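
To see how fast the KV cache grows, you can compute its size from the model architecture. A sketch using Llama 3.1 8B's dimensions (32 layers, 8 KV heads under grouped-query attention, head dim 128; treat these as assumptions if you run a different variant) with an FP16 cache:

```python
def kv_cache_bytes(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_val=2):
    # Keys and values are each cached: 2 tensors of n_kv_heads * head_dim
    # values per layer, per token, at bytes_per_val (2 for an FP16 cache)
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val * ctx_len

# Llama 3.1 8B dims: 32 layers, 8 KV heads (GQA), head dim 128
print(kv_cache_bytes(4096, 32, 8, 128) / 2**30)    # 0.5 (GiB) at a 4096-token context
print(kv_cache_bytes(131072, 32, 8, 128) / 2**30)  # 16.0 (GiB) at the full 128K window
```

The jump from half a gigabyte to 16 GiB is why filling the full 128K window on a consumer GPU is rarely practical, and why increasing --ctx-size incrementally is the safer approach.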

CPU Offloading

If your GPU VRAM is insufficient for the full model, both llama.cpp and Ollama support partial CPU offloading. In llama.cpp, use the --n-gpu-layers flag (short form -ngl):

# Offload 20 transformer layers to GPU, rest on CPU
./llama-cli -m model.gguf -ngl 20 -p "Your prompt here"

More layers on GPU means faster inference but higher VRAM usage. Find the maximum number of layers your GPU can handle, then set -ngl accordingly.
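
One rough way to pick a starting -ngl value is to split the model file size evenly across layers, hold back some VRAM for the KV cache and CUDA runtime, and see how many layers fit. This is a heuristic, not how llama.cpp actually allocates memory, so adjust while watching nvidia-smi:

```python
def max_gpu_layers(free_vram_gb, model_file_gb, n_layers, reserve_gb=1.0):
    # Crude heuristic: assume every layer takes an equal share of the file,
    # and hold back reserve_gb for the KV cache and CUDA runtime
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# 8 GB card, ~4.9 GB Q4_K_M file, 32 layers: everything fits on the GPU
print(max_gpu_layers(8.0, 4.9, 32))   # 32
# 4 GB card: only a partial offload is possible
print(max_gpu_layers(4.0, 4.9, 32))   # 19
```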

Step 7: When to Self-Host vs Use Cloud API

Self-hosting is not always the right call. Use this framework to decide.

Self-Host When

  • Regulated industries: Healthcare, finance, or defense environments requiring data sovereignty where data cannot leave your infrastructure
  • High volume: Your usage exceeds roughly 50,000 requests per day at 1,000 tokens each, the point at which API costs begin to exceed fixed hardware costs, according to BenchLM estimates
  • Custom fine-tuning: You need to train on proprietary data that cannot be shared with API providers
  • Airgapped operation: Environments with no internet access or strict network isolation requirements
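
The break-even point is simple arithmetic once you fix a blended API price and an amortized server cost. The dollar figures below are illustrative assumptions, not vendor quotes; they happen to reproduce the ~50K figure:

```python
def breakeven_requests_per_day(monthly_server_cost, tokens_per_request, api_price_per_mtok):
    # Daily request volume at which 30 days of API spend equals the fixed server cost
    cost_per_request = tokens_per_request / 1e6 * api_price_per_mtok
    return monthly_server_cost / (30 * cost_per_request)

# Hypothetical inputs: $300/month amortized GPU server, 1,000-token requests,
# $0.20 blended price per million tokens
print(round(breakeven_requests_per_day(300, 1000, 0.20)))  # 50000
```

Plug in your own provider's pricing and your amortized hardware cost; the crossover moves linearly with both.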

Use Managed APIs When

  • Low to moderate volume: Under approximately 1 million tokens per day
  • Rapid prototyping: You need to iterate quickly without infrastructure setup
  • No ML team: Your organization lacks dedicated machine learning infrastructure expertise
  • Maximum quality: Proprietary models like GPT-4o or Claude still lead on some reasoning benchmarks

Cloud providers including AWS Bedrock, Google Vertex AI, and Azure AI offer managed Llama deployments that eliminate the hardware investment while maintaining access to the Llama model family.

Troubleshooting

Common Issues

"CUDA out of memory" errors
Reduce context length, switch to a smaller quantization (Q4 instead of Q8), or use --n-gpu-layers to offload some transformer layers to CPU. Verify current GPU memory usage with nvidia-smi. If running Ollama, try a smaller model variant first.

Slow generation speed on CPU
CPU inference on models larger than 8B parameters will be noticeably slow. Use Q4_K_M or lower quantization and ensure your CPU supports AVX2 instructions for llama.cpp optimizations. Consider upgrading to a GPU with sufficient VRAM for your target model.

Model download failures or 403 errors
Llama model download links from Meta expire after 24 hours and have limited download counts, according to the Meta GitHub documentation. Re-request the signed URL from the Meta website if you encounter 403: Forbidden errors. For Hugging Face downloads, ensure you have accepted the license agreement on the model page.

Ollama: "model not found"
Run ollama list to check available models. If the model is not listed, pull it with ollama pull llama3.1. Verify your internet connection and that the Ollama service is running in the background.

vLLM crashes on startup
Verify your CUDA version matches vLLM's requirements. Check nvidia-smi output and compare against vLLM's compatibility matrix. Ensure sufficient VRAM for the model plus 10-20% overhead for KV cache and framework internals.

Running Llama locally is now accessible to anyone with a modern GPU. Ollama gets you from zero to inference in under five minutes. llama.cpp gives you granular control over quantization and memory allocation. vLLM provides the production-grade serving layer for team and enterprise deployments. The key decision is matching your model size to available hardware, using quantization to bridge the gap when full-precision weights exceed your VRAM budget.

Start with Llama 3.1 8B on Ollama to learn the workflow, then scale up to 70B or Llama 4 Scout as your hardware and requirements grow. The self-hosting ecosystem around Llama is mature, and switching between tools and quantization levels is straightforward, giving you the flexibility to optimize for cost, quality, or latency depending on your workload.

Before You Use AI
Your Privacy

Self-hosted Llama models keep data on your infrastructure, but cloud API deployments send data to third-party servers. Free-tier API access may use your inputs for model improvement. Review Meta's Responsible Use Guide and your chosen provider's data retention policies before processing sensitive information.

Enterprise deployments should evaluate data isolation, encryption at rest, and audit logging against compliance requirements (HIPAA, SOC 2, GDPR).

Mental Health & AI Dependency

AI models are tools, not therapists or companions. If you or someone you know is in crisis:

  • 988 Suicide & Crisis Lifeline: Call or text 988
  • SAMHSA Helpline: 1-800-662-4357
  • Crisis Text Line: Text HOME to 741741

See the NIST AI Risk Management Framework for organizational AI governance guidance.

Your Rights & Our Transparency

Under GDPR and CCPA, you have rights to access, correct, and delete your data. Check your deployment provider's data portability options.

TechJacks Solutions maintains editorial independence. This article was not sponsored or reviewed by Meta. TechJacks Solutions may earn referral fees from links to vendor products. These fees never influence editorial recommendations. For AI regulation context, see our EU AI Act overview.