Inside Llama's Training and Fine-Tuning Process (2026)
Last verified: June 2026 · Format: Guide · Est. time: 18-22 min
Every Llama model you download is the product of two very different processes: a massive, expensive pretraining run that Meta performs on tens of thousands of GPUs, followed by a much lighter post-training stage that turns a raw next-token predictor into a helpful, safety-aligned assistant. Understanding where that boundary sits is the key to fine-tuning effectively, because the part you can realistically adapt on your own hardware is the second part, not the first.
This guide walks through the full pipeline as Meta documents it, from the raw token counts behind Llama 2, Llama 3.1, and Llama 4, through the post-training methods Meta uses to align each release, and into the practical techniques developers use to adapt these models: LoRA, QLoRA, and GGUF quantization. By the end you will know which fine-tuning method fits your GPU, how to prepare a run, and which known failure modes to test for before you ship anything.
What You Need Before You Start
You cannot reproduce Llama's pretraining at home, and you do not need to. Fine-tuning starts from Meta's released base or instruct weights and adapts them to your task. Before you begin, make five decisions: which adaptation method to use, what quantization to target, where to get Meta's reference scripts, how much GPU memory you have, and how you will measure success.
- ✓Step 1: Pick a method by available VRAM
- ✓Step 2: Prepare and format your data
- ✓Step 3: Run a LoRA or QLoRA pass
- ✓Step 4: Quantize to GGUF for serving
- ✓Step 5: Evaluate and test failure modes
How Meta Pretrains Llama
Pretraining is the stage where a model learns language, facts, and reasoning patterns by predicting the next token across trillions of words. This is the part that costs millions of dollars and occupies GPU clusters for weeks, and it is the part that scaled dramatically across Llama generations.
Llama 2 was trained on 2 trillion tokens in July 2023. Meta states it excluded Meta user data from the pretraining mix. Llama 3.1, released in July 2024, jumped to 15 trillion tokens and was trained on 16,000 NVIDIA H100 GPUs, with web data filtered by a quality classifier that was itself built on Llama 2.
Llama 4, released in April 2025, marks the largest architectural shift. Meta reports more than 30 trillion tokens of pretraining data and a move to a Mixture-of-Experts design with early-fusion multimodality, meaning text, image, and video tokens are processed together rather than bolted on afterward. Llama 4 supports 200 languages, uses FP8 precision during training, and was distilled from a 2-trillion-parameter teacher model called Behemoth, which Meta says was trained on 32,000 GPUs. Notably, the Llama 4 data mix now includes public Facebook and Instagram posts plus Meta AI interactions, a departure from the user-data exclusion Meta described for Llama 2.
Two token figures, two sources: Meta's own materials describe Llama 4 pretraining as "more than 30 trillion tokens" overall. Microsoft's model catalog lists roughly 40 trillion tokens for Llama 4 Scout and roughly 22 trillion for Maverick. Both figures are presented here as published; they describe different variants and different accounting, so treat the headline as a range rather than a single number.
Pretraining at a Glance
| Generation | Pretraining Tokens | Training Hardware | Post-Training Approach |
|---|---|---|---|
| Llama 2 (Jul 2023) | 2T | Not individually itemized; no Meta user data | SFT + rejection sampling + DPO/PPO |
| Llama 3.1 (Jul 2024) | 15T | 16,000 NVIDIA H100 (405B) | SFT + rejection sampling + DPO/PPO (RLHF-V1..V5) |
| Llama 4 (Apr 2025) | 30T+ (Meta) | FP8 MoE; Behemoth teacher on 32,000 GPUs | Lightweight SFT → online RL → lightweight DPO |
Token counts: Meta Llama documentation. Microsoft's catalog separately lists ~40T (Scout) / ~22T (Maverick) for Llama 4. Both figures published; presented as a range.
Post-Training: From Raw Model to Assistant
A freshly pretrained model can complete text but does not reliably follow instructions or refuse harmful requests. Post-training fixes that. This is also the conceptual blueprint your own fine-tuning imitates, just at a far smaller scale.
The Llama 2 and Llama 3.1 Recipe
For Llama 2 and Llama 3.1, Meta used a multi-stage alignment pipeline: supervised fine-tuning (SFT) on curated instruction data, rejection sampling to keep the best generated responses, and preference optimization through Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO), two methods for teaching the model which answers people prefer, across several reward-model iterations of reinforcement learning from human feedback (RLHF) that Meta labels RLHF-V1 through RLHF-V5. Safety was reinforced through context distillation, where the model learns to internalize safe behavior demonstrated by a safety-prompted teacher.
The Llama 4 Revamp
Meta rebuilt the post-training pipeline for Llama 4 around a leaner sequence: a lightweight SFT pass, then online reinforcement learning, then a lightweight DPO pass. Crucially, Meta says it dropped more than 50% of the "easy" SFT examples on the theory that they added cost without improving capability, concentrating effort on harder cases. The smaller Llama 4 models were codistilled from the Behemoth teacher, inheriting its capabilities without its parameter count.
Why this matters for you: Your fine-tuning is essentially a compressed version of stage one, SFT, sometimes followed by a small DPO pass. You are not redoing pretraining; you are nudging an already-aligned model toward your domain. Keeping your dataset small, clean, and hard is the same lesson Meta applied when it cut easy SFT data.
Developer Fine-Tuning: LoRA and QLoRA
Full fine-tuning updates every weight in the model, which means you need enough memory to hold the model, its gradients, and optimizer states all at once. For anything above a small model, that is impractical on a single GPU. Parameter-efficient fine-tuning (PEFT) solves this.
LoRA: Freeze the Base, Train Small Adapters
LoRA (Low-Rank Adaptation) freezes the base model's weights entirely and injects small, low-rank trainable matrices into the layers. Only those adapter matrices are trained, which slashes the number of updated parameters by orders of magnitude. Because the base stays frozen, memory pressure drops sharply and you can keep multiple task-specific adapters for one shared base model.
QLoRA: Quantize First, Then Adapt
QLoRA goes further by quantizing the frozen base model to a lower precision before attaching the LoRA adapters. The combination is what makes single-GPU fine-tuning of a large model realistic: the quantized base fits in roughly 48 GB of VRAM, and the trainable adapters add little on top. For most developers without a multi-GPU cluster, QLoRA is the default path to adapting a big Llama model.
Step 1: Pick a Method by Available VRAM
Match the method to the GPU you actually have. The table below maps the three approaches to their memory profile and the situation each one suits.
| Method | VRAM Profile | Relative Speed | When to Use |
|---|---|---|---|
| Full fine-tuning | Highest; needs weights + gradients + optimizer states (often multi-GPU) | Slowest per result, most thorough | You have cluster-class hardware and need to change the model deeply |
| LoRA | Moderate; base frozen, only adapters trained | Fast | You have a mid-to-large GPU and want swappable task adapters |
| QLoRA | Lowest; quantized base fits a single ~48 GB GPU | Fast, slight overhead from quantization | You have one large-VRAM GPU and need to tune a big model |
Source: developer fine-tuning guides (PEFT/LoRA/QLoRA), 2026. Exact VRAM depends on model size, sequence length, and batch size.
Step 2: Prepare and Format Your Data
Quality beats quantity. Meta's own Llama 4 work showed that cutting easy examples and keeping hard ones improved results. Format your examples to match the model's expected instruction template, deduplicate aggressively, and hold out a representative slice for evaluation before you train on the rest.
Step 3: Run a LoRA or QLoRA Pass
Meta publishes reference fine-tuning scripts in its llama-models repository and the companion Llama Cookbook, which cover end-to-end PEFT workflows. For a PyTorch-native option, torchtune is the official PyTorch fine-tuning library and is a common starting point for LoRA and QLoRA recipes. Start from a working reference recipe rather than writing the training loop from scratch, then change only the data path, the adapter rank, and the learning rate for your first run.
Scope note: torchtune is the recognized PyTorch-native fine-tuning library, but its specific recipe details fall outside the primary vendor documentation cited for this guide. Treat the library as the official PyTorch-native option and confirm exact commands and config flags against its current documentation before you run.
Step 4: Quantize to GGUF for Serving
Once you have a fine-tuned model, quantization shrinks it so it fits and runs efficiently on the hardware you serve from. The dominant format for local serving is GGUF, produced and consumed through the llama.cpp ecosystem and tools built on it such as Ollama and LM Studio.
Quantization is a quality-versus-size tradeoff. At INT4, weights occupy roughly 0.5 bytes per parameter and third-party testing generally reports around 85 to 90 percent of full-precision performance retained. Stepping up to Q8 lifts that to roughly 95 percent or better at the cost of a larger footprint. These retention figures are community generalizations rather than a Meta guarantee, and the real number depends on your specific task, so always confirm against your own evaluation set.
| Quantization | Approx. Size/Param | Reported Retention | Best For |
|---|---|---|---|
| FP16 (none) | 2 bytes | Reference (100%) | Evaluation baseline, maximum fidelity |
| Q8 | ~1 byte | ~95%+ | Quality-sensitive serving with memory to spare |
| INT4 | ~0.5 bytes | ~85-90% | Tight VRAM budgets, single-GPU serving |
Retention percentages are third-party generalizations, not vendor guarantees. Verify against your own evaluation set.
The Hardware Reality, Training and Inference
The gap between what Meta uses to train and what you need to serve is enormous, and it is worth seeing both ends side by side.
Training is cluster territory. Llama 3.1 405B was trained on 16,000 NVIDIA H100 GPUs, and the Llama 4 Behemoth teacher used 32,000 GPUs running FP8. These numbers are not something an individual or most companies replicate; they are why fine-tuning, not pretraining, is the realistic lever.
Inference is far more attainable, especially with quantization. Meta reports that Llama 4 Scout (109B total) fits on a single H100 80GB GPU at INT4. Maverick (400B total) needs two to four H100 GPUs, or a single H100 DGX system at FP8. Running Llama 4 at full bf16 precision requires at least four GPUs. Treat all of these as reference configurations: in practice, KV cache and activations push real memory use roughly 10 to 20 percent higher than the headline figures.
Practical takeaway: If you can serve a quantized model on one or two GPUs, you can almost certainly fine-tune a similarly sized model with QLoRA on comparable hardware. Size your fine-tuning plan against your serving hardware, not against Meta's training cluster.
Known Limitations to Test For
Fine-tuning adapts a model, but it does not erase the constraints baked in during pretraining and alignment. Build your evaluation set to probe these directly before you trust the model in production.
Older Llama generations are additionally more fragile in non-English languages than Llama 4, which expanded multilingual coverage to 200 languages. If your task is multilingual, evaluate per language rather than assuming uniform quality.
Troubleshooting and FAQ
Llama's pipeline splits cleanly into a stage you consume and a stage you control. Pretraining, with its 15-trillion-token runs and 16,000-GPU clusters, is fixed once Meta ships the weights. Post-training and fine-tuning are where your work lives, and the modern toolchain of LoRA, QLoRA, and GGUF quantization makes adapting and serving a large model realistic on a single high-memory GPU.
Match your method to your hardware, keep your dataset small and hard the way Meta did when it cut easy SFT data, quantize deliberately, and test for the documented failure modes before you ship. Do that, and you get a model shaped to your task without ever touching a training cluster.
Llama, Meta Llama, Meta AI, Facebook, Instagram, and related marks are trademarks of Meta Platforms, Inc. NVIDIA and H100 are trademarks of NVIDIA Corporation. Other product names referenced are trademarks of their respective owners. This article is not affiliated with or endorsed by Meta.