Gallery

Contacts

405 W. Greenlawn Ave Lansing, Michigan 48910

contact@techjacksolutions.com

+1-616-320-4064

Meta Llama

Inside Llama's Training and Fine-Tuning Process (2026)

Last verified: June 2026  ·  Format: Guide  ·  Est. time: 18-22 min

Every Llama model you download is the product of two very different processes: a massive, expensive pretraining run that Meta performs on tens of thousands of GPUs, followed by a much lighter post-training stage that turns a raw next-token predictor into a helpful, safety-aligned assistant. Understanding where that boundary sits is the key to fine-tuning effectively, because the part you can realistically adapt on your own hardware is the second part, not the first.

This guide walks through the full pipeline as Meta documents it, from the raw token counts behind Llama 2, Llama 3.1, and Llama 4, through the post-training methods Meta uses to align each release, and into the practical techniques developers use to adapt these models: LoRA, QLoRA, and GGUF quantization. By the end you will know which fine-tuning method fits your GPU, how to prepare a run, and which known failure modes to test for before you ship anything.

15T
Pretraining tokens for Llama 3.1, up from 2T for Llama 2
Source: Meta Llama documentation (Jul 2024)
16,000
NVIDIA H100 GPUs used to train Llama 3.1 405B
Source: Meta Llama documentation (Jul 2024)
~48 GB
VRAM a single GPU needs to QLoRA-tune a large model
Source: Developer fine-tuning guides (2026)
85-90%
Performance retained at INT4 (~0.5 bytes/param)
Third-party generalization, not a Meta guarantee

What You Need Before You Start

You cannot reproduce Llama's pretraining at home, and you do not need to. Fine-tuning starts from Meta's released base or instruct weights and adapts them to your task. Before you begin, make five decisions: which adaptation method to use, what quantization to target, where to get Meta's reference scripts, how much GPU memory you have, and how you will measure success.

Prerequisites Checklist
Chosen an adaptation method: full fine-tuning, LoRA, or QLoRA
Picked a target quantization for serving (Q8 for quality, INT4 for footprint)
Cloned Meta's llama-models repository and the Llama Cookbook reference scripts
Sized your GPU against the model and method (a ~48 GB card covers QLoRA on large models)
Defined an evaluation set that reflects your real task, not just generic benchmarks
Accepted the Llama Community License at llama.com
0 of 6 complete
Fine-Tuning Workflow Progress
0 of 5 steps complete
  • Step 1: Pick a method by available VRAM
  • Step 2: Prepare and format your data
  • Step 3: Run a LoRA or QLoRA pass
  • Step 4: Quantize to GGUF for serving
  • Step 5: Evaluate and test failure modes

How Meta Pretrains Llama

Pretraining is the stage where a model learns language, facts, and reasoning patterns by predicting the next token across trillions of words. This is the part that costs millions of dollars and occupies GPU clusters for weeks, and it is the part that scaled dramatically across Llama generations.

Llama 2 was trained on 2 trillion tokens in July 2023. Meta states it excluded Meta user data from the pretraining mix. Llama 3.1, released in July 2024, jumped to 15 trillion tokens and was trained on 16,000 NVIDIA H100 GPUs, with web data filtered by a quality classifier that was itself built on Llama 2.

Llama 4, released in April 2025, marks the largest architectural shift. Meta reports more than 30 trillion tokens of pretraining data and a move to a Mixture-of-Experts design with early-fusion multimodality, meaning text, image, and video tokens are processed together rather than bolted on afterward. Llama 4 supports 200 languages, uses FP8 precision during training, and was distilled from a 2-trillion-parameter teacher model called Behemoth, which Meta says was trained on 32,000 GPUs. Notably, the Llama 4 data mix now includes public Facebook and Instagram posts plus Meta AI interactions, a departure from the user-data exclusion Meta described for Llama 2.

Two token figures, two sources: Meta's own materials describe Llama 4 pretraining as "more than 30 trillion tokens" overall. Microsoft's model catalog lists roughly 40 trillion tokens for Llama 4 Scout and roughly 22 trillion for Maverick. Both figures are presented here as published; they describe different variants and different accounting, so treat the headline as a range rather than a single number.

Pretraining at a Glance

Generation Pretraining Tokens Training Hardware Post-Training Approach
Llama 2 (Jul 2023) 2T Not individually itemized; no Meta user data SFT + rejection sampling + DPO/PPO
Llama 3.1 (Jul 2024) 15T 16,000 NVIDIA H100 (405B) SFT + rejection sampling + DPO/PPO (RLHF-V1..V5)
Llama 4 (Apr 2025) 30T+ (Meta) FP8 MoE; Behemoth teacher on 32,000 GPUs Lightweight SFT → online RL → lightweight DPO

Token counts: Meta Llama documentation. Microsoft's catalog separately lists ~40T (Scout) / ~22T (Maverick) for Llama 4. Both figures published; presented as a range.

Post-Training: From Raw Model to Assistant

A freshly pretrained model can complete text but does not reliably follow instructions or refuse harmful requests. Post-training fixes that. This is also the conceptual blueprint your own fine-tuning imitates, just at a far smaller scale.

The Llama 2 and Llama 3.1 Recipe

For Llama 2 and Llama 3.1, Meta used a multi-stage alignment pipeline: supervised fine-tuning (SFT) on curated instruction data, rejection sampling to keep the best generated responses, and preference optimization through Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO), two methods for teaching the model which answers people prefer, across several reward-model iterations of reinforcement learning from human feedback (RLHF) that Meta labels RLHF-V1 through RLHF-V5. Safety was reinforced through context distillation, where the model learns to internalize safe behavior demonstrated by a safety-prompted teacher.

The Llama 4 Revamp

Meta rebuilt the post-training pipeline for Llama 4 around a leaner sequence: a lightweight SFT pass, then online reinforcement learning, then a lightweight DPO pass. Crucially, Meta says it dropped more than 50% of the "easy" SFT examples on the theory that they added cost without improving capability, concentrating effort on harder cases. The smaller Llama 4 models were codistilled from the Behemoth teacher, inheriting its capabilities without its parameter count.

Why this matters for you: Your fine-tuning is essentially a compressed version of stage one, SFT, sometimes followed by a small DPO pass. You are not redoing pretraining; you are nudging an already-aligned model toward your domain. Keeping your dataset small, clean, and hard is the same lesson Meta applied when it cut easy SFT data.

Developer Fine-Tuning: LoRA and QLoRA

Full fine-tuning updates every weight in the model, which means you need enough memory to hold the model, its gradients, and optimizer states all at once. For anything above a small model, that is impractical on a single GPU. Parameter-efficient fine-tuning (PEFT) solves this.

LoRA: Freeze the Base, Train Small Adapters

LoRA (Low-Rank Adaptation) freezes the base model's weights entirely and injects small, low-rank trainable matrices into the layers. Only those adapter matrices are trained, which slashes the number of updated parameters by orders of magnitude. Because the base stays frozen, memory pressure drops sharply and you can keep multiple task-specific adapters for one shared base model.

QLoRA: Quantize First, Then Adapt

QLoRA goes further by quantizing the frozen base model to a lower precision before attaching the LoRA adapters. The combination is what makes single-GPU fine-tuning of a large model realistic: the quantized base fits in roughly 48 GB of VRAM, and the trainable adapters add little on top. For most developers without a multi-GPU cluster, QLoRA is the default path to adapting a big Llama model.

Step 1: Pick a Method by Available VRAM

Match the method to the GPU you actually have. The table below maps the three approaches to their memory profile and the situation each one suits.

Method VRAM Profile Relative Speed When to Use
Full fine-tuning Highest; needs weights + gradients + optimizer states (often multi-GPU) Slowest per result, most thorough You have cluster-class hardware and need to change the model deeply
LoRA Moderate; base frozen, only adapters trained Fast You have a mid-to-large GPU and want swappable task adapters
QLoRA Lowest; quantized base fits a single ~48 GB GPU Fast, slight overhead from quantization You have one large-VRAM GPU and need to tune a big model

Source: developer fine-tuning guides (PEFT/LoRA/QLoRA), 2026. Exact VRAM depends on model size, sequence length, and batch size.

Step 2: Prepare and Format Your Data

Quality beats quantity. Meta's own Llama 4 work showed that cutting easy examples and keeping hard ones improved results. Format your examples to match the model's expected instruction template, deduplicate aggressively, and hold out a representative slice for evaluation before you train on the rest.

Step 3: Run a LoRA or QLoRA Pass

Meta publishes reference fine-tuning scripts in its llama-models repository and the companion Llama Cookbook, which cover end-to-end PEFT workflows. For a PyTorch-native option, torchtune is the official PyTorch fine-tuning library and is a common starting point for LoRA and QLoRA recipes. Start from a working reference recipe rather than writing the training loop from scratch, then change only the data path, the adapter rank, and the learning rate for your first run.

Scope note: torchtune is the recognized PyTorch-native fine-tuning library, but its specific recipe details fall outside the primary vendor documentation cited for this guide. Treat the library as the official PyTorch-native option and confirm exact commands and config flags against its current documentation before you run.

Step 4: Quantize to GGUF for Serving

Once you have a fine-tuned model, quantization shrinks it so it fits and runs efficiently on the hardware you serve from. The dominant format for local serving is GGUF, produced and consumed through the llama.cpp ecosystem and tools built on it such as Ollama and LM Studio.

Quantization is a quality-versus-size tradeoff. At INT4, weights occupy roughly 0.5 bytes per parameter and third-party testing generally reports around 85 to 90 percent of full-precision performance retained. Stepping up to Q8 lifts that to roughly 95 percent or better at the cost of a larger footprint. These retention figures are community generalizations rather than a Meta guarantee, and the real number depends on your specific task, so always confirm against your own evaluation set.

Quantization Approx. Size/Param Reported Retention Best For
FP16 (none) 2 bytes Reference (100%) Evaluation baseline, maximum fidelity
Q8 ~1 byte ~95%+ Quality-sensitive serving with memory to spare
INT4 ~0.5 bytes ~85-90% Tight VRAM budgets, single-GPU serving

Retention percentages are third-party generalizations, not vendor guarantees. Verify against your own evaluation set.

The Hardware Reality, Training and Inference

The gap between what Meta uses to train and what you need to serve is enormous, and it is worth seeing both ends side by side.

Training is cluster territory. Llama 3.1 405B was trained on 16,000 NVIDIA H100 GPUs, and the Llama 4 Behemoth teacher used 32,000 GPUs running FP8. These numbers are not something an individual or most companies replicate; they are why fine-tuning, not pretraining, is the realistic lever.

Inference is far more attainable, especially with quantization. Meta reports that Llama 4 Scout (109B total) fits on a single H100 80GB GPU at INT4. Maverick (400B total) needs two to four H100 GPUs, or a single H100 DGX system at FP8. Running Llama 4 at full bf16 precision requires at least four GPUs. Treat all of these as reference configurations: in practice, KV cache and activations push real memory use roughly 10 to 20 percent higher than the headline figures.

Practical takeaway: If you can serve a quantized model on one or two GPUs, you can almost certainly fine-tune a similarly sized model with QLoRA on comparable hardware. Size your fine-tuning plan against your serving hardware, not against Meta's training cluster.

Known Limitations to Test For

Fine-tuning adapts a model, but it does not erase the constraints baked in during pretraining and alignment. Build your evaluation set to probe these directly before you trust the model in production.

Limitations and Failure Modes
Fixed Knowledge Cutoff
Llama 4 carries an official knowledge cutoff of August 2024. Fine-tuning on recent data adds facts at the edges but does not retrain the base, so the model has no inherent awareness of events after its cutoff. Pair it with retrieval for current information.
False Refusals on Borderline Prompts
Meta reports false refusals of roughly 0.05% on its helpfulness set, with a higher rate on genuinely borderline prompts. Aggressive safety tuning can push this up, so test that your fine-tuned model still answers legitimate edge-case questions.
Tokenization and Number Comparison
Third-party testers documented Llama 3.1 405B concluding that 9.11 is greater than 9.9, a known tokenization-driven arithmetic error. Do not assume fine-tuning fixes numeric reasoning; test math-sensitive paths explicitly.
Lost-in-the-Middle and Long Context
Independent analyses report long-context degradation where information in the middle of a long input is recalled less reliably than information at the ends. Snowball hallucination and errors passing values between sequential tool calls are also documented. Validate long-context and multi-step tool tasks before relying on them.

Older Llama generations are additionally more fragile in non-English languages than Llama 4, which expanded multilingual coverage to 200 languages. If your task is multilingual, evaluate per language rather than assuming uniform quality.

Troubleshooting and FAQ

Common Questions
Do I need to pretrain anything to fine-tune Llama?+
No. You start from Meta's released base or instruct weights and adapt them. Pretraining the 15-trillion-token base requires thousands of H100 GPUs and is Meta's job, not yours. Your work happens in the post-training and PEFT layer, typically with LoRA or QLoRA.
My GPU runs out of memory during a LoRA run. What now?+
Switch to QLoRA, which quantizes the frozen base so a large model fits on a single roughly 48 GB GPU. You can also reduce sequence length, lower the batch size, or pick a smaller model variant. Remember that real memory use runs about 10 to 20 percent above headline figures because of KV cache and activations.
Which library should I use to run the fine-tune?+
Meta's llama-models repository and the Llama Cookbook provide reference PEFT scripts. For a PyTorch-native workflow, torchtune is the official PyTorch fine-tuning library. Confirm exact recipe details against each project's current documentation before running.
How much quality do I lose by quantizing to INT4?+
Third-party testing generally reports around 85 to 90 percent of full-precision performance retained at INT4 (~0.5 bytes/param), rising to roughly 95 percent or better at Q8. These are community generalizations, not Meta guarantees, so measure on your own evaluation set rather than trusting a single percentage.
Why do the Llama 4 token counts differ between sources?+
Meta's materials describe Llama 4 pretraining as more than 30 trillion tokens overall, while Microsoft's model catalog lists roughly 40 trillion for Scout and roughly 22 trillion for Maverick. These describe different variants and different accounting. Treat the figure as a published range rather than one exact number.

Llama's pipeline splits cleanly into a stage you consume and a stage you control. Pretraining, with its 15-trillion-token runs and 16,000-GPU clusters, is fixed once Meta ships the weights. Post-training and fine-tuning are where your work lives, and the modern toolchain of LoRA, QLoRA, and GGUF quantization makes adapting and serving a large model realistic on a single high-memory GPU.

Match your method to your hardware, keep your dataset small and hard the way Meta did when it cut easy SFT data, quantize deliberately, and test for the documented failure modes before you ship. Do that, and you get a model shaped to your task without ever touching a training cluster.

Fact-checked against vendor documentation and developer guides, June 2026.

Llama, Meta Llama, Meta AI, Facebook, Instagram, and related marks are trademarks of Meta Platforms, Inc. NVIDIA and H100 are trademarks of NVIDIA Corporation. Other product names referenced are trademarks of their respective owners. This article is not affiliated with or endorsed by Meta.