Gemma Fine-Tuning Guide: LoRA, QLoRA & Deployment
Fine-tuning turns a general-purpose Gemma model into a specialist that follows your formatting rules, speaks your domain language, and handles your specific tasks without lengthy system prompts. This guide walks through the entire pipeline: picking the right model, configuring LoRA or QLoRA adapters, preparing your training data, running the training loop, and exporting a production-ready model for Ollama or llama.cpp. Every code block runs on a single consumer GPU.
When to Fine-Tune (and When Not To)
Fine-tuning makes sense when you need the model to consistently produce a specific output format, adopt domain vocabulary, or perform a narrow task that prompting alone cannot reliably achieve. The most common wins include enforcing JSON schema compliance, adapting tone for customer support, and teaching industry-specific classification.
However, fine-tuning is the wrong tool for several common scenarios. If the task requires knowledge that changes frequently, retrieval-augmented generation (RAG) is a better fit because you can update the knowledge base without retraining. If you have fewer than 50 quality examples, few-shot prompting will outperform a fine-tune that overfits on sparse data. And if the base model already handles your query well with the right prompt, the engineering cost of maintaining a fine-tuned model is not justified.
Rule of thumb: Fine-tune when you need consistent behavior on a specific task. Use RAG when you need fresh facts. Use prompting when the base model gets close enough.
LoRA vs QLoRA Explained
Both LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) avoid updating the full model weights. Instead, they freeze the base model and inject small trainable matrices into selected linear layers, typically the attention projections and optionally the MLP layers. The difference is how they store the frozen weights.
LoRA keeps the base model in its original precision (typically FP16 or BF16). It trains adapter matrices that represent only 0.1% to 1% of the total parameter count. The adapters themselves are small and fast to train, but the full-precision base model still occupies significant VRAM.
QLoRA adds 4-bit NormalFloat quantization to the base weights before attaching the same LoRA adapters. This cuts VRAM usage by roughly 60% compared to standard LoRA while introducing less than 1% quality degradation on most benchmarks. The adapters still train in 16-bit, so gradient computation remains stable.
| Aspect | LoRA | QLoRA |
|---|---|---|
| Base weights | FP16 / BF16 | 4-bit NF4 |
| Adapter precision | FP16 | FP16 (same) |
| VRAM for 31B | ~40 GB | ~18 GB |
| Quality vs full FT | <0.5% loss | <1% loss |
| Training speed | Baseline | Slightly slower (dequant overhead) |
| Best for | Max quality, multi-GPU setups | Single-GPU, cost-sensitive |
For the vast majority of use cases, QLoRA is the right choice. The quality difference is negligible for task-specific fine-tuning, and the hardware savings are substantial. This guide defaults to QLoRA in all code examples. If you have access to multi-GPU servers and need every fraction of a percent of quality, switch to standard LoRA by removing the quantization config.
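To make the "0.1% to 1%" adapter size figure concrete, here is a minimal sketch of the arithmetic for a single linear layer. The layer dimensions are hypothetical values chosen for illustration, not real Gemma dimensions, and the helper name `lora_param_count` is an assumption of this example.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    LoRA learns two matrices: A (rank x d_in) and B (d_out x rank).
    """
    return rank * d_in + d_out * rank


# Hypothetical layer size chosen for illustration, not a real Gemma dimension.
d_in, d_out, rank = 4096, 4096, 16

frozen = d_in * d_out                            # weights in the frozen layer
trainable = lora_param_count(d_in, d_out, rank)  # weights in the adapter pair
fraction = trainable / frozen

print(f"frozen={frozen:,} trainable={trainable:,} fraction={fraction:.2%}")
```

For this toy layer the adapter holds under 1% of the layer's weights. The fraction for a whole model depends on its real layer shapes and on which modules receive adapters.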
Choose Your Model
Gemma ships in multiple sizes. Picking the right one depends on your VRAM budget and whether you need the model to run on edge devices after training. Larger models learn faster from fewer examples but cost more to serve. Smaller models are cheaper to deploy but may need more training data to reach the same quality level.
| Model | Params | QLoRA VRAM | Min GPU | Best For |
|---|---|---|---|---|
| Gemma E2B | 2B | ~3 GB | RTX 3060 12GB | Edge, mobile, low-latency tasks |
| Gemma E4B | 4B | ~6 GB | RTX 3070 8GB | Balanced quality/speed, most tasks |
| Gemma 26B MoE | 26B | ~14-16 GB | RTX 4090 24GB | Complex reasoning, multi-step tasks |
| Gemma 31B Dense | 31B | ~16-22 GB | RTX 4090 / A6000 | Max quality, research |
Cost perspective: Full fine-tuning the 31B dense model requires roughly 250 GB of VRAM (4x A100 80GB). LoRA drops that to ~40 GB. QLoRA drops it to ~18 GB. A fine-tuned E4B can match a prompted 31B on specific tasks at roughly 7x lower inference cost.
If you are unsure, start with Gemma E4B. It offers the best balance between training cost, inference speed, and output quality. You can always scale up to 26B MoE if the E4B results plateau.
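The sizing table can be turned into a quick lookup. The sketch below uses the upper end of each QLoRA VRAM estimate from the table; the 2 GB headroom default and the function name `largest_model_that_fits` are assumptions of this example, not figures from the guide.

```python
# Upper-bound QLoRA VRAM estimates in GB, taken from the sizing table.
QLORA_VRAM_GB = {
    "Gemma E2B": 3,
    "Gemma E4B": 6,
    "Gemma 26B MoE": 16,
    "Gemma 31B Dense": 22,
}


def largest_model_that_fits(vram_gb: float, headroom_gb: float = 2.0):
    """Return the largest listed model whose estimate plus headroom fits, or None.

    headroom_gb is an assumed safety margin for activations and fragmentation.
    """
    fitting = [
        name for name, need in QLORA_VRAM_GB.items()
        if need + headroom_gb <= vram_gb
    ]
    return max(fitting, key=QLORA_VRAM_GB.get) if fitting else None


for vram in (8, 12, 24):
    print(vram, "GB ->", largest_model_that_fits(vram))
```

Treat the result as a starting point only: long sequences and larger batch sizes raise real memory use above these estimates.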
Environment Setup with Unsloth
Unsloth is the most popular fine-tuning framework for Gemma in 2026. It delivers roughly 2x faster training and 70% less memory usage through custom CUDA kernels, and it natively supports all current Gemma model sizes. The setup is a single pip install.
# Create a virtual environment (recommended)
python -m venv gemma-ft
source gemma-ft/bin/activate   # Linux/macOS
gemma-ft\Scripts\activate      # Windows

# Install Unsloth (pulls transformers, peft, trl, etc.)
pip install unsloth

# Log in to Hugging Face for gated model access
huggingface-cli login
Unsloth patches the standard Hugging Face training pipeline at the CUDA kernel level. You write standard transformers code and Unsloth automatically intercepts the compute-heavy operations. No API changes are needed beyond the initial model load call.
If you prefer the standard Hugging Face stack without Unsloth, install transformers, peft, trl, bitsandbytes, and datasets individually. The code examples in this guide will work with minor import changes.
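If you go the plain Hugging Face route, a quick check that every package is importable can save a failed run later. This is a small sketch using only the standard library; the helper name `missing_packages` is an assumption of this example.

```python
import importlib.util

# Packages named above for the plain Hugging Face stack.
REQUIRED = ["transformers", "peft", "trl", "bitsandbytes", "datasets"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install before training: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```

This only confirms that the packages are installed. It does not verify that their versions are compatible with each other or with your CUDA toolkit.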
Step 1: Load the Model in 4-Bit
The first step loads Gemma with 4-bit quantization active. Unsloth handles the bitsandbytes configuration internally, so you specify the quantization through its FastModel loader rather than configuring BitsAndBytesConfig manually.
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA: 4-bit base weights
)
The max_seq_length parameter caps the context window for training. Set it to the longest sequence in your training data plus a small buffer. Shorter sequences use less VRAM per batch, so do not set this to the model maximum unless your data requires it.
For standard LoRA (without quantization), change load_in_4bit to False. The rest of the pipeline stays identical.
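The advice on max_seq_length (longest training sequence plus a small buffer) can be sketched as a short helper. The token counts below are made up, and the buffer of 32 tokens and rounding to a multiple of 64 are assumptions of this example rather than requirements of Unsloth.

```python
import math


def suggest_max_seq_length(token_lengths, buffer=32, multiple=64):
    """Longest sequence plus a buffer, rounded up to the next multiple."""
    longest = max(token_lengths)
    return math.ceil((longest + buffer) / multiple) * multiple


# Hypothetical per-example token counts. In practice you would measure them,
# for example with len(tokenizer(text)["input_ids"]) on a Hugging Face tokenizer.
lengths = [212, 540, 1130, 987, 1416]

print(suggest_max_seq_length(lengths))
```

If a handful of outliers dominate the maximum, truncating or dropping them is often cheaper than raising the limit for every batch.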
Step 2: Attach LoRA Adapters
LoRA adapters inject trainable low-rank matrices into the model's linear layers. The key parameters are rank (the shared inner dimension of each pair of adapter matrices, which sets their capacity), alpha (a scaling factor, typically 2x rank), and target modules (which layers get adapters).
model = FastModel.get_peft_model(
    model,
    r=16,             # LoRA rank: 16-64
    lora_alpha=32,    # Alpha = 2x rank
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    use_gradient_checkpointing="unsloth",
)

Targeting all linear layers (attention projections plus MLP gates) gives the adapter maximum flexibility to learn your task. If VRAM is tight, you can drop gate_proj, up_proj, and down_proj to reduce memory at the cost of some adaptation quality.
Gradient checkpointing ("unsloth" mode) trades a small amount of compute time for significant VRAM savings by recomputing activations during the backward pass instead of storing them. Keep this enabled unless you have VRAM to spare.
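To show how rank and alpha interact, here is a toy sketch of the standard LoRA forward pass, y = W x + (alpha / r) * B (A x), written in plain Python. The 2x2 weights are made-up values for illustration, and the helper names are assumptions of this example. Real adapters operate on large tensors inside the model.

```python
def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); W is frozen, A and B are trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Toy 2x2 layer with rank 1. All values are made up for illustration.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (d_out x d_in)
A = [[0.5, 0.5]]               # trainable, rank x d_in
B = [[1.0], [-1.0]]            # trainable, d_out x rank
x = [2.0, 4.0]

y = lora_forward(W, A, B, x, alpha=2, rank=1)
print(y)
```

Because the update is scaled by alpha / r, keeping alpha at 2x rank keeps that scale constant when you change the rank, which is one reason the rule of thumb is convenient.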
Step 3: Prepare Your Data
Gemma uses a chat template with three roles: system, user, and model. Note that Gemma uses "model" instead of "assistant" for the response role. Getting this wrong means the model learns to produce output under the wrong role token, which degrades inference quality.
from datasets import load_dataset
from unsloth.chat_templates import standardize_data_formats

# Load your JSONL dataset
dataset = load_dataset("json", data_files="training_data.jsonl")

# Each row should follow this structure:
# {"conversations": [
#   {"role": "system", "content": "You are a medical coder."},
#   {"role": "user", "content": "Code this diagnosis: ..."},
#   {"role": "model", "content": "ICD-10: E11.9 ..."}
# ]}

# Normalize the conversation format of the train split
# (check the helper name against your installed Unsloth version)
dataset["train"] = standardize_data_formats(dataset["train"])
The amount of data you need depends on what you are teaching the model:
| Use Case | Examples Needed | Notes |
|---|---|---|
| Style transfer | 200 - 1,000 | Teach tone, format, brand voice |
| Task-specific | 500 - 5,000 | Classification, extraction, routing |
| Domain adaptation | 10,000 - 50,000 | Legal, medical, financial terminology |
| General instruction | 5,000 - 20,000 | Broad capability improvement |
Quality matters more than quantity. One hundred clean, well-structured examples will outperform a thousand noisy ones. Deduplicate your data, remove examples with conflicting labels, and verify that the "model" responses represent the exact output format you want in production.
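The cleanup steps above (deduplicate, check roles, confirm each example ends with a "model" response) can be automated before training. This is a minimal sketch using only the standard library; the function names and the in-memory sample rows are assumptions of this example, and it checks structure only, not label consistency.

```python
import json

ALLOWED_ROLES = {"system", "user", "model"}


def validate_row(row):
    """Return a list of problems found in one training row (empty if clean)."""
    turns = row.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["missing or empty 'conversations' list"]
    problems = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        if turn.get("role") not in ALLOWED_ROLES:
            problems.append(f"turn {i}: unexpected role {turn.get('role')!r}")
        if not str(turn.get("content", "")).strip():
            problems.append(f"turn {i}: empty content")
    last = turns[-1]
    if not isinstance(last, dict) or last.get("role") != "model":
        problems.append("last turn is not a 'model' response")
    return problems


def clean_jsonl_lines(lines):
    """Parse JSONL lines, dropping exact duplicates and structurally invalid rows."""
    seen, kept, rejected = set(), [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        row = json.loads(line)
        key = json.dumps(row, sort_keys=True)   # canonical form for duplicate detection
        if key in seen:
            continue
        seen.add(key)
        problems = validate_row(row)
        if problems:
            rejected.append((number, problems))
        else:
            kept.append(row)
    return kept, rejected


# Made-up sample: one clean row, one exact duplicate, one row with the wrong role.
good = '{"conversations": [{"role": "user", "content": "Hi"}, {"role": "model", "content": "Hello"}]}'
bad = '{"conversations": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}'
kept, rejected = clean_jsonl_lines([good, good, bad])
print(len(kept), "kept;", rejected)
```

To run it on a real file, pass an open file handle, for example `clean_jsonl_lines(open("training_data.jsonl", encoding="utf-8"))`.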
Step 4: Train
Training uses the standard Hugging Face SFTTrainer from the TRL library. Unsloth intercepts the underlying operations to apply its CUDA kernel optimizations. The configuration below works for most single-GPU QLoRA runs.
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    args=TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        warmup_ratio=0.05,
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        logging_steps=10,
        save_strategy="epoch",
        fp16=True,
        optim="adamw_8bit",
        seed=42,
    ),
    max_seq_length=2048,
)

trainer.train()
The adamw_8bit optimizer reduces the memory overhead of optimizer states. Gradient accumulation gives an effective batch size of 4 x 4 = 16 while holding only 4 sequences in memory at a time, so this configuration keeps the smoother gradients of a larger batch without exceeding a single GPU's VRAM.
Monitor the training loss. It should decrease steadily for the first epoch and then flatten. If the loss drops to near zero, you are likely overfitting. Reduce epochs or increase the dataset size. If it plateaus above 1.0 and never decreases, increase the learning rate or check your data formatting.
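It helps to know roughly how many optimizer steps a run will take before reading the loss curve. The sketch below estimates this from the configuration values used in this guide; the 1,000-example dataset size is an assumption, and the trainer's own rounding may differ by a step or two.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, epochs, warmup_ratio):
    """Approximate step counts for a single-GPU run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Values from the training config above; 1,000 examples is a made-up dataset size.
plan = training_schedule(
    num_examples=1000, per_device_batch=4, grad_accum=4, epochs=3, warmup_ratio=0.05
)
print(plan)
```

With logging_steps=10, a run this short produces fewer than twenty logged loss values, so judge the trend across the whole run instead of reacting to individual readings.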
Step 5: Export & Deploy
After training completes, you have three export options depending on your deployment target:
Option A: Save LoRA adapter only
The smallest export. Saves only the trained adapter weights (typically 50-200 MB). Requires the base model to be available at inference time.
model.save_pretrained("gemma-ft-adapter")
tokenizer.save_pretrained("gemma-ft-adapter")

# Upload to Hugging Face Hub (optional)
model.push_to_hub("your-org/gemma-ft-adapter")
Option B: Merge and save full model
Merges the adapter into the base weights, producing a standalone model. Larger file size but simpler deployment.
model.save_pretrained_merged(
    "gemma-ft-merged",
    tokenizer,
    save_method="merged_16bit",
)

Option C: Export to GGUF for Ollama / llama.cpp
Quantizes and exports to the GGUF format for local inference with Ollama or llama.cpp. This is the most popular deployment path for self-hosted models.
model.save_pretrained_gguf(
    "gemma-ft-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)

# Then in your terminal:
# ollama create my-gemma -f gemma-ft-gguf/Modelfile
# ollama run my-gemma

The q4_k_m quantization is the sweet spot between quality and file size for most use cases. For maximum quality, use q8_0. For minimum file size, use q4_0.
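Once the model is registered with Ollama, you can call it from application code through Ollama's local HTTP API. The sketch below assumes Ollama is running on its default port (11434) and that the model was created under the name my-gemma as in the commands above. The helper names are assumptions of this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_generate_request(model, prompt):
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the prompt to a running Ollama server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


# Example call (requires a running Ollama server with the model created):
# print(generate("my-gemma", "Code this diagnosis: type 2 diabetes without complications"))
```

Setting "stream" to False returns one JSON object instead of a stream of chunks, which keeps the client code short at the cost of waiting for the full completion.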
Hyperparameter Tuning Guide
The default values in the training code above work well for most cases. When you need to squeeze out more quality or troubleshoot training instability, these are the knobs to turn.
| Parameter | Range | Default | Impact |
|---|---|---|---|
| LoRA rank (r) | 16 - 64 | 16 | Higher = more capacity but more VRAM. Start low, increase if underfitting. |
| LoRA alpha | 2x rank | 32 | Scales the adapter contribution. Keep at 2x rank as a rule of thumb. |
| Learning rate | 1e-4 - 3e-4 | 2e-4 | Too high = unstable loss. Too low = slow convergence. |
| Epochs | 1 - 5 | 3 | More epochs = better fit, but overfitting risk above 5. |
| Batch size | Max that fits | 4 | Larger = smoother gradients. Use grad accumulation if VRAM-limited. |
| Warmup ratio | 0.05 - 0.1 | 0.05 | Prevents early training instability. Higher for small datasets. |
| Dropout | 0 - 0.1 | 0.05 | Regularization. Increase if overfitting, decrease if underfitting. |
One variable at a time. Change a single hyperparameter between runs and evaluate on a held-out validation set. Changing multiple parameters simultaneously makes it impossible to attribute improvements or regressions.
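The one-variable-at-a-time rule is easy to enforce by generating run configurations from a baseline. This is a minimal sketch: the baseline mirrors the defaults in the table, while the candidate values and the function name `one_at_a_time` are assumptions of this example.

```python
# Baseline taken from the defaults used in this guide.
BASELINE = {
    "r": 16,
    "lora_alpha": 32,
    "learning_rate": 2e-4,
    "num_train_epochs": 3,
    "warmup_ratio": 0.05,
    "lora_dropout": 0.05,
}


def one_at_a_time(baseline, candidates):
    """Yield (changed_key, config) pairs that differ from baseline in exactly one key."""
    for key, values in candidates.items():
        for value in values:
            if value != baseline[key]:
                yield key, {**baseline, key: value}


# Hypothetical candidate values inside the ranges listed in the table.
CANDIDATES = {
    "learning_rate": [1e-4, 2e-4, 3e-4],
    "num_train_epochs": [1, 3, 5],
}

runs = list(one_at_a_time(BASELINE, CANDIDATES))
for changed, config in runs:
    print(changed, "->", config[changed])
```

Note that varying the rank while keeping alpha at 2x rank changes two values at once, so treat the rank and alpha pair as a single knob when you sweep it.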
Troubleshooting Common Issues
These are the problems that surface most often during Gemma fine-tuning, along with their fixes.
CUDA out of memory
Reduce per_device_train_batch_size to 1 and increase gradient_accumulation_steps to compensate. If that is still too large, reduce max_seq_length. As a last resort, drop the MLP target modules from the LoRA config to reduce adapter memory.
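The first fix above keeps the effective batch size constant while cutting per-step memory. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def shrink_batch(per_device_batch, grad_accum, new_per_device_batch=1):
    """Return (batch, accumulation steps) that preserve the effective batch size."""
    effective = per_device_batch * grad_accum
    if effective % new_per_device_batch != 0:
        raise ValueError("effective batch size must be divisible by the new batch size")
    return new_per_device_batch, effective // new_per_device_batch


# Starting from the guide's config: batch 4 x accumulation 4 = effective 16.
batch, accum = shrink_batch(per_device_batch=4, grad_accum=4)
print(f"per_device_train_batch_size={batch}, gradient_accumulation_steps={accum}")
```

Training takes more steps of wall-clock time this way, but the optimizer sees the same number of examples per update.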
Loss not decreasing
Check your data formatting first. The most common cause is using "assistant" instead of "model" for the response role. Verify that your JSONL has the correct three-role structure (system, user, model). If the data is correct, increase the learning rate to 3e-4 or increase the LoRA rank.
Model outputs garbage after training
This usually means overfitting. Reduce epochs, add more training data, or increase dropout. Also check that the training data does not contain formatting artifacts (extra newlines, HTML tags, encoding issues) that the model memorized.
Unsloth installation fails
Unsloth requires specific CUDA and PyTorch version combinations. Check the Unsloth compatibility matrix for your CUDA version. The most common fix is installing the correct PyTorch nightly build before installing Unsloth.
FAQ
Can I fine-tune Gemma on Apple Silicon (M1/M2/M3)?
Unsloth requires NVIDIA CUDA, so it does not run on Apple Silicon directly. However, you can use the standard Hugging Face stack with MPS (Metal Performance Shaders) for small models like E2B. Training will be significantly slower than CUDA. For anything larger than E4B, use a cloud GPU (Colab, RunPod, Lambda) instead.
How long does fine-tuning take?
Highly variable. A typical QLoRA run with 1,000 examples on Gemma E4B using an RTX 4090 finishes in 15 to 30 minutes with Unsloth. The same run on an RTX 3070 takes roughly 45 to 90 minutes. Training time grows roughly linearly with dataset size and number of epochs, and it increases further with larger models.
Can I combine fine-tuning with RAG?
Yes, and this is often the ideal production setup. Fine-tune for consistent formatting, tone, and task behavior. Use RAG to inject up-to-date facts at inference time. The fine-tuned model learns how to use retrieved context effectively, while RAG prevents hallucination on factual queries.
Do I need to fine-tune again when a new Gemma version releases?
LoRA adapters are tied to the specific model architecture they were trained on. If the new version changes the layer dimensions or architecture, you need to retrain. If it is the same architecture with updated base weights, you can sometimes reuse the adapter, but retraining typically yields better results since the new base weights shift the loss landscape.