What is the difference between LoRA and QLoRA for fine-tuning Gemma?

LoRA freezes the base model weights and trains small adapter matrices in the attention layers, typically touching 0.1-1% of total parameters. QLoRA adds 4-bit NormalFloat quantization on top of LoRA, cutting VRAM usage by roughly 60% with less than 1% quality degradation. For most developers, QLoRA is the recommended approach because it delivers nearly identical results at a fraction of the hardware cost.

How much VRAM do I need to fine-tune Gemma with QLoRA?

VRAM requirements vary by model size. Gemma E2B needs about 3GB (RTX 3060), Gemma E4B needs about 6GB (RTX 3070), Gemma 26B MoE needs 14-16GB (RTX 4090), and Gemma 31B Dense needs 16-22GB (RTX 4090 or A6000). These are approximate and depend on batch size and sequence length.

How much training data do I need to fine-tune Gemma?

Data requirements depend on your goal. Style transfer needs 200-1,000 examples, task-specific training needs 500-5,000, domain adaptation needs 10,000-50,000, and general instruction tuning needs 5,000-20,000. Below 50 examples, few-shot prompting is more effective than fine-tuning.

What tools can I use to fine-tune Gemma?

Unsloth is the most popular framework for Gemma fine-tuning in 2026, offering 2x faster training and 70% less memory usage through custom CUDA kernels. Alternatives include HuggingFace Transformers with PEFT and TRL, Keras/KerasHub, Google Vertex AI, and Axolotl.

When should I NOT fine-tune Gemma?

Skip fine-tuning for general knowledge Q&A (the base model already handles this), for tasks requiring real-time information (use RAG instead), or when you have fewer than 50 training examples (use few-shot prompting). Fine-tuning shines when you need consistent formatting, domain-specific behavior, or task specialization.

Gemma Fine-Tuning Guide: LoRA, QLoRA & Deployment

Fine-tuning turns a general-purpose Gemma model into a specialist that follows your formatting rules, speaks your domain language, and handles your specific tasks without lengthy system prompts. This guide walks through the entire pipeline: picking the right model, configuring LoRA or QLoRA adapters, preparing your training data, running the training loop, and exporting a production-ready model for Ollama or llama.cpp. Every code block runs on a single consumer GPU.

26B MoE: runs on an RTX 4090 with QLoRA (source: Gemma Docs)

0.2%: parameters trained via LoRA adapters (source: PEFT Docs)

2x: faster training with Unsloth (source: Unsloth Docs)

200+: minimum examples for style transfer (source: Gemma Docs)

Prerequisites

Python 3.10 or newer

Tested with 3.10, 3.11, and 3.12. Avoid 3.13 until ecosystem libraries confirm support.

NVIDIA GPU with CUDA

Minimum RTX 3060 (12GB) for Gemma E2B/E4B. RTX 4090 or A6000 recommended for 26B+ models.

pip packages: unsloth, transformers, datasets, trl

Install via pip install unsloth. Unsloth pulls transformers, PEFT, and TRL automatically.

Hugging Face account with Gemma access

Accept the Gemma license at huggingface.co/google/gemma-3, then run huggingface-cli login.

Training data in JSONL or CSV format

Minimum 50 examples. Each row needs system/user/model turn structure. More detail in the data prep section.
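A quick preflight check can catch an unsupported interpreter before any packages are installed. The sketch below encodes only the version range listed above (3.10 through 3.12); the function name python_supported is hypothetical and not part of any library.

```python
import sys

# Version range taken from the prerequisites above: 3.10, 3.11, 3.12.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 12)


def python_supported(version=None):
    """Return True if the (major, minor) version falls inside the tested range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) <= MAX_VERSION


if __name__ == "__main__":
    status = "OK" if python_supported() else "outside the tested range"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

Run it inside the virtual environment you plan to train in, since that is the interpreter Unsloth will actually use.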

When to Fine-Tune (and When Not To)

Fine-tuning makes sense when you need the model to consistently produce a specific output format, adopt domain vocabulary, or perform a narrow task that prompting alone cannot reliably achieve. The most common wins include enforcing JSON schema compliance, adapting tone for customer support, and teaching industry-specific classification.

However, fine-tuning is the wrong tool for several common scenarios. If the task requires knowledge that changes frequently, retrieval-augmented generation (RAG) is a better fit because you can update the knowledge base without retraining. If you have fewer than 50 quality examples, few-shot prompting will outperform a fine-tune that overfits on sparse data. And if the base model already handles your query well with the right prompt, the engineering cost of maintaining a fine-tuned model is not justified.

Real-time information: Stock prices, news, weather, or anything that changes daily. Use RAG with a live data source instead of baking stale facts into model weights.

Fewer than 50 examples: The model will memorize your training set instead of generalizing. Use few-shot prompting or chain-of-thought techniques until you accumulate enough data.

General knowledge Q&A: The base Gemma models already excel at open-ended question answering. Fine-tuning for general knowledge risks catastrophic forgetting of built-in capabilities.

Rule of thumb: Fine-tune when you need consistent behavior on a specific task. Use RAG when you need fresh facts. Use prompting when the base model gets close enough.
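The rule of thumb can be written down as a small decision function. This is a minimal sketch, assuming the 50-example threshold stated in this guide; the function name and its three inputs are hypothetical simplifications of what is, in practice, a judgment call.

```python
def choose_approach(num_examples, needs_fresh_facts, prompt_is_good_enough):
    """Return 'rag', 'prompting', or 'fine-tune' following the rule of thumb."""
    if needs_fresh_facts:
        # Frequently changing knowledge belongs in a retrieval layer.
        return "rag"
    if prompt_is_good_enough or num_examples < 50:
        # Too little data to generalize, or no need to train at all.
        return "prompting"
    return "fine-tune"


print(choose_approach(num_examples=800, needs_fresh_facts=False,
                      prompt_is_good_enough=False))
```

The function returns a single answer for simplicity. A task that needs both fresh facts and strict formatting can reasonably combine RAG with a fine-tuned model.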

LoRA vs QLoRA Explained

Both LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) avoid updating the full model weights. Instead, they freeze the base model and inject small trainable matrices into its linear layers (the attention projections and, optionally, the MLP blocks). The difference is how they handle the frozen weights.

LoRA keeps the base model in its original precision (typically FP16 or BF16). It trains adapter matrices that represent only 0.1% to 1% of the total parameter count. The adapters themselves are small and fast to train, but the full-precision base model still occupies significant VRAM.

QLoRA adds 4-bit NormalFloat quantization to the base weights before attaching the same LoRA adapters. This cuts VRAM usage by roughly 60% compared to standard LoRA while introducing less than 1% quality degradation on most benchmarks. The adapters still train in 16-bit, so gradient computation remains stable.

Aspect	LoRA	QLoRA
Base weights	FP16 / BF16	4-bit NF4
Adapter precision	FP16	FP16 (same)
VRAM for 31B	~40 GB	~18 GB
Quality vs full FT	<0.5% loss	<1% loss
Training speed	Baseline	Slightly slower (dequant overhead)
Best for	Max quality, multi-GPU setups	Single-GPU, cost-sensitive

60%: the approximate VRAM reduction QLoRA achieves compared to standard LoRA, making it practical to fine-tune 26B+ parameter models on a single consumer GPU.

For the vast majority of use cases, QLoRA is the right choice. The quality difference is negligible for task-specific fine-tuning, and the hardware savings are substantial. This guide defaults to QLoRA in all code examples. If you have access to multi-GPU servers and need every fraction of a percent of quality, switch to standard LoRA by removing the quantization config.
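The arithmetic behind the savings is easy to check for the frozen weights alone. The sketch below assumes 2 bytes per parameter for BF16 and 0.5 bytes for 4-bit NF4, ignoring the small overhead of quantization constants; the 4-billion parameter count is just an illustrative input.

```python
# Approximate bytes per parameter for the frozen base weights.
# NF4 is treated as exactly 4 bits; real usage is slightly higher.
BYTES_PER_PARAM = {"bf16": 2.0, "nf4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Memory needed for the base weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


params = 4e9  # illustrative model size
lora_gb = weight_memory_gb(params, "bf16")
qlora_gb = weight_memory_gb(params, "nf4")
saving = 1 - qlora_gb / lora_gb

print(f"LoRA weights:  {lora_gb:.1f} GB")
print(f"QLoRA weights: {qlora_gb:.1f} GB")
print(f"Weight-only saving: {saving:.0%}")
```

The weights alone shrink by 75%. The end-to-end reduction quoted above is smaller because adapters, optimizer state, and activations still live in 16-bit memory and are not quantized.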

Choose Your Model

Gemma ships in multiple sizes. Picking the right one depends on your VRAM budget and whether you need the model to run on edge devices after training. Larger models learn faster from fewer examples but cost more to serve. Smaller models are cheaper to deploy but may need more training data to reach the same quality level.

Model	Params	QLoRA VRAM	Min GPU	Best For
Gemma E2B	2B	~3 GB	RTX 3060 12GB	Edge, mobile, low-latency tasks
Gemma E4B	4B	~6 GB	RTX 3070 8GB	Balanced quality/speed, most tasks
Gemma 26B MoE	26B	~14-16 GB	RTX 4090 24GB	Complex reasoning, multi-step tasks
Gemma 31B Dense	31B	~16-22 GB	RTX 4090 / A6000	Max quality, research

Cost perspective: Fully fine-tuning the 31B dense model requires roughly 250 GB of VRAM (4x A100 80GB). LoRA drops that to ~40 GB. QLoRA drops it to ~18 GB. A fine-tuned E4B can match a prompted 31B on specific tasks at roughly 7x lower inference cost.

If you are unsure, start with Gemma E4B. It offers the best balance between training cost, inference speed, and output quality. You can always scale up to 26B MoE if the E4B results plateau.
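To turn the table into a quick lookup, the sketch below lists the models whose QLoRA estimate fits a given card. The VRAM figures are the upper bounds from the table above; the 1 GB headroom default is an assumption to leave room for the CUDA context and other processes.

```python
# Upper-bound QLoRA VRAM estimates (GB), copied from the model table above.
QLORA_VRAM_GB = {
    "Gemma E2B": 3,
    "Gemma E4B": 6,
    "Gemma 26B MoE": 16,
    "Gemma 31B Dense": 22,
}


def models_that_fit(gpu_vram_gb, headroom_gb=1.0):
    """Return the models whose estimated QLoRA footprint fits the GPU."""
    budget = gpu_vram_gb - headroom_gb
    return [name for name, need in QLORA_VRAM_GB.items() if need <= budget]


print(models_that_fit(8))   # an 8 GB card
print(models_that_fit(24))  # a 24 GB card
```

Treat the result as a first filter only. As noted earlier, actual usage moves with batch size and sequence length.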

Environment Setup with Unsloth

Unsloth is the most popular fine-tuning framework for Gemma in 2026. It delivers roughly 2x faster training and 70% less memory usage through custom CUDA kernels, and it natively supports all current Gemma model sizes. The setup is a single pip install.

Terminal

# Create a virtual environment (recommended)
python -m venv gemma-ft
source gemma-ft/bin/activate      # Linux/macOS
# gemma-ft\Scripts\activate       # Windows (run this instead of the line above)

# Install Unsloth (pulls transformers, peft, trl, etc.)
pip install unsloth

# Log in to Hugging Face for gated model access
huggingface-cli login

Unsloth patches the standard Hugging Face training pipeline at the CUDA kernel level. You write standard transformers code and Unsloth automatically intercepts the compute-heavy operations. No API changes are needed beyond the initial model load call.

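If you prefer the standard Hugging Face stack without Unsloth, install transformers, peft, trl, bitsandbytes, and datasets individually. The training configuration in this guide carries over, but the Unsloth-specific calls (the FastModel loader and the merged and GGUF export helpers) need their transformers and peft equivalents.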

Step 1: Load the Model in 4-Bit

The first step loads Gemma with 4-bit quantization active. Unsloth handles the bitsandbytes configuration internally, so you specify the quantization through its FastModel loader rather than configuring BitsAndBytesConfig manually.

Python

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it",
    max_seq_length=2048,
    load_in_4bit=True,   # QLoRA: 4-bit base weights
)

The max_seq_length parameter caps the context window for training. Set it to the longest sequence in your training data plus a small buffer. Shorter sequences use less VRAM per batch, so do not set this to the model maximum unless your data requires it.
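One way to pick the value is to measure your data first. The sketch below assumes you already have a list of per-example token counts (for instance from len(tokenizer(text)["input_ids"])); the buffer, the rounding multiple, and the model_max cap are hypothetical defaults to adjust for your model.

```python
import math


def suggest_max_seq_length(token_lengths, buffer=64, multiple=64, model_max=8192):
    """Longest example plus a buffer, rounded up to a multiple, capped at model_max."""
    longest = max(token_lengths)
    rounded = math.ceil((longest + buffer) / multiple) * multiple
    return min(rounded, model_max)


# Hypothetical token counts for three training examples.
lengths = [120, 900, 1500]
print(suggest_max_seq_length(lengths))
```

If a handful of outliers are far longer than the rest, it is often cheaper to truncate or drop them than to size every batch for the longest example.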

For standard LoRA (without quantization), change load_in_4bit to False. The rest of the pipeline stays identical.

Step 2: Attach LoRA Adapters

LoRA adapters inject trainable low-rank matrices into the model's linear layers. The key parameters are rank (the inner dimension shared by the two adapter matrices), alpha (a scaling factor, typically 2x rank), and target modules (which layers get adapters).

Python

model = FastModel.get_peft_model(
    model,
    r=16,                     # LoRA rank: 16-64
    lora_alpha=32,             # Alpha = 2x rank
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    use_gradient_checkpointing="unsloth",
)

Targeting all linear layers (the attention projections plus the MLP gate, up, and down projections) gives the adapter maximum flexibility to learn your task. If VRAM is tight, you can drop gate_proj, up_proj, and down_proj to reduce memory at the cost of some adaptation quality.

With rank 16 targeting all linear layers, LoRA trains roughly 0.2% of the model's total parameters while leaving the remaining 99.8% frozen.

Gradient checkpointing ("unsloth" mode) trades a small amount of compute time for significant VRAM savings by recomputing activations during the backward pass instead of storing them. Keep this enabled unless you have VRAM to spare.
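To see why the trainable fraction is so small, count the parameters directly. For a linear layer with d_in inputs and d_out outputs, a rank-r adapter adds r x (d_in + d_out) parameters. The layer dimensions below are hypothetical round numbers, not the dimensions of any Gemma model, and the sketch ignores embeddings and grouped-query attention.

```python
def lora_params(d_in, d_out, rank):
    """Adapter size for one linear layer: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical transformer dimensions, chosen only for illustration.
HIDDEN, INTERMEDIATE, LAYERS, RANK = 4096, 16384, 32, 16

# (d_in, d_out) for each targeted projection in a single layer.
layer_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

base = LAYERS * sum(d_in * d_out for d_in, d_out in layer_shapes.values())
adapter = LAYERS * sum(lora_params(d_in, d_out, RANK)
                       for d_in, d_out in layer_shapes.values())
fraction = adapter / base

print(f"{adapter:,} adapter parameters, {fraction:.2%} of the targeted weights")
```

These made-up dimensions give roughly half a percent. Real models land lower once embeddings are counted in the total and the key and value projections are narrower, which is consistent with the 0.1% to 1% range quoted earlier.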

Step 3: Prepare Your Data

Gemma uses a chat template with three roles: system, user, and model. Note that Gemma uses "model" instead of "assistant" for the response role. Getting this wrong means the model learns to produce output under the wrong role token, which degrades inference quality.

Python

from datasets import load_dataset
from unsloth.chat_templates import standardize_data_formats

# Load your JSONL dataset (returns a DatasetDict with a "train" split)
dataset = load_dataset("json", data_files="training_data.jsonl")

# Each row should follow this structure:
# {"conversations": [
#   {"role": "system", "content": "You are a medical coder."},
#   {"role": "user", "content": "Code this diagnosis: ..."},
#   {"role": "model", "content": "ICD-10: E11.9 ..."}
# ]}

# Normalize conversation keys to role/content. The helper name reflects
# recent Unsloth releases; verify it against your installed version.
dataset["train"] = standardize_data_formats(dataset["train"])

The amount of data you need depends on what you are teaching the model:

Use Case	Examples Needed	Notes
Style transfer	200 - 1,000	Teach tone, format, brand voice
Task-specific	500 - 5,000	Classification, extraction, routing
Domain adaptation	10,000 - 50,000	Legal, medical, financial terminology
General instruction	5,000 - 20,000	Broad capability improvement

Quality matters more than quantity. One hundred clean, well-structured examples will outperform a thousand noisy ones. Deduplicate your data, remove examples with conflicting labels, and verify that the "model" responses represent the exact output format you want in production.
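A small validation pass catches most formatting problems before they cost you a training run. The sketch below assumes the conversations structure and the three role names described in this section; validate_row and load_clean_rows are hypothetical helper names, and the duplicate check only catches exact copies.

```python
import json

ALLOWED_ROLES = {"system", "user", "model"}


def validate_row(row):
    """Return a list of problems found in one training row (empty if clean)."""
    turns = row.get("conversations") or []
    if not turns:
        return ["missing or empty 'conversations'"]
    problems = []
    for i, turn in enumerate(turns):
        role = turn.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"turn {i}: unexpected role {role!r}")
        if not str(turn.get("content", "")).strip():
            problems.append(f"turn {i}: empty content")
    if turns[-1].get("role") != "model":
        problems.append("last turn is not a 'model' response")
    return problems


def load_clean_rows(lines):
    """Parse JSONL lines, dropping invalid rows and exact duplicates."""
    seen, clean = set(), []
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        key = json.dumps(row, sort_keys=True)
        if validate_row(row) or key in seen:
            continue
        seen.add(key)
        clean.append(row)
    return clean
```

A typical use is to open training_data.jsonl, pass the file object to load_clean_rows, and compare the number of surviving rows with the original line count before training.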

Step 4: Train

Training uses the standard Hugging Face SFTTrainer from the TRL library. Unsloth intercepts the underlying operations to apply its CUDA kernel optimizations. The configuration below works for most single-GPU QLoRA runs.

Python

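from unsloth import is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    args=TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,   # effective batch size = 4 x 4 = 16
        num_train_epochs=3,
        learning_rate=2e-4,
        warmup_ratio=0.05,
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        logging_steps=10,
        save_strategy="epoch",
        # Use bf16 where the GPU supports it (Ampere and newer), otherwise fp16
        bf16=is_bfloat16_supported(),
        fp16=not is_bfloat16_supported(),
        optim="adamw_8bit",
        seed=42,
    ),
    # Match the value used when loading the model. Newer TRL releases
    # expect this setting inside SFTConfig instead of as a trainer argument.
    max_seq_length=2048,
)

trainer.train()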

The adamw_8bit optimizer stores optimizer states in 8-bit, which reduces their memory overhead. Gradient accumulation gives an effective batch size of 4 x 4 = 16 while only four examples are held in memory at a time, which suits a single GPU.

Monitor the training loss. It should decrease steadily for the first epoch and then flatten. If the loss drops to near zero, you are likely overfitting. Reduce epochs or increase the dataset size. If it plateaus above 1.0 and never decreases, increase the learning rate or check your data formatting.
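It helps to know how many optimizer steps a run will take before reading the loss curve. The sketch below derives the step counts from the settings used above. It is an approximation: the trainer's own rounding can differ by a step or so, and the function name is hypothetical.

```python
import math


def training_schedule(num_examples, batch_size=4, grad_accum=4,
                      epochs=3, warmup_ratio=0.05):
    """Approximate optimizer step counts for a single-GPU run."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


print(training_schedule(1000))
```

With 1,000 examples this gives about 63 steps per epoch and 189 in total, so logging_steps=10 yields roughly 19 loss readings over the whole run. On very small datasets, lower logging_steps so the curve has enough points to interpret.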

Step 5: Export & Deploy

After training completes, you have three export options depending on your deployment target:

Option A: Save LoRA adapter only

The smallest export. Saves only the trained adapter weights (typically 50-200 MB). Requires the base model to be available at inference time.

Python

model.save_pretrained("gemma-ft-adapter")
tokenizer.save_pretrained("gemma-ft-adapter")
# Upload to Hugging Face Hub (optional)
model.push_to_hub("your-org/gemma-ft-adapter")

Option B: Merge and save full model

Merges the adapter into the base weights, producing a standalone model. Larger file size but simpler deployment.

Python

model.save_pretrained_merged(
    "gemma-ft-merged",
    tokenizer,
    save_method="merged_16bit",
)

Option C: Export to GGUF for Ollama / llama.cpp

Quantizes and exports to the GGUF format for local inference with Ollama or llama.cpp. This is the most popular deployment path for self-hosted models.

Python + Terminal

model.save_pretrained_gguf(
    "gemma-ft-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)
# Then in your terminal:
# ollama create my-gemma -f gemma-ft-gguf/Modelfile
# ollama run my-gemma

The q4_k_m quantization is the sweet spot between quality and file size for most use cases. For maximum quality, use q8_0. For minimum file size, use q4_0.
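Once the model is registered with Ollama, you can call it from application code over Ollama's local HTTP API. This is a minimal sketch using only the standard library. It assumes Ollama is running on its default port (11434) and that the model was created under the name my-gemma as in the commands above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt, model="my-gemma"):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="my-gemma", timeout=120):
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("Code this diagnosis: ...") returns the model's reply as a string. Setting stream to False keeps the example simple; production code would usually stream tokens instead.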

Hyperparameter Tuning Guide

The default values in the training code above work well for most cases. When you need to squeeze out more quality or troubleshoot training instability, these are the knobs to turn.

Parameter	Range	Default	Impact
LoRA rank (r)	16 - 64	16	Higher = more capacity but more VRAM. Start low, increase if underfitting.
LoRA alpha	2x rank	32	Scales the adapter contribution. Keep at 2x rank as a rule of thumb.
Learning rate	1e-4 - 3e-4	2e-4	Too high = unstable loss. Too low = slow convergence.
Epochs	1 - 5	3	More epochs = better fit, but overfitting risk above 5.
Batch size	Max that fits	4	Larger = smoother gradients. Use grad accumulation if VRAM-limited.
Warmup ratio	0.05 - 0.1	0.05	Prevents early training instability. Higher for small datasets.
Dropout	0 - 0.1	0.05	Regularization. Increase if overfitting, decrease if underfitting.

One variable at a time. Change a single hyperparameter between runs and evaluate on a held-out validation set. Changing multiple parameters simultaneously makes it impossible to attribute improvements or regressions.
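The one-variable-at-a-time rule is easy to automate. The sketch below generates a run list from the baseline values in the table above; the candidate values are illustrative picks from the listed ranges. The one deliberate exception is that alpha follows rank, to keep the 2x rule of thumb.

```python
# Baseline values from the hyperparameter table above.
BASELINE = {
    "r": 16,
    "lora_alpha": 32,
    "learning_rate": 2e-4,
    "num_train_epochs": 3,
    "lora_dropout": 0.05,
}

# Illustrative alternatives, each inside the ranges listed in the table.
CANDIDATES = {
    "r": [32, 64],
    "learning_rate": [1e-4, 3e-4],
    "lora_dropout": [0.0, 0.1],
}


def one_at_a_time(baseline, candidates):
    """Return (label, config) pairs that each change one knob from the baseline."""
    runs = [("baseline", dict(baseline))]
    for name, values in candidates.items():
        for value in values:
            config = dict(baseline)
            config[name] = value
            if name == "r":
                config["lora_alpha"] = 2 * value  # keep alpha at 2x rank
            runs.append((f"{name}={value}", config))
    return runs


for label, config in one_at_a_time(BASELINE, CANDIDATES):
    print(label, config)
```

Each config can be unpacked into the adapter and trainer arguments for a separate run, with every run scored on the same held-out validation set.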

Troubleshooting Common Issues

These are the problems that surface most often during Gemma fine-tuning, along with their fixes.

CUDA out of memory

Reduce per_device_train_batch_size to 1 and increase gradient_accumulation_steps to compensate. If that is still too large, reduce max_seq_length. As a last resort, drop the MLP target modules from the LoRA config to reduce adapter memory.
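The batch-size fallback can be planned ahead of time so the effective batch size stays constant as you back off. This is a sketch with a hypothetical helper name. When the batch size does not divide the effective batch evenly, accumulation is rounded up, so the effective batch may grow slightly.

```python
import math


def batch_fallbacks(batch_size=4, grad_accum=4):
    """List (batch_size, grad_accum) pairs to try in order after an OOM error."""
    effective = batch_size * grad_accum
    ladder = []
    while batch_size >= 1:
        # Raise accumulation so batch_size * grad_accum stays at or above the target.
        ladder.append((batch_size, math.ceil(effective / batch_size)))
        batch_size //= 2
    return ladder


print(batch_fallbacks(4, 4))
```

Try the pairs from left to right and stop at the first one that fits. If even a batch size of 1 runs out of memory, move on to reducing max_seq_length as described above.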

Loss not decreasing

Check your data formatting first. The most common cause is using "assistant" instead of "model" for the response role. Verify that your JSONL has the correct three-role structure (system, user, model). If the data is correct, increase the learning rate to 3e-4 or increase the LoRA rank.

Model outputs garbage after training

This usually means overfitting. Reduce epochs, add more training data, or increase dropout. Also check that the training data does not contain formatting artifacts (extra newlines, HTML tags, encoding issues) that the model memorized.

Role name reminder: Gemma expects the response role to be "model," not "assistant." Training with the wrong role token will produce a model that generates text under the wrong special token, degrading output quality at inference.

Unsloth installation fails

Unsloth requires specific CUDA and PyTorch version combinations. Check the Unsloth installation documentation for the combination that matches your CUDA version. A common fix is to install a PyTorch build that matches your CUDA toolkit before installing Unsloth.

FAQ

Can I fine-tune Gemma on Apple Silicon (M1/M2/M3)?

Unsloth requires NVIDIA CUDA, so it does not run on Apple Silicon directly. However, you can use the standard Hugging Face stack with MPS (Metal Performance Shaders) for small models like E2B. Training will be significantly slower than CUDA. For anything larger than E4B, use a cloud GPU (Colab, RunPod, Lambda) instead.

How long does fine-tuning take?

Highly variable. A typical QLoRA run with 1,000 examples on Gemma E4B using an RTX 4090 finishes in 15 to 30 minutes with Unsloth. The same run on an RTX 3070 takes roughly 45 to 90 minutes. Training time grows roughly linearly with dataset size and increases with model size.

Can I combine fine-tuning with RAG?

Yes, and this is often the ideal production setup. Fine-tune for consistent formatting, tone, and task behavior. Use RAG to inject up-to-date facts at inference time. The fine-tuned model learns how to use retrieved context effectively, while RAG prevents hallucination on factual queries.

Do I need to fine-tune again when a new Gemma version releases?

LoRA adapters are tied to the specific model architecture they were trained on. If the new version changes the layer dimensions or architecture, you need to retrain. If it is the same architecture with updated base weights, you can sometimes reuse the adapter, but retraining typically yields better results since the new base weights shift the loss landscape.

Fact-checked against vendor documentation and official sources, May 2026

Gemma is a trademark of Google LLC. Hugging Face is a trademark of Hugging Face, Inc. Unsloth is a trademark of Unsloth AI. NVIDIA, CUDA, RTX, and A6000 are trademarks of NVIDIA Corporation. Ollama is a trademark of Ollama, Inc. All trademarks belong to their respective owners. Tech Jacks Solutions is not affiliated with or endorsed by any of these companies.
