Gemma Fine-Tuning Guide: LoRA, QLoRA & Deployment

Fine-tuning turns a general-purpose Gemma model into a specialist that follows your formatting rules, speaks your domain language, and handles your specific tasks without lengthy system prompts. This guide walks through the entire pipeline: picking the right model, configuring LoRA or QLoRA adapters, preparing your training data, running the training loop, and exporting a production-ready model for Ollama or llama.cpp. Every code block runs on a single consumer GPU.


26B MoE: runs on an RTX 4090 with QLoRA
0.2%: share of parameters trained via LoRA adapters
2x: faster training with Unsloth
200+: minimum examples for style transfer

Prerequisites

  • Python 3.10 or newer. Tested with 3.10, 3.11, and 3.12. Avoid 3.13 until ecosystem libraries confirm support.
  • NVIDIA GPU with CUDA. Minimum RTX 3060 (12GB) for Gemma E2B/E4B. RTX 4090 or A6000 recommended for 26B+ models.
  • pip packages: unsloth, transformers, datasets, trl. Install via pip install unsloth. Unsloth pulls transformers, PEFT, and TRL automatically.
  • Hugging Face account with Gemma access. Accept the Gemma license at huggingface.co/google/gemma-3, then run huggingface-cli login.
  • Training data in JSONL or CSV format. Minimum 50 examples. Each row needs system/user/model turn structure. More detail in the data prep section.

When to Fine-Tune (and When Not To)

Fine-tuning makes sense when you need the model to consistently produce a specific output format, adopt domain vocabulary, or perform a narrow task that prompting alone cannot reliably achieve. The most common wins include enforcing JSON schema compliance, adapting tone for customer support, and teaching industry-specific classification.

However, fine-tuning is the wrong tool for several common scenarios. If the task requires knowledge that changes frequently, retrieval-augmented generation (RAG) is a better fit because you can update the knowledge base without retraining. If you have fewer than 50 quality examples, few-shot prompting will outperform a fine-tune that overfits on sparse data. And if the base model already handles your query well with the right prompt, the engineering cost of maintaining a fine-tuned model is not justified.

Skip fine-tuning for real-time info
Stock prices, news, weather, or anything that changes daily. Use RAG with a live data source instead of baking stale facts into model weights.
Skip fine-tuning with fewer than 50 examples
The model will memorize your training set instead of generalizing. Use few-shot prompting or chain-of-thought techniques until you accumulate enough data.
Skip fine-tuning for general knowledge Q&A
The base Gemma models already excel at open-ended question answering. Fine-tuning for general knowledge risks catastrophic forgetting of built-in capabilities.

Rule of thumb: Fine-tune when you need consistent behavior on a specific task. Use RAG when you need fresh facts. Use prompting when the base model gets close enough.
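
The rule of thumb can be written down as a small decision helper. This is only a sketch: the function name and its three inputs are hypothetical, and the 50-example threshold is the one quoted in this section.

```python
def choose_approach(needs_fresh_facts: bool,
                    num_examples: int,
                    prompt_is_good_enough: bool) -> str:
    """Return "rag", "prompting", or "fine-tune" following the rule of thumb."""
    if needs_fresh_facts:
        return "rag"          # facts change too often to bake into weights
    if prompt_is_good_enough:
        return "prompting"    # the base model already gets close enough
    if num_examples < 50:
        return "prompting"    # too little data: few-shot until more accumulates
    return "fine-tune"        # stable task, enough data, prompting falls short


if __name__ == "__main__":
    print(choose_approach(needs_fresh_facts=False,
                          num_examples=800,
                          prompt_is_good_enough=False))
```

Real projects often land on a mix (for example fine-tuning plus RAG), so treat the return value as a starting point rather than a verdict.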


LoRA vs QLoRA Explained

Both LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) avoid updating the full model weights. Instead, they freeze the base model and inject small trainable matrices into selected linear layers (the attention projections and, optionally, the MLP projections). The difference is how they handle the frozen weights.

LoRA keeps the base model in its original precision (typically FP16 or BF16). It trains adapter matrices that represent only 0.1% to 1% of the total parameter count. The adapters themselves are small and fast to train, but the full-precision base model still occupies significant VRAM.
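
To make the adapter idea concrete, here is a minimal pure-Python sketch of the low-rank update, assuming the usual LoRA formulation W' = W + (alpha / r) * B * A, where W is frozen and only A and B are trained. The helper names and the tiny matrices are illustrative, not taken from any library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A).

    W: frozen weight, shape d_out x d_in
    B: trainable,     shape d_out x r
    A: trainable,     shape r x d_in
    """
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


if __name__ == "__main__":
    W = [[1, 0, 0],
         [0, 1, 0]]            # frozen 2 x 3 weight
    B = [[1], [2]]             # 2 x 1 (rank r = 1)
    A = [[0.5, 0, -0.5]]       # 1 x 3
    print(lora_effective_weight(W, A, B, alpha=2, r=1))
```

B is normally initialized to zeros, so training starts from exactly the base model's behavior and the adapter only gradually shifts it.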

QLoRA adds 4-bit NormalFloat quantization to the base weights before attaching the same LoRA adapters. This cuts VRAM usage by roughly 60% compared to standard LoRA while introducing less than 1% quality degradation on most benchmarks. The adapters still train in 16-bit, so gradient computation remains stable.
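
The sketch below illustrates the general idea of storing a block of weights as 4-bit codes plus one scale. It deliberately uses a simplified uniform 16-level grid with absmax scaling, not the actual NormalFloat (NF4) codebook, so it shows the storage pattern rather than the exact QLoRA numerics. All function names are hypothetical.

```python
def quantize_block(values, levels=16):
    """Map one block of floats to integer codes in [0, levels - 1].

    Simplified uniform grid with absmax scaling (NOT the real NF4 codebook).
    """
    absmax = max(abs(v) for v in values) or 1.0   # avoid dividing by zero
    half = (levels - 1) / 2                       # 7.5 for 16 levels
    codes = [round((v / absmax) * half + half) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Recover approximate floats from the codes and the block scale."""
    half = (levels - 1) / 2
    return [((c - half) / half) * absmax for c in codes]


if __name__ == "__main__":
    weights = [0.12, -0.40, 0.03, 0.31]
    codes, scale = quantize_block(weights)
    restored = dequantize_block(codes, scale)
    for w, c, r in zip(weights, codes, restored):
        print(f"{w:+.3f} -> code {c:2d} -> {r:+.3f}")
```

Each weight now costs 4 bits plus a share of the per-block scale, which is where the memory saving comes from. The frozen weights are dequantized on the fly for each forward pass, while the adapters stay in 16-bit.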

Aspect | LoRA | QLoRA
Base weights | FP16 / BF16 | 4-bit NF4
Adapter precision | FP16 | FP16 (same)
VRAM for 31B | ~40 GB | ~18 GB
Quality vs full FT | <0.5% loss | <1% loss
Training speed | Baseline | Slightly slower (dequant overhead)
Best for | Max quality, multi-GPU setups | Single-GPU, cost-sensitive

QLoRA reduces VRAM usage by roughly 60% compared to standard LoRA, making it practical to fine-tune 26B+ parameter models on a single consumer GPU.

For the vast majority of use cases, QLoRA is the right choice. The quality difference is negligible for task-specific fine-tuning, and the hardware savings are substantial. This guide defaults to QLoRA in all code examples. If you have access to multi-GPU servers and need every fraction of a percent of quality, switch to standard LoRA by removing the quantization config.


Choose Your Model

Gemma ships in multiple sizes. Picking the right one depends on your VRAM budget and whether you need the model to run on edge devices after training. Larger models learn faster from fewer examples but cost more to serve. Smaller models are cheaper to deploy but may need more training data to reach the same quality level.

Model | Params | QLoRA VRAM | Min GPU | Best For
Gemma E2B | 2B | ~3 GB | RTX 3060 12GB | Edge, mobile, low-latency tasks
Gemma E4B | 4B | ~6 GB | RTX 3070 8GB | Balanced quality/speed, most tasks
Gemma 26B MoE | 26B | ~14-16 GB | RTX 4090 24GB | Complex reasoning, multi-step tasks
Gemma 31B Dense | 31B | ~16-22 GB | RTX 4090 / A6000 | Max quality, research

Cost perspective: Full fine-tuning the 31B dense model requires roughly 250 GB of VRAM (4x A100 80GB). LoRA drops that to ~40 GB. QLoRA drops it to ~18 GB. A fine-tuned E4B can match a prompted 31B on specific tasks at roughly 7x lower inference cost.

If you are unsure, start with Gemma E4B. It offers the best balance between training cost, inference speed, and output quality. You can always scale up to 26B MoE if the E4B results plateau.
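
As a rough way to apply the table above, the following sketch picks the largest model whose QLoRA footprint fits a given card. The VRAM figures are the upper ends of the ranges quoted in the table. The function name and the 2 GB headroom default are assumptions added for illustration.

```python
# (model name, upper end of the QLoRA VRAM range from the table, in GB)
QLORA_VRAM_GB = [
    ("Gemma E2B", 3),
    ("Gemma E4B", 6),
    ("Gemma 26B MoE", 16),
    ("Gemma 31B Dense", 22),
]


def largest_model_that_fits(vram_gb, headroom_gb=2.0):
    """Return the largest listed model that fits, or None if none do.

    headroom_gb is a hypothetical safety margin for activations,
    the CUDA context, and other processes using the GPU.
    """
    budget = vram_gb - headroom_gb
    fitting = [name for name, need in QLORA_VRAM_GB if need <= budget]
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    for card_gb in (8, 12, 24):
        print(card_gb, "GB ->", largest_model_that_fits(card_gb))
```

Actual usage also depends on sequence length and batch size, so a model that fits on paper can still run out of memory with long inputs.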


Environment Setup with Unsloth

Unsloth is one of the most popular fine-tuning frameworks for Gemma in 2026. It delivers roughly 2x faster training and 70% less memory usage through custom CUDA kernels, and it natively supports all current Gemma model sizes. The setup is a single pip install.

Terminal
# Create a virtual environment (recommended)
python -m venv gemma-ft
source gemma-ft/bin/activate      # Linux/macOS
# gemma-ft\Scripts\activate       # Windows: run this instead of the line above

# Install Unsloth (pulls transformers, peft, trl, etc.)
pip install unsloth

# Log in to Hugging Face for gated model access
huggingface-cli login

Unsloth patches the standard Hugging Face training pipeline at the CUDA kernel level. You write standard transformers code and Unsloth automatically intercepts the compute-heavy operations. No API changes are needed beyond the initial model load call.

If you prefer the standard Hugging Face stack without Unsloth, install transformers, peft, trl, bitsandbytes, and datasets individually. The code examples in this guide will work with minor import changes.


Step 1: Load the Model in 4-Bit

The first step loads Gemma with 4-bit quantization active. Unsloth handles the bitsandbytes configuration internally, so you specify the quantization through its FastModel loader rather than configuring BitsAndBytesConfig manually.

Python
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it",
    max_seq_length=2048,
    load_in_4bit=True,   # QLoRA: 4-bit base weights
)

The max_seq_length parameter caps the context window for training. Set it to the longest sequence in your training data plus a small buffer. Shorter sequences use less VRAM per batch, so do not set this to the model maximum unless your data requires it.

For standard LoRA (without quantization), change load_in_4bit to False. The rest of the pipeline stays identical.
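
One way to pick max_seq_length from your data is sketched below. It assumes you already have a token count per training example (for instance by running each example through the tokenizer). The buffer, the rounding multiple, and the model_max cap are illustrative defaults, not values from Unsloth.

```python
def suggest_max_seq_length(token_counts, buffer=32, multiple=64, model_max=8192):
    """Longest example plus a buffer, rounded up to a multiple, capped at model_max.

    buffer, multiple, and model_max are hypothetical defaults for illustration.
    """
    padded = max(token_counts) + buffer
    rounded = -(-padded // multiple) * multiple   # ceiling to the next multiple
    return min(rounded, model_max)


if __name__ == "__main__":
    counts = [310, 742, 1985]     # tokens per example (made-up numbers)
    print(suggest_max_seq_length(counts))
```

If a handful of outliers are far longer than the rest, truncating or dropping them is usually cheaper than raising the limit for every batch.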


Step 2: Attach LoRA Adapters

LoRA adapters inject trainable low-rank matrices into the model's linear layers. The key parameters are rank (how many dimensions the adapter matrices have), alpha (a scaling factor, typically 2x rank), and target modules (which layers get adapters).

Python
model = FastModel.get_peft_model(
    model,
    r=16,                     # LoRA rank: 16-64
    lora_alpha=32,             # Alpha = 2x rank
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    use_gradient_checkpointing="unsloth",
)

Targeting all linear layers (attention projections plus MLP gates) gives the adapter maximum flexibility to learn your task. If VRAM is tight, you can drop gate_proj, up_proj, and down_proj to reduce memory at the cost of some adaptation quality.

0.2%
With rank 16 targeting all linear layers, LoRA trains roughly 0.2% of the model's total parameters while leaving the remaining 99.8% frozen.

Gradient checkpointing ("unsloth" mode) trades a small amount of compute time for significant VRAM savings by recomputing activations during the backward pass instead of storing them. Keep this enabled unless you have VRAM to spare.
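
The parameter savings follow from simple arithmetic: a rank-r adapter on a d_in x d_out linear layer adds r * (d_in + d_out) trainable values instead of d_in * d_out. The sketch below uses a hypothetical 4096 x 4096 projection. Real Gemma layer shapes differ by model size.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters added by one rank-r adapter (matrices A and B)."""
    return r * (d_in + d_out)


def layer_fraction(d_in, d_out, r):
    """Adapter parameters as a fraction of the frozen layer's parameters."""
    return lora_params(d_in, d_out, r) / (d_in * d_out)


if __name__ == "__main__":
    d = 4096                      # hypothetical layer width
    for rank in (16, 32, 64):
        added = lora_params(d, d, rank)
        print(f"r={rank}: {added:,} params, "
              f"{layer_fraction(d, d, rank):.2%} of the layer")
```

Adapter size grows linearly with rank, which is why doubling r roughly doubles adapter memory. The whole-model percentage is typically lower than the per-layer figure because embeddings and normalization weights get no adapters.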


Step 3: Prepare Your Data

Gemma uses a chat template with three roles: system, user, and model. Note that Gemma uses "model" instead of "assistant" for the response role. Getting this wrong means the model learns to produce output under the wrong role token, which degrades inference quality.

Python
from datasets import load_dataset
from unsloth.chat_templates import standardize_data_formats

# Load your JSONL dataset (returns a DatasetDict with a "train" split)
dataset = load_dataset("json", data_files="training_data.jsonl")

# Each row should follow this structure:
# {"conversations": [
#   {"role": "system", "content": "You are a medical coder."},
#   {"role": "user", "content": "Code this diagnosis: ..."},
#   {"role": "model", "content": "ICD-10: E11.9 ..."}
# ]}

# Normalize the conversation format on the training split
dataset["train"] = standardize_data_formats(dataset["train"])

The amount of data you need depends on what you are teaching the model:

Use Case | Examples Needed | Notes
Style transfer | 200 - 1,000 | Teach tone, format, brand voice
Task-specific | 500 - 5,000 | Classification, extraction, routing
Domain adaptation | 10,000 - 50,000 | Legal, medical, financial terminology
General instruction | 5,000 - 20,000 | Broad capability improvement

Quality matters more than quantity. One hundred clean, well-structured examples will outperform a thousand noisy ones. Deduplicate your data, remove examples with conflicting labels, and verify that the "model" responses represent the exact output format you want in production.
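
A small pre-flight check can catch the most common data problems before any GPU time is spent. The sketch below assumes the "conversations" row structure shown earlier in this step. The function names are hypothetical, and the deduplication only removes exact duplicates.

```python
import json

ALLOWED_ROLES = {"system", "user", "model"}


def validate_row(row):
    """Return a list of problems found in one training row (empty list = OK)."""
    turns = row.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["missing or empty 'conversations' list"]
    problems = []
    for i, turn in enumerate(turns):
        role = turn.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"turn {i}: unexpected role {role!r}")
        if not str(turn.get("content", "")).strip():
            problems.append(f"turn {i}: empty content")
    if turns[-1].get("role") != "model":
        problems.append("last turn is not a 'model' response")
    return problems


def clean_rows(jsonl_lines):
    """Parse JSONL lines, dropping blank lines, exact duplicates, and invalid rows."""
    seen, kept = set(), []
    for line in jsonl_lines:
        if not line.strip():
            continue
        row = json.loads(line)
        key = json.dumps(row, sort_keys=True)   # canonical form for comparison
        if key in seen or validate_row(row):
            continue
        seen.add(key)
        kept.append(row)
    return kept


if __name__ == "__main__":
    with open("training_data.jsonl", encoding="utf-8") as f:
        rows = clean_rows(f)
    print(f"{len(rows)} usable rows")
```

Conflicting labels and near-duplicates need task-specific checks that this sketch does not attempt.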


Step 4: Train

Training uses the standard Hugging Face SFTTrainer from the TRL library. Unsloth intercepts the underlying operations to apply its CUDA kernel optimizations. The configuration below works for most single-GPU QLoRA runs.

Python
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    args=TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,   # effective batch size = 4 x 4 = 16
        num_train_epochs=3,
        learning_rate=2e-4,
        warmup_ratio=0.05,
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        logging_steps=10,
        save_strategy="epoch",
        fp16=not is_bfloat16_supported(),   # fall back to FP16 on older GPUs
        bf16=is_bfloat16_supported(),       # prefer BF16 where the GPU supports it
        optim="adamw_8bit",
        seed=42,
    ),
    max_seq_length=2048,
)

trainer.train()

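The adamw_8bit optimizer reduces the memory overhead of optimizer states. Gradient accumulation gives an effective batch size of 4 x 4 = 16 while only four examples are held in memory at a time, so this configuration keeps VRAM use low on a single GPU without giving up the smoother gradients of a larger batch.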
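
To see what these settings mean for a concrete dataset, the sketch below works out the effective batch size, the number of optimizer steps, and the warmup length. It is an approximation under stated assumptions: the function name is hypothetical, and the exact step count reported by the trainer can differ slightly depending on the library version and how it rounds partial batches.

```python
import math


def training_plan(num_examples, per_device_batch=4, grad_accum=4,
                  epochs=3, warmup_ratio=0.05, num_gpus=1):
    """Approximate step counts for the training configuration above."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


if __name__ == "__main__":
    print(training_plan(num_examples=1000))
```

With logging_steps=10, a run of this size produces only a few loss readings per epoch, so consider logging more often on small datasets.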

Monitor the training loss. It should decrease steadily for the first epoch and then flatten. If the loss drops to near zero, you are likely overfitting. Reduce epochs or increase the dataset size. If it plateaus above 1.0 and never decreases, increase the learning rate or check your data formatting.
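
The loss heuristics above can be turned into a quick check on the logged values. This is a sketch only: the function name and the thresholds for "near zero" (0.05) and "never decreases" (a total drop under 0.02) are assumptions chosen for illustration, while the 1.0 plateau level comes from the paragraph above.

```python
def diagnose_loss(losses, near_zero=0.05, high_plateau=1.0, min_drop=0.02):
    """Classify a sequence of logged training losses using simple heuristics."""
    first, last = losses[0], losses[-1]
    if last < near_zero:
        return "possible overfitting: reduce epochs or add data"
    if last > high_plateau and (first - last) < min_drop:
        return "not learning: raise learning rate or check data formatting"
    return "looks healthy"


if __name__ == "__main__":
    print(diagnose_loss([2.1, 1.2, 0.6, 0.4]))
    print(diagnose_loss([1.5, 0.3, 0.01]))
    print(diagnose_loss([1.8, 1.8, 1.79]))
```

Training loss alone cannot confirm overfitting. Comparing against loss on a held-out validation set is the more reliable signal.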


Step 5: Export & Deploy

After training completes, you have three export options depending on your deployment target:

Option A: Save LoRA adapter only

The smallest export. Saves only the trained adapter weights (typically 50-200 MB). Requires the base model to be available at inference time.

Python
model.save_pretrained("gemma-ft-adapter")
tokenizer.save_pretrained("gemma-ft-adapter")
# Upload to Hugging Face Hub (optional)
model.push_to_hub("your-org/gemma-ft-adapter")

Option B: Merge and save full model

Merges the adapter into the base weights, producing a standalone model. Larger file size but simpler deployment.

Python
model.save_pretrained_merged(
    "gemma-ft-merged",
    tokenizer,
    save_method="merged_16bit",
)

Option C: Export to GGUF for Ollama / llama.cpp

Quantizes and exports to the GGUF format for local inference with Ollama or llama.cpp. This is the most popular deployment path for self-hosted models.

Python + Terminal
model.save_pretrained_gguf(
    "gemma-ft-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)
# Then in your terminal:
# ollama create my-gemma -f gemma-ft-gguf/Modelfile
# ollama run my-gemma

The q4_k_m quantization is the sweet spot between quality and file size for most use cases. For maximum quality, use q8_0. For minimum file size, use q4_0.
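
For a feel of the size trade-off, the sketch below estimates file sizes for the two simplest GGUF formats, which store weights in blocks of 32 with one 16-bit scale per block (18 bytes per block for q4_0, 34 bytes for q8_0). q4_k_m is left out because it mixes several block types. The function name and the 4-billion-parameter example are illustrative, and real files come out somewhat larger because some tensors and metadata are kept at higher precision.

```python
# Bytes used per block of 32 weights: packed weights plus one 16-bit scale.
BLOCK_BYTES = {
    "q4_0": 16 + 2,   # 32 weights x 4 bits = 16 bytes, plus 2-byte scale
    "q8_0": 32 + 2,   # 32 weights x 8 bits = 32 bytes, plus 2-byte scale
}


def estimate_gguf_gb(num_params, method):
    """Rough lower-bound file size in GB (10^9 bytes) for the given format."""
    blocks = num_params / 32
    return blocks * BLOCK_BYTES[method] / 1e9


if __name__ == "__main__":
    params = 4e9   # hypothetical 4-billion-parameter model
    for method in ("q4_0", "q8_0"):
        print(f"{method}: ~{estimate_gguf_gb(params, method):.2f} GB")
```

That works out to 4.5 bits per weight for q4_0 and 8.5 for q8_0, so q8_0 files are a little under twice the size.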


Hyperparameter Tuning Guide

The default values in the training code above work well for most cases. When you need to squeeze out more quality or troubleshoot training instability, these are the knobs to turn.

Parameter | Range | Default | Impact
LoRA rank (r) | 16 - 64 | 16 | Higher = more capacity but more VRAM. Start low, increase if underfitting.
LoRA alpha | 2x rank | 32 | Scales the adapter contribution. Keep at 2x rank as a rule of thumb.
Learning rate | 1e-4 - 3e-4 | 2e-4 | Too high = unstable loss. Too low = slow convergence.
Epochs | 1 - 5 | 3 | More epochs = better fit, but overfitting risk above 5.
Batch size | Max that fits | 4 | Larger = smoother gradients. Use grad accumulation if VRAM-limited.
Warmup ratio | 0.05 - 0.1 | 0.05 | Prevents early training instability. Higher for small datasets.
Dropout | 0 - 0.1 | 0.05 | Regularization. Increase if overfitting, decrease if underfitting.

One variable at a time. Change a single hyperparameter between runs and evaluate on a held-out validation set. Changing multiple parameters simultaneously makes it impossible to attribute improvements or regressions.
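
A simple way to enforce this discipline is to generate the run list programmatically. The sketch below starts from the defaults in the table and produces one configuration per changed value. The function name and the candidate values are illustrative. LoRA rank is left out because changing it usually means changing alpha as well.

```python
DEFAULTS = {
    "learning_rate": 2e-4,
    "num_train_epochs": 3,
    "lora_dropout": 0.05,
}


def one_at_a_time(defaults, candidates):
    """Return a baseline run plus runs that each change exactly one setting."""
    runs = [dict(defaults)]                  # baseline comes first
    for name, values in candidates.items():
        for value in values:
            if value == defaults[name]:
                continue                     # already covered by the baseline
            run = dict(defaults)
            run[name] = value
            runs.append(run)
    return runs


if __name__ == "__main__":
    sweep = one_at_a_time(DEFAULTS, {
        "learning_rate": [1e-4, 2e-4, 3e-4],
        "num_train_epochs": [1, 3, 5],
    })
    for run in sweep:
        print(run)
```

Keeping the seed fixed across these runs makes it easier to attribute a difference to the setting that changed.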


Troubleshooting Common Issues

These are the problems that surface most often during Gemma fine-tuning, along with their fixes.

CUDA out of memory

Reduce per_device_train_batch_size to 1 and increase gradient_accumulation_steps to compensate. If that is still too large, reduce max_seq_length. As a last resort, drop the MLP target modules from the LoRA config to reduce adapter memory.
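
The batch-size trade described above can be sketched as a helper that halves the per-device batch and raises gradient accumulation to compensate. The function name is hypothetical. The effective batch size is preserved exactly when the batch size is a power of two and approximately otherwise.

```python
def shrink_batch(per_device_batch, grad_accum):
    """Halve the per-device batch and scale accumulation to keep the product."""
    if per_device_batch <= 1:
        return per_device_batch, grad_accum      # nothing left to shrink
    new_batch = max(1, per_device_batch // 2)
    new_accum = grad_accum * per_device_batch // new_batch
    return new_batch, new_accum


if __name__ == "__main__":
    batch, accum = 4, 4
    while True:
        print(f"batch={batch}, grad_accum={accum}, effective={batch * accum}")
        smaller = shrink_batch(batch, accum)
        if smaller == (batch, accum):
            break
        batch, accum = smaller
```

Once the batch size reaches 1, the remaining levers are the ones listed above: shorter max_seq_length or fewer target modules.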

Loss not decreasing

Check your data formatting first. The most common cause is using "assistant" instead of "model" for the response role. Verify that your JSONL has the correct three-role structure (system, user, model). If the data is correct, increase the learning rate to 3e-4 or increase the LoRA rank.

Model outputs garbage after training

This usually means overfitting. Reduce epochs, add more training data, or increase dropout. Also check that the training data does not contain formatting artifacts (extra newlines, HTML tags, encoding issues) that the model memorized.

Do not mix "assistant" and "model" roles
Gemma expects the response role to be "model," not "assistant." Training with the wrong role token will produce a model that generates text under the wrong special token, degrading output quality at inference.

Unsloth installation fails

Unsloth requires specific CUDA and PyTorch version combinations. Check the Unsloth compatibility matrix for your CUDA version. The most common fix is installing a PyTorch build that matches your CUDA version before installing Unsloth.


FAQ

Can I fine-tune Gemma on Apple Silicon (M1/M2/M3)?

Unsloth requires NVIDIA CUDA, so it does not run on Apple Silicon directly. However, you can use the standard Hugging Face stack with MPS (Metal Performance Shaders) for small models like E2B. Training will be significantly slower than CUDA. For anything larger than E4B, use a cloud GPU (Colab, RunPod, Lambda) instead.

How long does fine-tuning take?

Highly variable. A typical QLoRA run with 1,000 examples on Gemma E4B using an RTX 4090 finishes in 15 to 30 minutes with Unsloth. The same run on an RTX 3070 takes roughly 45 to 90 minutes. Training time grows roughly linearly with dataset size and epoch count, and larger models take proportionally longer per step.

Can I combine fine-tuning with RAG?

Yes, and this is often the ideal production setup. Fine-tune for consistent formatting, tone, and task behavior. Use RAG to inject up-to-date facts at inference time. The fine-tuned model learns how to use retrieved context effectively, while RAG reduces hallucination on factual queries by grounding answers in retrieved text.
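
As a sketch of how the two fit together at inference time, the helper below packs retrieved passages into the user turn of a conversation that a fine-tuned model could then answer. The function name, the prompt layout, and the example passages are all hypothetical. The retrieval step itself is out of scope here.

```python
def build_rag_conversation(system, question, passages):
    """Build system and user turns with retrieved passages placed before the question."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


if __name__ == "__main__":
    turns = build_rag_conversation(
        system="Answer using only the provided context.",
        question="What is the refund window?",
        passages=["Refunds are accepted within 30 days.",
                  "Store credit is issued after 30 days."],
    )
    for turn in turns:
        print(turn["role"], "->", turn["content"])
```

If production prompts will carry retrieved context like this, include examples with the same layout in the training data so the model sees that format during fine-tuning.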

Do I need to fine-tune again when a new Gemma version releases?

LoRA adapters are tied to the specific model architecture they were trained on. If the new version changes the layer dimensions or architecture, you need to retrain. If it is the same architecture with updated base weights, you can sometimes reuse the adapter, but retraining typically yields better results since the new base weights shift the loss landscape.

Your Progress

1. Load model in 4-bit: FastModel.from_pretrained with load_in_4bit=True
2. Attach LoRA adapters: Rank 16, alpha 32, target all linear layers
3. Prepare training data: JSONL with system/user/model roles
4. Train with SFTTrainer: 3 epochs, lr 2e-4, cosine scheduler
5. Export and deploy: Adapter, merged model, or GGUF for Ollama

Fact-checked against vendor documentation and official sources, May 2026
Gemma is a trademark of Google LLC. Hugging Face is a trademark of Hugging Face, Inc. Unsloth is a trademark of Unsloth AI. NVIDIA, CUDA, RTX, and A6000 are trademarks of NVIDIA Corporation. Ollama is a trademark of Ollama, Inc. All trademarks belong to their respective owners. Tech Jacks Solutions is not affiliated with or endorsed by any of these companies.
Before You Use AI
Your Privacy

When fine-tuning any AI model, your training data is processed locally on your hardware or on cloud GPU instances you provision. Google does not receive your training data when using open-weight Gemma models downloaded from Hugging Face. However, if you use Google Vertex AI for fine-tuning, data processing is governed by Google Cloud's data processing terms. Review your chosen platform's privacy policy before uploading sensitive datasets.

Your Rights & Our Transparency

Under GDPR and CCPA, you have the right to access, correct, and delete your personal data. This article reflects independent editorial analysis and is not sponsored by Google, Hugging Face, or Unsloth. Some links may be affiliate links; this does not influence our recommendations. The EU AI Act establishes risk-based obligations for AI systems that may apply to fine-tuned models deployed in production.