How to Use Gemma: Local Setup, API & Fine-Tuning Guide
Gemma is Google's open-weight model family released under the Apache 2.0 license. This guide walks through every step: picking a model, running it locally, connecting via API, and fine-tuning with QLoRA. No cloud credits required for local setups.
Prerequisites
Before you start, make sure you have these basics covered:
- Python installed. Check with python --version.
- A virtual environment. Create one with python -m venv gemma-env and activate it.
Choose Your Model
Gemma ships in four primary configurations under the Gemma 3 and Gemma 3n families. The right choice depends on your hardware and workload. All models use the Apache 2.0 license, so there are no commercial restrictions.
| Model | Parameters | Context | VRAM (4-bit) | Best For |
|---|---|---|---|---|
| Gemma 4 E2B | 2B | 8K | ~3 GB | Edge devices, rapid prototyping, resource-constrained environments |
| Gemma 4 E4B | 4B | 128K | ~6 GB | Long-context tasks, audio input, balanced performance |
| Gemma 4 26B MoE | 26B (MoE) | 128K | ~14–16 GB | Production workloads, high throughput, efficiency at scale |
| Gemma 4 31B Dense | 31B (Dense) | 256K | ~16–22 GB | Maximum capability, long-document analysis, research |
Quick pick: Start with E4B if you have a modern gaming GPU (RTX 3070+). It handles most tasks well and supports 128K context. Move to the 26B MoE when you need stronger reasoning or multi-turn conversation quality.
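The selection logic in the table can be sketched as a small lookup. This is only an illustration: the VRAM figures are the upper bounds from the 4-bit column above, and the 1 GB headroom for context and runtime overhead is an assumed value, not a measured one.

```python
# Approximate 4-bit VRAM needs in GB (upper bounds from the table above).
MODELS = [
    ("Gemma 4 E2B", 3),
    ("Gemma 4 E4B", 6),
    ("Gemma 4 26B MoE", 16),
    ("Gemma 4 31B Dense", 22),
]


def largest_model_that_fits(vram_gb, headroom_gb=1.0):
    """Return the largest listed model whose estimate fits, or None.

    headroom_gb is an assumed safety margin, not a published figure.
    """
    usable = vram_gb - headroom_gb
    fitting = [name for name, need_gb in MODELS if need_gb <= usable]
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    for vram in (3, 8, 24):
        print(f"{vram} GB -> {largest_model_that_fits(vram)}")
```

Treat the result as a starting point; long contexts and larger batch sizes raise real memory use above these estimates.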
Local Setup with Ollama
Ollama is the fastest path from zero to running Gemma. It handles model downloading, quantization, and serving in a single binary. No Python required.
Install and run
After installing Ollama from ollama.com, open your terminal:
# Pull and run Gemma 3 (defaults to the 4B variant)
ollama run gemma3
# Run a specific size
ollama run gemma3:1b
ollama run gemma3:27b
# Run in server mode for API access
ollama serve
That single ollama run command downloads the GGUF-quantized model (usually Q4_K_M) and starts an interactive chat. First run takes a few minutes depending on your connection; subsequent launches are instant.
Use the Ollama API
With ollama serve running in the background, you get a local REST API on port 11434:
# Chat completion (cURL)
curl http://localhost:11434/api/chat -d '{
  "model": "gemma3",
  "messages": [{"role": "user", "content": "Explain QLoRA in 3 sentences."}]
}'

# Generate (streaming)
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Write a Python function to parse CSV files."
}'
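Ollama also exposes an OpenAI-compatible endpoint, so most OpenAI client libraries work with little or no change. Point your base_url to http://localhost:11434/v1 and use gemma3 as the model name. Most clients insist on an api_key value; Ollama ignores it, so any placeholder string works.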
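For a dependency-free client, the same chat request can be sent from Python's standard library. This is a minimal sketch, assuming the default local address and the gemma3 model name used above. The helper names build_chat_payload and chat are illustrative, and setting "stream" to false asks Ollama for a single JSON reply instead of a stream of chunks.

```python
import json
import urllib.request

# Default local address from the text above; change it if your server differs.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_payload(prompt, model="gemma3", stream=False):
    """Build the JSON body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,  # False: one JSON object instead of streamed chunks
    }


def chat(prompt, model="gemma3", url=OLLAMA_CHAT_URL, timeout=120):
    """Send one prompt and return the reply text. Needs a running Ollama server."""
    body = json.dumps(build_chat_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.loads(response.read().decode("utf-8"))
    return reply["message"]["content"]


if __name__ == "__main__":
    # Prints the request body only; call chat(...) once the server is up.
    print(json.dumps(build_chat_payload("Explain QLoRA in 3 sentences."), indent=2))
```

With the server running, chat("Explain QLoRA in 3 sentences.") returns the model's reply as a string.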
Python Setup with Transformers
For programmatic control, fine-tuning, or integration into Python applications, use the Hugging Face Transformers library. This path gives you full access to model internals.
Install dependencies
# Core stack
pip install transformers torch accelerate
# Optional: multimodal support
pip install torchvision # images
pip install librosa # audio (E4B)
pip install torchcodec # video
Load and run inference
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-3-4b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "What is retrieval-augmented generation?"}
]
# return_dict=True yields input_ids plus attention_mask as a dict
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True, return_dict=True
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
The device_map="auto" argument handles GPU placement automatically. If your model does not fit in VRAM, it spills to CPU RAM (slower but functional).
4-bit quantized loading
To run larger models on limited hardware, load in 4-bit:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization with bfloat16 compute (requires the bitsandbytes package)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-27b-it",
    quantization_config=bnb_config,
    device_map="auto"
)
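To see why 4-bit loading helps, a back-of-envelope estimate of weight memory is parameter count times bits per weight. The sketch below covers weights only. Activations, the KV cache, and quantization metadata add to it, so treat the result as a floor, not a prediction.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for bits in (16, 4):
        print(f"27B at {bits}-bit: {weight_memory_gb(27, bits):.1f} GB")
    # 27B at 16-bit: 54.0 GB
    # 27B at 4-bit: 13.5 GB
```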
The bitsandbytes library (needed for 4-bit loading) has limited Windows support. On Windows, use WSL2 or switch to the Ollama path for quantized inference.
Running via API
If you want to skip local setup entirely, Gemma is available through several hosted APIs. These options trade hardware costs for per-request pricing.
Google AI Studio
Google AI Studio provides free-tier access to Gemma models with a playground interface and API keys. It supports all Gemma variants and is the simplest way to experiment without any local installation.
Hugging Face Inference API
Hugging Face hosts Gemma models on their serverless inference infrastructure. The free tier handles basic experimentation; the Pro plan ($9/month) increases rate limits:
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_YOUR_TOKEN")
response = client.chat.completions.create(
    model="google/gemma-3-4b-it",
    messages=[{"role": "user", "content": "Summarize this article."}],
    max_tokens=256
)
print(response.choices[0].message.content)
Third-party providers
Several cloud providers offer Gemma endpoints with OpenAI-compatible APIs: Together AI, Fireworks AI, and Groq among others. Check each provider's model catalog for available Gemma sizes and pricing.
Practitioner note: For production workloads that need predictable latency, self-host with vLLM or SGLang. API providers work well for prototyping but introduce a third-party dependency and variable response times.
Fine-Tuning with QLoRA
QLoRA (Quantized Low-Rank Adaptation) lets you fine-tune Gemma on a single consumer GPU. Combined with Unsloth, you get roughly 2x faster training and 70% less memory usage compared to standard fine-tuning.
Step 1: Load model in 4-bit
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it",
    max_seq_length=4096,
    load_in_4bit=True
)
Step 2: Attach LoRA adapters
model = FastModel.get_peft_model(
    model,
    r=16,            # LoRA rank (16-64 recommended)
    lora_alpha=32,   # Alpha = 2x rank
    target_modules=[
        "q_proj", "k_proj", "v_proj",
        "o_proj", "gate_proj",
        "up_proj", "down_proj"
    ],
    lora_dropout=0.05
)
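LoRA replaces the update to a d_out x d_in weight with two small matrices of rank r. That adds r * (d_in + d_out) trainable parameters, and the update is scaled by lora_alpha / r. The sketch below works that arithmetic for a hypothetical 2560 x 2560 projection; the dimension is an illustrative assumption, not a value read from any Gemma config.

```python
def lora_added_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update before it is added to the output."""
    return lora_alpha / r


if __name__ == "__main__":
    d = 2560  # hypothetical hidden size, for illustration only
    full = d * d
    added = lora_added_params(d, d, r=16)
    print(f"full matrix: {full:,} params")
    print(f"LoRA r=16:   {added:,} params ({added / full:.2%} of full)")
    print(f"scaling with alpha=32, r=16: {lora_scaling(32, 16)}")
```

Doubling r doubles the adapter size, which is why the rank is the main lever on adapter memory and file size.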
Step 3: Format your training data
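Gemma's prompt format labels the assistant turn "model" (not "assistant"). If you build prompt strings by hand, a wrong label is a common gotcha: nothing raises an error, but training quality suffers. Hugging Face chat templates for Gemma generally translate an "assistant" role to "model" for you, so check which convention your data pipeline expects.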
# Correct format for Gemma
training_example = {
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to reverse a list."},
        {"role": "model", "content": "def reverse_list(lst):\n    return lst[::-1]"}
    ]
}
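When a chat template is applied, messages like these are flattened into Gemma's turn markers. The hand-rolled sketch below shows roughly what that text looks like. It assumes the <start_of_turn> and <end_of_turn> markers and the Gemma 3 convention of folding a system message into the first user turn, and it omits the BOS token that the tokenizer normally adds. The function name render_gemma_prompt is illustrative; for real training, prefer tokenizer.apply_chat_template.

```python
def render_gemma_prompt(messages):
    """Render chat messages in Gemma's turn-based text format (sketch only)."""
    pending_system = ""
    parts = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            # Assumption: no dedicated system turn, so prepend to the next user turn.
            pending_system = content + "\n\n"
            continue
        if role not in ("user", "model"):
            raise ValueError(f"unexpected role: {role!r}")
        if role == "user":
            content = pending_system + content
            pending_system = ""
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    return "".join(parts)


if __name__ == "__main__":
    example = [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to reverse a list."},
        {"role": "model", "content": "def reverse_list(lst):\n    return lst[::-1]"},
    ]
    print(render_gemma_prompt(example))
```

Because the sketch rejects any role other than user, model, or system, running it over a dataset is also a quick way to surface mislabeled turns before training.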
Step 4: Train
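import torch
from trl import SFTTrainer
from transformers import TrainingArguments

# `dataset` is your prepared training set (a datasets.Dataset), defined elsewhere
use_bf16 = torch.cuda.is_bf16_supported()

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,  # newer TRL releases name this argument processing_class
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./gemma-finetuned",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        warmup_steps=10,
        bf16=use_bf16,       # prefer bfloat16 where the GPU supports it
        fp16=not use_bf16,   # fall back to fp16 otherwise
        logging_steps=10,
        save_strategy="epoch"
    )
)
trainer.train()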
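The batch settings above combine into an effective batch size of 2 x 4 = 8 per GPU, which in turn sets how many optimizer steps the run takes. A rough sketch of that arithmetic follows; the trainer's own rounding on a final partial batch may differ slightly, and the 1,000-example dataset size is a made-up figure for illustration.

```python
import math


def training_steps(num_examples, per_device_batch=2, grad_accum=4,
                   epochs=3, num_gpus=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


if __name__ == "__main__":
    batch, steps = training_steps(1000)  # hypothetical dataset size
    print(f"effective batch: {batch}, total steps: {steps}")
    # effective batch: 8, total steps: 375
```

Knowing the total helps sanity-check warmup_steps=10 and logging_steps=10: on a very small dataset, ten warmup steps can be a large share of the whole run.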
Step 5: How much data do you need?
| Task Type | Examples Needed | Typical Use Case |
|---|---|---|
| Style transfer | 200–1,000 | Match a specific writing voice or tone |
| Task-specific | 500–5,000 | Classification, extraction, structured output |
| Domain adaptation | 10,000–50,000 | Specialized vocabulary, industry knowledge |
In Gemma's prompt format, assistant responses are tagged "model", not "assistant". A wrong role label in hand-built prompts raises no error but degrades output quality. Double-check your data formatting.
Exporting Your Model
After fine-tuning, you have three export options depending on where you plan to deploy:
Option A: LoRA adapter only
Saves just the fine-tuned weights (typically 50–200 MB). Load the base model separately and apply the adapter at inference time. Best when you need to switch between multiple fine-tunes on the same base.
# Save LoRA adapter
model.save_pretrained("./gemma-lora-adapter")
tokenizer.save_pretrained("./gemma-lora-adapter")

# Load later
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")
model = PeftModel.from_pretrained(base_model, "./gemma-lora-adapter")
Option B: Merged model
Merges the adapter back into the base model weights. Produces a standalone model that does not require PEFT at inference. Useful for deployment to environments where you want a single model directory.
# Merge and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./gemma-merged")
tokenizer.save_pretrained("./gemma-merged")
Option C: GGUF export (for Ollama / llama.cpp)
Converts to GGUF format with your choice of quantization. Deploy locally with Ollama or any llama.cpp-compatible tool:
# Using Unsloth's built-in export
model.save_pretrained_gguf(
    "./gemma-gguf",
    tokenizer,
    quantization_method="q4_k_m"
)
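Ollama imports a GGUF file through a Modelfile, which the ollama create command shown next reads. Here is a minimal sketch of generating one. The GGUF filename below is hypothetical, so substitute whatever file the export step actually wrote into ./gemma-gguf. The helper names are illustrative too.

```python
from pathlib import Path


def modelfile_text(gguf_path, temperature=None):
    """Build Modelfile contents that point Ollama at a local GGUF file."""
    lines = [f"FROM {gguf_path}"]
    if temperature is not None:
        lines.append(f"PARAMETER temperature {temperature}")
    return "\n".join(lines) + "\n"


def write_modelfile(gguf_path, directory=".", temperature=None):
    """Write the Modelfile into `directory` and return its path."""
    target = Path(directory) / "Modelfile"
    target.write_text(modelfile_text(gguf_path, temperature), encoding="utf-8")
    return target


if __name__ == "__main__":
    # Hypothetical filename; check the export directory for the real one.
    print(modelfile_text("./gemma-gguf/model-q4_k_m.gguf", temperature=0.7))
```

Calling write_modelfile(...) in the project directory produces the ./Modelfile that the create command expects.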
# Then in Ollama:
# ollama create my-gemma -f ./Modelfile
Hardware Planning
Match your GPU to the model and task. These numbers assume 4-bit quantization (QLoRA / GGUF Q4_K_M), which is what most practitioners use for local work.
| GPU | VRAM | Inference | Fine-Tune (QLoRA) |
|---|---|---|---|
| RTX 3060 / T4 | 12–16 GB | E2B, E4B | E2B |
| RTX 3070 / 3080 | 8–12 GB | E2B, E4B | E2B, E4B (tight) |
| RTX 4090 / A5000 | 24 GB | All models | E2B, E4B, 26B MoE |
| A6000 / A100 40GB | 40–48 GB | All models | All models |
| H100 / A100 80GB | 80 GB | All models (fp16) | All models (full fine-tune) |
Cloud alternative: If you do not own a GPU, Google Colab (free tier with T4) handles E2B and E4B inference. For 27B models, Colab Pro ($10/month for A100 access) or Lambda Labs ($1.10/hr for A100 40GB) are cost-effective starting points.
Frequently Asked Questions
- Out-of-memory errors when loading a model: load in 4-bit with quantization_config=BitsAndBytesConfig(load_in_4bit=True) and device_map="auto". Close other GPU processes (browsers, game launchers) that may be holding VRAM.
- Ollama cannot find the model: the model name is gemma3 (no space, no hyphen). Run ollama list to see all downloaded models. If you need a specific size, append the tag: ollama run gemma3:27b.
- Poor results after fine-tuning: (1) Check that your training data uses "model" as the role, not "assistant". (2) Confirm the learning rate is between 1e-4 and 3e-4. (3) Make sure you have at least 200 training examples. If loss is flat from the start, your data formatting is likely wrong.
- Running on Apple Silicon: (1) Ollama: ollama run gemma3. (2) MLX, Apple's machine learning framework, supports Gemma models optimized for M1/M2/M3 chips. Check mlx-examples on GitHub for Gemma-specific configurations.