Google Gemma

How to Use Gemma: Local Setup, API & Fine-Tuning Guide

Gemma is Google's open-weight model family released under the Apache 2.0 license. This guide walks through every step: picking a model, running it locally, connecting via API, and fine-tuning with QLoRA. No cloud credits required for local setups.

  • RTX 4090: runs the 26B MoE model (4-bit)
  • Apache 2.0: fully open license
  • 5+ deployment tools: Ollama, llama.cpp, HF, vLLM, MLX
  • 200–50K training examples needed

Prerequisites

Before you start, make sure you have these basics covered.

Setup Checklist
  • Python 3.10+ installed (3.11 recommended). Verify with python --version.
  • NVIDIA GPU with CUDA (for local inference). CPU works for E2B but is slow. Minimum RTX 3060 / T4 for 4-bit quantized models.
  • Virtual environment created. Run python -m venv gemma-env and activate it.
  • Ollama installed (optional, for easiest local path). Download from ollama.com/download.
  • Hugging Face account (optional, for Transformers path). Sign up at huggingface.co/join.
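
The checklist can also be probed from a script. The sketch below is one possible way to do that, not an official tool: the function names are made up for illustration, and it only checks what can be detected from inside Python (interpreter version, an active virtual environment, an ollama binary on PATH, and an installed torch package).

```python
import importlib.util
import shutil
import sys


def python_ok(version=sys.version_info[:2], minimum=(3, 10)):
    """Return True when the interpreter meets the guide's minimum version."""
    return tuple(version) >= tuple(minimum)


def check_prerequisites():
    """Collect a pass/fail flag for each checklist item that can be probed."""
    results = {
        "python>=3.10": python_ok(),
        # Inside a venv, sys.prefix differs from sys.base_prefix.
        "virtualenv active": sys.prefix != sys.base_prefix,
        # Optional: only needed for the Ollama path.
        "ollama on PATH": shutil.which("ollama") is not None,
        # Optional: only needed for the Transformers path.
        "torch installed": importlib.util.find_spec("torch") is not None,
    }
    if results["torch installed"]:
        import torch  # imported lazily so the script runs without it

        results["CUDA available"] = torch.cuda.is_available()
    return results


if __name__ == "__main__":
    for item, ok in check_prerequisites().items():
        print(f"[{'x' if ok else ' '}] {item}")
```

A missing optional item (Ollama or torch) is not a blocker; it only tells you which of the two local paths is ready to use.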

Steps in this guide
  • Choose your Gemma model
  • Run locally with Ollama
  • Set up Python + Transformers
  • Connect via API
  • Fine-tune with QLoRA
  • Export your model
  • Plan hardware budget

Choose Your Model

Gemma ships in four primary configurations under the Gemma 3 and Gemma 3n families. The right choice depends on your hardware and workload. All models use the Apache 2.0 license, so there are no commercial restrictions.

Model | Parameters | Context | VRAM (4-bit) | Best For
Gemma 4 E2B | 2B | 8K | ~3 GB | Edge devices, rapid prototyping, resource-constrained environments
Gemma 4 E4B | 4B | 128K | ~6 GB | Long-context tasks, audio input, balanced performance
Gemma 4 26B MoE | 26B (MoE) | 128K | ~14–16 GB | Production workloads, high throughput, efficiency at scale
Gemma 4 31B Dense | 31B (Dense) | 256K | ~16–22 GB | Maximum capability, long-document analysis, research

Quick pick: Start with E4B if you have a modern gaming GPU (RTX 3070+). It handles most tasks well and supports 128K context. Move to 26B MoE when you need stronger reasoning or multi-turn conversation quality.
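
The same decision can be written as a small lookup. This is only a sketch: the VRAM figures are the upper bounds of the 4-bit column in the table above, and the pick_model helper and its headroom_gb parameter are hypothetical names introduced for illustration.

```python
# Upper-bound 4-bit VRAM figures (GB) copied from the model table.
MODELS = [
    ("Gemma 4 E2B", 3),
    ("Gemma 4 E4B", 6),
    ("Gemma 4 26B MoE", 16),
    ("Gemma 4 31B Dense", 22),
]


def pick_model(vram_gb, headroom_gb=0.0):
    """Return the largest listed model whose 4-bit footprint fits, or None."""
    usable = vram_gb - headroom_gb
    fitting = [name for name, need in MODELS if need <= usable]
    return fitting[-1] if fitting else None


print(pick_model(8))   # Gemma 4 E4B
print(pick_model(24))  # Gemma 4 31B Dense
print(pick_model(2))   # None
```

Fitting in memory is a necessary condition, not a recommendation. Long contexts add KV-cache memory on top of the weights, so passing a few GB of headroom_gb gives a more conservative pick.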


Local Setup with Ollama

Ollama is the fastest path from zero to running Gemma. It handles model downloading, quantization, and serving in a single binary. No Python required.

Install and run

After installing Ollama from ollama.com, open your terminal:

# Pull and run Gemma 3 (defaults to the 4B variant)
ollama run gemma3

# Run a specific size
ollama run gemma3:1b
ollama run gemma3:27b

# Run in server mode for API access
ollama serve

That single ollama run command downloads the GGUF-quantized model (usually Q4_K_M) and starts an interactive chat. The first run takes a few minutes depending on your connection; later launches skip the download and start much faster.

Use the Ollama API

With ollama serve running in the background, you get a local REST API on port 11434:

# Chat completion (cURL)
curl http://localhost:11434/api/chat -d '{
  "model": "gemma3",
  "messages": [{"role": "user", "content": "Explain QLoRA in 3 sentences."}]
}'

# Generate (streaming)
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Write a Python function to parse CSV files."
}'

Ollama also exposes an OpenAI-compatible endpoint, so most OpenAI client libraries work without code changes. Point your base_url to http://localhost:11434/v1 and use gemma3 as the model name.
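
The cURL chat call above translates directly to Python. The sketch below uses only the standard library and the native /api/chat endpoint; the helper names build_chat_payload and chat are illustrative, and it assumes the server is listening on the default port with the gemma3 model already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_payload(prompt, model="gemma3", stream=False):
    """Build the JSON body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,  # False asks for one JSON object, not a stream
    }


def chat(prompt, model="gemma3", url=OLLAMA_URL, timeout=120):
    """Send one user message and return the reply text."""
    body = json.dumps(build_chat_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["message"]["content"]
```

With the server running, chat("Explain QLoRA in 3 sentences.") returns the reply as a string. If nothing is listening on the port, urlopen raises urllib.error.URLError, which is worth catching in anything beyond a quick script.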

One command takes you from install to running Gemma locally. No Python, no API keys, no cloud dependencies.

Python Setup with Transformers

For programmatic control, fine-tuning, or integration into Python applications, use the Hugging Face Transformers library. This path gives you full access to model internals.

Install dependencies

# Core stack
pip install transformers torch accelerate

# Optional: multimodal support
pip install torchvision   # images
pip install librosa        # audio (E4B)
pip install torchcodec     # video

Load and run inference

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-3-4b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "What is retrieval-augmented generation?"}
]
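inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True,
    return_dict=True  # returns input_ids together with attention_mask
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))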

The device_map="auto" argument handles GPU placement automatically. If your model does not fit in VRAM, it spills to CPU RAM (slower but functional).

4-bit quantized loading

To run larger models on limited hardware, load in 4-bit:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-27b-it",
    quantization_config=bnb_config,
    device_map="auto"
)

Note: bitsandbytes works best on Linux
The bitsandbytes library (needed for 4-bit loading) has limited Windows support. On Windows, use WSL2 or switch to the Ollama path for quantized inference.

Running via API

If you want to skip local setup entirely, Gemma is available through several hosted APIs. These options trade hardware costs for per-request pricing.

Google AI Studio

Google AI Studio provides free-tier access to Gemma models with a playground interface and API keys. It supports all Gemma variants and is the simplest way to experiment without any local installation.

Hugging Face Inference API

Hugging Face hosts Gemma models on their serverless inference infrastructure. The free tier handles basic experimentation; the Pro plan ($9/month) increases rate limits:

import os
from huggingface_hub import InferenceClient

# Read the token from the HF_TOKEN environment variable instead of hardcoding it
client = InferenceClient(token=os.environ["HF_TOKEN"])
response = client.chat.completions.create(
    model="google/gemma-3-4b-it",
    messages=[{"role": "user", "content": "Summarize this article."}],
    max_tokens=256
)
print(response.choices[0].message.content)

Third-party providers

Several cloud providers offer Gemma endpoints with OpenAI-compatible APIs: Together AI, Fireworks AI, and Groq among others. Check each provider's model catalog for available Gemma sizes and pricing.

Practitioner note: For production workloads that need predictable latency, self-host with vLLM or SGLang. API providers work well for prototyping but introduce a third-party dependency and variable response times.


Fine-Tuning with QLoRA

QLoRA (Quantized Low-Rank Adaptation) lets you fine-tune Gemma on a single consumer GPU. Combined with Unsloth, you get roughly 2x faster training and 70% less memory usage compared to standard fine-tuning.

70% memory reduction with QLoRA + Unsloth: fine-tune Gemma 4B on an RTX 3060 with 6 GB VRAM.

Step 1: Load model in 4-bit

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it",
    max_seq_length=4096,
    load_in_4bit=True
)

Step 2: Attach LoRA adapters

model = FastModel.get_peft_model(
    model,
    r=16,                # LoRA rank (16-64 recommended)
    lora_alpha=32,       # Alpha = 2x rank
    target_modules=[
        "q_proj", "k_proj", "v_proj",
        "o_proj", "gate_proj",
        "up_proj", "down_proj"
    ],
    lora_dropout=0.05
)

Step 3: Format your training data

Gemma uses "model" as the assistant role (not "assistant"). This is a common gotcha that causes silent training failures.

# Correct format for Gemma
training_example = {
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to reverse a list."},
        {"role": "model", "content": "def reverse_list(lst):\n    return lst[::-1]"}
    ]
}
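
To see why the role name matters, it helps to look at the text the model is actually trained on. The sketch below renders a message list by hand and rejects any role other than system, user, or model. It assumes the <start_of_turn> / <end_of_turn> markers used by earlier Gemma releases and assumes system text is folded into the first user turn; render_gemma_prompt is a hypothetical helper, and tokenizer.apply_chat_template remains the authoritative formatter for whichever checkpoint you load.

```python
VALID_ROLES = {"system", "user", "model"}


def render_gemma_prompt(messages):
    """Render messages into Gemma-style turns, failing loudly on bad roles."""
    system_prefix = ""
    turns = []
    for message in messages:
        role = message["role"]
        if role not in VALID_ROLES:
            raise ValueError(
                f"unexpected role {role!r}; use 'model' for assistant replies"
            )
        if role == "system":
            # Assumption: system text is prepended to the next user turn.
            system_prefix = message["content"] + "\n\n"
            continue
        content = message["content"]
        if role == "user" and system_prefix:
            content, system_prefix = system_prefix + content, ""
        turns.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    return "".join(turns)


example = [
    {"role": "user", "content": "Hi"},
    {"role": "model", "content": "Hello!"},
]
print(render_gemma_prompt(example))
```

Running a check like this over the whole dataset before training turns a quiet formatting mistake into an immediate ValueError.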

Step 4: Train

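import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Hypothetical path: a JSONL file with one {"messages": [...]} record per line
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Prefer bf16 where the GPU supports it; fall back to fp16 otherwise
use_bf16 = torch.cuda.is_bf16_supported()

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,  # newer TRL releases name this argument processing_class
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./gemma-finetuned",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size = 2 x 4 = 8
        num_train_epochs=3,
        learning_rate=2e-4,
        warmup_steps=10,
        bf16=use_bf16,
        fp16=not use_bf16,
        logging_steps=10,
        save_strategy="epoch"
    )
)
trainer.train()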
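
The batch size, gradient accumulation, and epoch settings together determine how many optimizer steps a run takes, which in turn tells you whether warmup_steps=10 is a sensible share of training. The arithmetic can be sketched as follows; training_schedule is a hypothetical helper, and the exact step count reported by the trainer may differ by one per epoch depending on how it rounds the last partial batch.

```python
import math


def training_schedule(num_examples, per_device_batch=2, grad_accum=4,
                      epochs=3, num_gpus=1):
    """Return effective batch size and approximate optimizer-step counts."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


print(training_schedule(1000))
# {'effective_batch': 8, 'steps_per_epoch': 125, 'total_steps': 375}
```

At the low end of the data range, 200 examples give 25 steps per epoch and 75 steps in total, so 10 warmup steps cover roughly 13% of the run.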

Step 5: How much data do you need?

Task Type | Examples Needed | Typical Use Case
Style transfer | 200–1,000 | Match a specific writing voice or tone
Task-specific | 500–5,000 | Classification, extraction, structured output
Domain adaptation | 10,000–50,000 | Specialized vocabulary, industry knowledge

Role name matters
Gemma uses "model" for assistant responses, not "assistant". Using the wrong role name runs without error but degrades output quality. Double-check your data formatting.

Learning rate sensitivity
Stay within the 1e-4 to 3e-4 range for learning rate. Values above 5e-4 tend to destabilize training with QLoRA. Start at 2e-4 and adjust based on your loss curve.

Exporting Your Model

After fine-tuning, you have three export options depending on where you plan to deploy:

Option A: LoRA adapter only

Saves just the fine-tuned weights (typically 50–200 MB). Load the base model separately and apply the adapter at inference time. Best when you need to switch between multiple fine-tunes on the same base.

# Save LoRA adapter
model.save_pretrained("./gemma-lora-adapter")
tokenizer.save_pretrained("./gemma-lora-adapter")

# Load later
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")
model = PeftModel.from_pretrained(base_model, "./gemma-lora-adapter")

Option B: Merged model

Merges the adapter back into the base model weights. Produces a standalone model that does not require PEFT at inference. Useful for deployment to environments where you want a single model directory.

# Merge and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./gemma-merged")
tokenizer.save_pretrained("./gemma-merged")

Option C: GGUF export (for Ollama / llama.cpp)

Converts to GGUF format with your choice of quantization. Deploy locally with Ollama or any llama.cpp-compatible tool:

# Using Unsloth's built-in export
model.save_pretrained_gguf(
    "./gemma-gguf",
    tokenizer,
    quantization_method="q4_k_m"
)

# Then in Ollama:
# ollama create my-gemma -f ./Modelfile
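
The ollama create command reads a Modelfile that points at the exported weights. If the export did not leave one next to the GGUF file, it can be written by hand. The sketch below renders a minimal one with a FROM line and one optional PARAMETER line; render_modelfile is a hypothetical helper, and the GGUF file name is a placeholder, so check ./gemma-gguf for the name the export actually produced.

```python
def render_modelfile(gguf_path, temperature=0.7):
    """Return the text of a minimal Ollama Modelfile for a local GGUF file."""
    lines = [
        f"FROM {gguf_path}",                     # weights to import
        f"PARAMETER temperature {temperature}",  # optional sampling default
    ]
    return "\n".join(lines) + "\n"


# Placeholder file name: replace with the file found in ./gemma-gguf
modelfile_text = render_modelfile("./gemma-gguf/model.gguf")
print(modelfile_text)
```

Save the printed text as a file named Modelfile in the working directory, then run the ollama create command shown above.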

Hardware Planning

Match your GPU to the model and task. These numbers assume 4-bit quantization (QLoRA / GGUF Q4_K_M), which is what most practitioners use for local work.

GPU | VRAM | Inference | Fine-Tune (QLoRA)
RTX 3060 / T4 | 12–16 GB | E2B, E4B | E2B
RTX 3070 / 3080 | 8–12 GB | E2B, E4B | E2B, E4B (tight)
RTX 4090 / A5000 | 24 GB | All models | E2B, E4B, 26B MoE
A6000 / A100 40GB | 40–48 GB | All models | All models
H100 / A100 80GB | 80 GB | All models (fp16) | All models (full fine-tune)
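
A rough way to sanity-check these numbers is to compute the weight footprint directly: parameters times bits per weight, divided by 8 to get bytes. The sketch below does that and then applies a flat 20% overhead for activations and cache. The overhead factor is an assumption chosen for illustration, not a measurement, and both function names are hypothetical.

```python
def weight_memory_gb(params_billions, bits_per_weight=4):
    """Memory for the weights alone: parameters x bits / 8, in GB (1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def estimated_vram_gb(params_billions, bits_per_weight=4, overhead=1.2):
    """Weights plus a flat overhead factor (an assumption, not a measurement)."""
    return weight_memory_gb(params_billions, bits_per_weight) * overhead


for name, size in [("26B MoE", 26), ("31B Dense", 31)]:
    print(name, weight_memory_gb(size), round(estimated_vram_gb(size), 1))
# 26B MoE 13.0 15.6
# 31B Dense 15.5 18.6
```

Both estimates land inside the 4-bit VRAM ranges quoted in the model table. Actual usage grows with context length and batch size, so treat the result as a floor rather than a guarantee.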

Cloud alternative: If you do not own a GPU, Google Colab (free tier with T4) handles E2B and E4B inference. For the 26B MoE and larger models, Colab Pro ($10/month for A100 access) or Lambda Labs ($1.10/hr for A100 40GB) are cost-effective starting points.


Frequently Asked Questions

CUDA out of memory when loading 27B
Load in 4-bit with BitsAndBytes or use the Ollama GGUF path. If using Transformers, add quantization_config=BitsAndBytesConfig(load_in_4bit=True) and device_map="auto". Close other GPU processes (browsers, game launchers) that may be holding VRAM.

Ollama says "model not found"
Check your spelling. The correct name is gemma3 (no space, no hyphen). Run ollama list to see all downloaded models. If you need a specific size, append the tag: ollama run gemma3:27b.

Training loss is not decreasing
Check three things: (1) Verify your data uses "model" as the role, not "assistant". (2) Confirm learning rate is between 1e-4 and 3e-4. (3) Make sure you have at least 200 training examples. If loss is flat from the start, your data formatting is likely wrong.

How do I run Gemma on Apple Silicon?
Two options: (1) Ollama natively supports Apple Silicon and uses Metal for GPU acceleration. Just ollama run gemma3. (2) MLX, Apple's machine learning framework, supports Gemma models optimized for M1/M2/M3 chips. Check mlx-examples on GitHub for Gemma-specific configurations.

Can I use Gemma for commercial products?
Yes. Gemma is released under the Apache 2.0 license, which permits commercial use, modification, and redistribution. There are no output usage restrictions or mandatory attribution requirements beyond the license terms. This applies to all Gemma variants, including fine-tuned derivatives you create.

Verified May 2026 | grounded from GEMMA notebook (571543bd), 72 sources
Gemma is a trademark of Google LLC. This article is an independent editorial publication by Tech Jacks Solutions. We are not affiliated with Google. All trademarks are property of their respective owners.
Before You Use AI
Your Privacy
Gemma models downloaded and run locally do not send data to Google or any third party. Your prompts, outputs, and training data stay on your hardware. When accessing Gemma through Google AI Studio or the Hugging Face Inference API, input data is processed by those providers' infrastructure.
Review the privacy policies of any hosted service you use. Enterprise deployments should evaluate data residency requirements before selecting a hosting provider.
Mental Health & AI Dependency
Gemma is an open-weight language model that generates text based on statistical patterns. Its outputs can appear authoritative while being factually incorrect. Over-reliance on any AI system without human verification creates risk, especially in high-stakes domains. If you are experiencing distress:
  • 988 Suicide & Crisis Lifeline: Call or text 988
  • SAMHSA Helpline: 1-800-662-4357
  • Crisis Text Line: Text HOME to 741741
AI systems can produce plausible-sounding but incorrect guidance. For mental health, medical, legal, or financial decisions, always consult a qualified professional.
Your Rights & Our Transparency
Under GDPR (EU) and CCPA (California), you have the right to access, correct, and delete personal data processed by AI systems. Open-weight models like Gemma can be audited directly, but fine-tuned derivatives may introduce new biases from training data.
Our analysis is based on publicly available documentation and verified testing. The EU AI Act establishes risk-based classification requirements for AI systems deployed in the European Union.