What hardware do I need to run Gemma locally?

It depends on the model. Gemma E2B runs on about 3 GB of VRAM with 4-bit quantization (RTX 3060 or T4). The 26B MoE variant needs 14-16 GB (RTX 4090 or A5000), and the 31B Dense model requires 16-22 GB.

How do I install Gemma with Ollama?

Install Ollama from ollama.com, then run ollama run gemma3 in your terminal. Ollama downloads the GGUF-quantized model automatically.

What is the difference between Gemma 3 and Gemma 3n?

Gemma 3n (E2B and E4B) targets efficiency on smaller hardware. The E4B variant supports 128K context and audio input. Gemma 3 27B comes in MoE and Dense configurations for higher capability tasks.

Can I fine-tune Gemma on a single GPU?

Yes. QLoRA with Unsloth lets you fine-tune Gemma on a single consumer GPU. It reduces memory usage by about 70% and trains roughly 2x faster than standard fine-tuning.

Is Gemma free to use?

Yes. All Gemma models are released under the Apache 2.0 license. You can download, modify, fine-tune, and deploy them commercially without fees.

How to Use Gemma: Local Setup, API & Fine-Tuning Guide

Gemma is Google's open-weight model family released under the Apache 2.0 license. This guide walks through every step: picking a model, running it locally, connecting via API, and fine-tuning with QLoRA. No cloud credits required for local setups.

RTX 4090: runs 26B MoE (4-bit). Source: Google Gemma docs.

License: Apache 2.0, fully open. Source: Gemma license.

Deployment tools: Ollama, llama.cpp, HF, vLLM, MLX.

Training examples needed: 200–50K. Source: Unsloth docs.

Prerequisites

Before you start, make sure you have these basics covered.

Setup Checklist

Python 3.10+ installed (3.11 recommended). Verify with python --version.

NVIDIA GPU with CUDA (for local inference). CPU works for E2B but is slow. Minimum RTX 3060 / T4 for 4-bit quantized models.

Virtual environment created. Run python -m venv gemma-env and activate it.

Ollama installed (optional, for easiest local path). Download from ollama.com/download.

Hugging Face account (optional, for Transformers path). Sign up at huggingface.co/join.

Choose Your Model

Gemma ships in four primary configurations under the Gemma 3 and Gemma 3n families. The right choice depends on your hardware and workload. All models use the Apache 2.0 license, so there are no commercial restrictions.

Model	Parameters	Context	VRAM (4-bit)	Best For
Gemma 4 E2B	2B	8K	~3 GB	Edge devices, rapid prototyping, resource-constrained environments
Gemma 4 E4B	4B	128K	~6 GB	Long-context tasks, audio input, balanced performance
Gemma 4 26B MoE	26B (MoE)	128K	~14–16 GB	Production workloads, high throughput, efficiency at scale
Gemma 4 31B Dense	31B (Dense)	256K	~16–22 GB	Maximum capability, long-document analysis, research

Quick pick: Start with E4B if you have a modern gaming GPU (RTX 3070+). It handles most tasks well and supports 128K context. Move to 26B MoE when you need stronger reasoning or multi-turn conversation quality.
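The table and quick pick above can be sketched as a small lookup. This is only an illustration: the VRAM figures are the upper bounds from the table, while the pick_model helper and its headroom default are hypothetical and not part of any Gemma tooling. Real usage also depends on context length and runtime overhead.

```python
from typing import Optional

# Upper-bound 4-bit VRAM figures (GB), copied from the model table above.
MODEL_VRAM_GB = [
    ("E2B", 3),
    ("E4B", 6),
    ("26B MoE", 16),
    ("31B Dense", 22),
]


def pick_model(vram_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the largest model whose 4-bit footprint fits, or None.

    headroom_gb is an assumed safety margin for the KV cache and the
    display driver; it is not a figure from the guide.
    """
    usable = vram_gb - headroom_gb
    fitting = [name for name, need in MODEL_VRAM_GB if need <= usable]
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    for card, vram in [("RTX 3070", 8), ("RTX 3060", 12), ("RTX 4090", 24)]:
        print(f"{card} ({vram} GB): {pick_model(vram)}")
```

Treat the result as a starting point and confirm it by loading the model, since long prompts can add several GB on top of the weights.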

Local Setup with Ollama

Ollama is the fastest path from zero to running Gemma. It handles model downloading, quantization, and serving in a single binary. No Python required.

Install and run

After installing Ollama from ollama.com, open your terminal:

# Pull and run Gemma 3 (defaults to the 4B variant)
ollama run gemma3

# Run a specific size
ollama run gemma3:1b
ollama run gemma3:27b

# Run in server mode for API access
ollama serve

That single ollama run command downloads the GGUF-quantized model (usually Q4_K_M) and starts an interactive chat. The first run takes a few minutes, depending on your connection; later launches load the model from the local cache and skip the download.

Use the Ollama API

With ollama serve running in the background, you get a local REST API on port 11434:

# Chat completion (cURL)
curl http://localhost:11434/api/chat -d '{
  "model": "gemma3",
  "messages": [{"role": "user", "content": "Explain QLoRA in 3 sentences."}]
}'

# Generate (streaming)
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Write a Python function to parse CSV files."
}'

Ollama also exposes an OpenAI-compatible endpoint under /v1, so most OpenAI client libraries work out of the box. Point your base_url to http://localhost:11434/v1 and use gemma3 as the model name.
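The same chat request as the cURL example can be sent from Python with only the standard library. The sketch below targets the native /api/chat endpoint on the default port and sets "stream" to false so the reply arrives as a single JSON object. The helper names build_chat_request and chat are hypothetical, and the code assumes a server is already running locally.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default port from the text above


def build_chat_request(prompt: str, model: str = "gemma3") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a chunk stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "gemma3", timeout: float = 120.0) -> str:
    """Send the prompt and return the reply text (needs a running server)."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With the server up, print(chat("Explain QLoRA in 3 sentences.")) prints the model's answer. If the server is not reachable, urlopen raises an OSError subclass that the caller can catch.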

One command takes you from install to running Gemma locally. No Python, no API keys, no cloud dependencies.

Python Setup with Transformers

For programmatic control, fine-tuning, or integration into Python applications, use the Hugging Face Transformers library. This path gives you full access to model internals.

Install dependencies

# Core stack
pip install transformers torch accelerate

# Optional: multimodal support
pip install torchvision   # images
pip install librosa        # audio (E4B)
pip install torchcodec     # video

Load and run inference

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-3-4b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "What is retrieval-augmented generation?"}
]
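inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,  # returns input_ids and attention_mask together
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))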

The device_map="auto" argument handles GPU placement automatically. If your model does not fit in VRAM, it spills to CPU RAM (slower but functional).

4-bit quantized loading

To run larger models on limited hardware, load in 4-bit:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig  # needs: pip install bitsandbytes

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-27b-it",
    quantization_config=bnb_config,
    device_map="auto"
)

The bitsandbytes library (needed for 4-bit loading) has limited Windows support. On Windows, use WSL2 or switch to the Ollama path for quantized inference.
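To see why 4-bit loading matters, it helps to estimate the memory taken by the weights alone. The sketch below uses the rule of thumb bytes = parameters x bits / 8. It is a rough estimate under stated assumptions: it ignores quantization constants, the KV cache, activations, and framework overhead, so real usage is higher. The function name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    for params in (4, 27):
        bf16 = weight_memory_gb(params, 16)
        four_bit = weight_memory_gb(params, 4)
        print(f"{params}B params: bf16 ~{bf16:.1f} GB, 4-bit ~{four_bit:.1f} GB")
```

For 27B parameters this gives about 54 GB at bf16 and about 13.5 GB at 4-bit, which is the difference between needing a multi-GPU server and fitting on a single 24 GB card.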

Running via API

If you want to skip local setup entirely, Gemma is available through several hosted APIs. These options trade hardware costs for per-request pricing.

Google AI Studio

Google AI Studio provides free-tier access to Gemma models with a playground interface and API keys. It supports all Gemma variants and is the simplest way to experiment without any local installation.

Hugging Face Inference API

Hugging Face hosts Gemma models on their serverless inference infrastructure. The free tier handles basic experimentation; the Pro plan ($9/month) increases rate limits:

from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_YOUR_TOKEN")
response = client.chat.completions.create(
    model="google/gemma-3-4b-it",
    messages=[{"role": "user", "content": "Summarize this article."}],
    max_tokens=256
)
print(response.choices[0].message.content)

Third-party providers

Several cloud providers offer Gemma endpoints with OpenAI-compatible APIs: Together AI, Fireworks AI, and Groq among others. Check each provider's model catalog for available Gemma sizes and pricing.

Practitioner note: For production workloads that need predictable latency, self-host with vLLM or SGLang. API providers work well for prototyping but introduce a third-party dependency and variable response times.

Fine-Tuning with QLoRA

QLoRA (Quantized Low-Rank Adaptation) lets you fine-tune Gemma on a single consumer GPU. Combined with Unsloth, you get roughly 2x faster training and 70% less memory usage compared to standard fine-tuning.

70% memory reduction with QLoRA + Unsloth. Fine-tune Gemma 4B on an RTX 3060 with 6 GB VRAM.

Step 1: Load model in 4-bit

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it",
    max_seq_length=4096,
    load_in_4bit=True
)

Step 2: Attach LoRA adapters

model = FastModel.get_peft_model(
    model,
    r=16,                # LoRA rank (16-64 recommended)
    lora_alpha=32,       # Alpha = 2x rank
    target_modules=[
        "q_proj", "k_proj", "v_proj",
        "o_proj", "gate_proj",
        "up_proj", "down_proj"
    ],
    lora_dropout=0.05
)
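The rank r controls how many trainable parameters the adapters add. For a linear layer with input size d_in and output size d_out, LoRA adds two small matrices with r x (d_in + d_out) parameters in total. The sketch below applies that formula to the seven target modules listed above. The layer dimensions and layer count are hypothetical round numbers chosen for illustration; read the real values from model.config for your checkpoint.

```python
def lora_params(shapes: dict, r: int, num_layers: int) -> int:
    """Trainable parameters added by LoRA: r * (d_in + d_out) per adapted matrix."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# HYPOTHETICAL dimensions for illustration only (not a real Gemma config).
HIDDEN, KV_DIM, INTERMEDIATE, LAYERS = 2048, 1024, 8192, 30

SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

if __name__ == "__main__":
    for rank in (16, 32, 64):
        n = lora_params(SHAPES, rank, LAYERS)
        # Adapter size if saved in 16-bit (2 bytes per parameter)
        print(f"r={rank}: {n:,} trainable params, ~{n * 2 / 1e6:.0f} MB at 16-bit")
```

The count scales linearly with r, so doubling the rank doubles both the trainable parameters and the size of the saved adapter.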

Step 3: Format your training data

Gemma uses "model" as the assistant role (not "assistant"). This is a common gotcha that causes silent training failures.

# Correct format for Gemma
training_example = {
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to reverse a list."},
        {"role": "model", "content": "def reverse_list(lst):\n    return lst[::-1]"}
    ]
}
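The role name matters because it ends up inside the turn marker that the model saw during training. The sketch below renders a message list into Gemma-style turns by hand so you can inspect what the training text looks like. It rests on assumptions taken from the published Gemma 2/3 chat templates rather than from this guide: "assistant" is normalized to "model", and a leading system message is folded into the first user turn. The function name render_gemma_turns is hypothetical, and tokenizer.apply_chat_template remains the authoritative renderer for any given checkpoint.

```python
def render_gemma_turns(messages: list, add_generation_prompt: bool = False) -> str:
    """Render chat messages using Gemma-style turn markers (BOS token omitted)."""
    system_prefix = ""
    if messages and messages[0]["role"] == "system":
        # Assumption: no dedicated system turn; prepend it to the first user turn.
        system_prefix = messages[0]["content"].strip() + "\n\n"
        messages = messages[1:]

    parts = []
    for i, msg in enumerate(messages):
        # Normalize both spellings to the marker the model expects.
        role = "model" if msg["role"] in ("assistant", "model") else msg["role"]
        content = msg["content"].strip()
        if i == 0 and role == "user":
            content = system_prefix + content
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")

    if add_generation_prompt:
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


if __name__ == "__main__":
    example = [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to reverse a list."},
        {"role": "model", "content": "def reverse_list(lst):\n    return lst[::-1]"},
    ]
    print(render_gemma_turns(example))
```

Printing one rendered example before training is a cheap way to confirm that every response sits inside a model turn.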

Step 4: Train

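import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Placeholder path: a JSONL file with one {"messages": [...]} record per line (Step 3 format)
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Use bf16 where the GPU supports it (Ampere and newer), otherwise fp16 (e.g. T4)
use_bf16 = torch.cuda.is_bf16_supported()

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,  # newer TRL releases name this argument processing_class
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./gemma-finetuned",
        per_device_train_batch_size=2,   # effective batch size = 2 x 4 = 8
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        warmup_steps=10,
        bf16=use_bf16,
        fp16=not use_bf16,
        logging_steps=10,
        save_strategy="epoch"
    )
)
trainer.train()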
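The batch settings above determine how many optimizer steps a run takes, which in turn decides whether warmup_steps=10 is a small or large share of training. The sketch below does that arithmetic. It is an approximation: the function name training_steps is hypothetical, and the trainer's own rounding of partial batches can differ by a step or so per epoch.

```python
import math


def training_steps(num_examples, per_device_batch=2, grad_accum=4, epochs=3, num_gpus=1):
    """Return (effective_batch, steps_per_epoch, total_steps), approximately."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


if __name__ == "__main__":
    for n in (200, 1000, 10000):
        batch, per_epoch, total = training_steps(n)
        warmup_share = 10 / total  # warmup_steps=10 from the config above
        print(f"{n} examples: batch {batch}, {per_epoch} steps/epoch, "
              f"{total} total, warmup {warmup_share:.1%}")
```

With 1,000 examples the run takes about 375 steps, so 10 warmup steps are under 3% of training. With 200 examples the run has only about 75 steps, and the same warmup is roughly 13%.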

Step 5: How much data do you need?

Task Type	Examples Needed	Typical Use Case
Style transfer	200–1,000	Match a specific writing voice or tone
Task-specific	500–5,000	Classification, extraction, structured output
Domain adaptation	10,000–50,000	Specialized vocabulary, industry knowledge

Gemma uses "model" for assistant responses, not "assistant". Using the wrong role name runs without error but degrades output quality. Double-check your data formatting.
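Because a wrong role name raises no error, a quick check over the dataset before training can catch it. The sketch below follows this guide's convention ("system", "user", "model") and reports problems per example. The function name find_format_problems and the exact rules are hypothetical; adjust them to your own data.

```python
ALLOWED_ROLES = {"system", "user", "model"}


def find_format_problems(example: dict) -> list:
    """Return a list of human-readable problems for one training example."""
    messages = example.get("messages", [])
    if not messages:
        return ["no messages"]

    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role == "assistant":
            problems.append(f"message {i}: role 'assistant' should be 'model'")
        elif role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if not str(msg.get("content", "")).strip():
            problems.append(f"message {i}: empty content")

    if messages[-1].get("role") != "model":
        problems.append("last message is not a 'model' turn")
    return problems


if __name__ == "__main__":
    bad = {"messages": [
        {"role": "user", "content": "Reverse a list."},
        {"role": "assistant", "content": "lst[::-1]"},
    ]}
    for problem in find_format_problems(bad):
        print(problem)
```

Running this over every record and stopping on the first non-empty result costs seconds and avoids a wasted training run.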

Stay within the 1e-4 to 3e-4 range for learning rate. Values above 5e-4 tend to destabilize training with QLoRA. Start at 2e-4 and adjust based on your loss curve.

Exporting Your Model

After fine-tuning, you have three export options depending on where you plan to deploy:

Option A: LoRA adapter only

Saves just the fine-tuned weights (typically 50–200 MB). Load the base model separately and apply the adapter at inference time. Best when you need to switch between multiple fine-tunes on the same base.

# Save LoRA adapter
model.save_pretrained("./gemma-lora-adapter")
tokenizer.save_pretrained("./gemma-lora-adapter")

# Load later
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")
model = PeftModel.from_pretrained(base_model, "./gemma-lora-adapter")

Option B: Merged model

Merges the adapter back into the base model weights. Produces a standalone model that does not require PEFT at inference. Useful for deployment to environments where you want a single model directory.

# Merge and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./gemma-merged")
tokenizer.save_pretrained("./gemma-merged")

Option C: GGUF export (for Ollama / llama.cpp)

Converts to GGUF format with your choice of quantization. Deploy locally with Ollama or any llama.cpp-compatible tool:

# Using Unsloth's built-in export
model.save_pretrained_gguf(
    "./gemma-gguf",
    tokenizer,
    quantization_method="q4_k_m"
)

# Then in Ollama:
# ollama create my-gemma -f ./Modelfile
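The ollama create command reads a Modelfile, which the export step does not necessarily produce in the form you want. A minimal one needs a FROM line pointing at the GGUF file, and can set defaults with PARAMETER lines. The sketch below writes one from Python. The GGUF path shown in the usage note is a placeholder (check the actual filename in your export directory), and the helper names are hypothetical.

```python
from pathlib import Path


def modelfile_text(gguf_path: str, temperature: float = 0.7) -> str:
    """Build the contents of a minimal Ollama Modelfile."""
    return (
        f"FROM {gguf_path}\n"
        f"PARAMETER temperature {temperature}\n"
    )


def write_modelfile(gguf_path: str, out_dir: str = ".", temperature: float = 0.7) -> Path:
    """Write the Modelfile into out_dir and return its path."""
    target = Path(out_dir) / "Modelfile"
    target.write_text(modelfile_text(gguf_path, temperature), encoding="utf-8")
    return target
```

For example, write_modelfile("./gemma-gguf/model.gguf") creates ./Modelfile, after which ollama create my-gemma -f ./Modelfile registers the model and ollama run my-gemma starts it.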

Hardware Planning

Match your GPU to the model and task. These numbers assume 4-bit quantization (QLoRA / GGUF Q4_K_M), which is what most practitioners use for local work.

GPU	VRAM	Inference	Fine-Tune (QLoRA)
RTX 3060 / T4	12–16 GB	E2B, E4B	E2B
RTX 3070 / 3080	8–12 GB	E2B, E4B	E2B, E4B (tight)
RTX 4090 / A5000	24 GB	All models	E2B, E4B, 26B MoE
A6000 / A100 40GB	40–48 GB	All models	All models
H100 / A100 80GB	80 GB	All models (fp16)	All models (full fine-tune)

Cloud alternative: If you do not own a GPU, Google Colab (free tier with T4) handles E2B and E4B inference. For 27B models, Colab Pro ($10/month for A100 access) or Lambda Labs ($1.10/hr for A100 40GB) are cost-effective starting points.

Frequently Asked Questions

CUDA out of memory when loading 27B

Load in 4-bit with BitsAndBytes or use the Ollama GGUF path. If using Transformers, add quantization_config=BitsAndBytesConfig(load_in_4bit=True) and device_map="auto". Close other GPU processes (browsers, game launchers) that may be holding VRAM.

Ollama says "model not found"

Check your spelling. The correct name is gemma3 (no space, no hyphen). Run ollama list to see all downloaded models. If you need a specific size, append the tag: ollama run gemma3:27b.

Training loss is not decreasing

Check three things: (1) Verify your data uses "model" as the role, not "assistant". (2) Confirm learning rate is between 1e-4 and 3e-4. (3) Make sure you have at least 200 training examples. If loss is flat from the start, your data formatting is likely wrong.

How do I run Gemma on Apple Silicon?

Two options: (1) Ollama natively supports Apple Silicon and uses Metal for GPU acceleration. Just ollama run gemma3. (2) MLX, Apple's machine learning framework, supports Gemma models optimized for M1/M2/M3 chips. Check mlx-examples on GitHub for Gemma-specific configurations.

Can I use Gemma for commercial products?

Yes. Gemma is released under the Apache 2.0 license, which permits commercial use, modification, and redistribution. There are no output usage restrictions or mandatory attribution requirements beyond the license terms. This applies to all Gemma variants, including fine-tuned derivatives you create.

Verified May 2026 | grounded from GEMMA notebook (571543bd), 72 sources

Gemma is a trademark of Google LLC. This article is an independent editorial publication by Tech Jacks Solutions. We are not affiliated with Google. All trademarks are property of their respective owners.
