What Is Google Gemma? Open Models, Benchmarks & How to Use It
The short version: Google Gemma is DeepMind's open-weight model family, built from the same research that produces Gemini. The latest generation, Gemma 4, ships four models under the Apache 2.0 license, ranging from a 2-billion-parameter edge model to a 31-billion-parameter dense transformer that ranks #3 among open models on the Arena AI text leaderboard. The standout is the 26B MoE variant, which routes each token through just 3.8 billion of its 26 billion total parameters, delivering large-model intelligence at a fraction of the compute cost.
What Is Google Gemma?
Gemma is Google DeepMind's family of open-weight language models. The name comes from the Latin word for "gemstone," and the connection to Gemini is more than cosmetic. Gemma models are distilled from the same research infrastructure and training data pipelines that produce Google's flagship Gemini models. The difference is in the delivery: where Gemini is a closed API product, Gemma ships as downloadable weights under the Apache 2.0 license.
That distinction matters for practitioners. Apache 2.0 means no usage caps, no monthly active user limits, and no restrictions on commercial deployment. Its main obligations apply when you redistribute: include a copy of the license and preserve existing copyright and attribution notices. You can fine-tune Gemma, merge it with other models, deploy it on your own infrastructure, and sell products built on top of it. It is the same license used by TensorFlow and Kubernetes.
Since the first Gemma models arrived in February 2024, the family has grown to cover everything from 2-billion-parameter edge models that run on mobile phones to 31-billion-parameter dense transformers that compete with models five times their size on reasoning benchmarks. The community has responded: Gemma has surpassed 150 million downloads on Hugging Face, with over 70,000 community-created variants ranging from medical assistants to code generators to safety classifiers.
The Gemma 4 Model Family
Released in April 2026, Gemma 4 is the fourth generation and the most significant architectural leap in the family's history. Google DeepMind shipped four models simultaneously, each targeting a different compute envelope: E2B, E4B, 26B MoE, and 31B Dense.
The edge models (E2B and E4B) are the only variants that support audio input. The larger 26B and 31B models handle text, image, and video but not audio. All four models share the same tokenizer and the same core transformer block design, which means techniques that work on one transfer cleanly to the others.
How Mixture-of-Experts Works in Gemma 4
The Gemma 4 26B MoE is the most architecturally interesting model in the family. Traditional dense transformers pass every token through every parameter, so a 26-billion-parameter dense model touches all 26 billion parameters for each token it processes. The MoE approach changes this equation fundamentally.
Gemma 4 26B contains 128 expert sub-networks. For each token, a learned router selects only the most relevant experts. The result: just 3.8 billion parameters activate per token, while the full 26 billion remain available as specialized knowledge pools. This is not an approximation or a quality trade-off. The model achieves roughly the same intelligence as a 27-billion-parameter dense model while consuming compute equivalent to a 4-billion-parameter model.
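To make the routing step concrete, here is a minimal sketch of top-k expert selection in plain Python. It is an illustration, not Gemma's implementation: the number of experts chosen per token (`TOP_K`) is an assumption, the toy experts are simple scaling functions, and the router logits are random rather than produced by a learned projection of the token's hidden state.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(router_logits)
    # Indices of the top_k highest-probability experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in chosen)
    return {i: probs[i] / kept for i in chosen}

def moe_forward(token, experts, router_logits, top_k):
    """Weighted sum of the chosen experts' outputs; all other experts are skipped."""
    weights = route_token(router_logits, top_k)
    return sum(w * experts[i](token) for i, w in weights.items())

NUM_EXPERTS = 128  # expert count stated in the article
TOP_K = 8          # assumption for illustration only

random.seed(0)
# Toy experts: expert i multiplies its input by (i + 1).
experts = [lambda x, s=i: x * (s + 1) for i in range(NUM_EXPERTS)]
# Stand-in for the learned router's output for a single token.
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

weights = route_token(logits, TOP_K)
output = moe_forward(1.0, experts, logits, TOP_K)
print(f"{len(weights)} of {NUM_EXPERTS} experts ran; weights sum to {sum(weights.values()):.3f}")
```

Only the selected experts are ever called, which is where the compute savings come from: the other experts' parameters exist but do no work for this token.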
The practical impact is significant. Per-token inference compute drops by roughly 7x compared to a dense model of the same total size (26 billion total parameters versus 3.8 billion active). Memory is a different matter: every expert has to stay loaded, so memory requirements scale with the total parameter count, not the active count. Quantization closes that gap. A quantized version of the 26B MoE fits in 16GB of VRAM, making it accessible to anyone with a mid-range GPU.
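The arithmetic behind these figures is easy to check. The sketch below uses only the parameter counts quoted above; the 4-bit and 16-bit precisions are assumptions chosen for illustration, and the memory estimate covers weights only (no KV cache or activations), assuming every expert's weights stay resident in memory.

```python
TOTAL_PARAMS = 26e9    # total parameters, from the article
ACTIVE_PARAMS = 3.8e9  # parameters active per token, from the article

def compute_ratio(total, active):
    """How many times fewer parameters are exercised per token than in a dense pass."""
    return total / active

def weight_memory_gb(params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params * bits_per_weight / 8 / 1e9

ratio = compute_ratio(TOTAL_PARAMS, ACTIVE_PARAMS)
mem_4bit = weight_memory_gb(TOTAL_PARAMS, 4)    # assumed 4-bit quantization
mem_16bit = weight_memory_gb(TOTAL_PARAMS, 16)  # assumed 16-bit weights

print(f"compute ratio: {ratio:.1f}x")
print(f"4-bit weights: {mem_4bit:.0f} GB, 16-bit weights: {mem_16bit:.0f} GB")
```

Under these assumptions the ratio comes out just under 7x, and 4-bit weights for all 26 billion parameters take about 13 GB, which is consistent with the 16GB figure once some headroom is left for the cache and activations.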
Why this matters for deployment: The 26B MoE ranks #6 among open models on the Arena AI text leaderboard while costing less to serve than many models half its total size. For production workloads where you need strong reasoning but cannot afford 80GB GPU servers, this is the model to benchmark first.
Benchmark Performance
Gemma 4 31B Dense posts competitive scores across four benchmarks covering reasoning, math, coding, and scientific knowledge. These are vendor-reported numbers from the Google DeepMind technical report published in April 2026.
Gemma 3 27B also holds its own on the Chatbot Arena leaderboard with an Elo score of 1338, reaching 98% of DeepSeek R1's score despite being a fraction of the size. Arena scores measure real-world preference through blind human evaluation, which tends to be a more reliable signal than static benchmarks alone.
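A quick way to read that 98% figure is to convert it back into a rating gap. The sketch below derives the implied DeepSeek R1 score from the two numbers quoted above and then applies the standard Elo expected-score formula. Treat the second step as an assumption: the leaderboard's exact rating method may differ from textbook Elo, so the resulting win rate is a rough guide only.

```python
GEMMA_3_27B_ELO = 1338  # from the article
FRACTION_OF_R1 = 0.98   # "98% of DeepSeek R1's score", from the article

def implied_opponent_rating(own_rating, fraction):
    """Back out the other model's rating from 'own = fraction * other'."""
    return own_rating / fraction

def expected_score(own_rating, opponent_rating):
    """Standard Elo expected score (probability-like value between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - own_rating) / 400))

implied_r1 = implied_opponent_rating(GEMMA_3_27B_ELO, FRACTION_OF_R1)
gap = implied_r1 - GEMMA_3_27B_ELO
win_rate = expected_score(GEMMA_3_27B_ELO, implied_r1)

print(f"implied R1 rating: {implied_r1:.0f} (gap of {gap:.0f} points)")
print(f"expected head-to-head score for Gemma 3 27B: {win_rate:.2f}")
```

On this reading, a 2% difference in score is a gap of under 30 points, which corresponds to the smaller model being preferred a little less than half the time.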
Multimodal Capabilities
Every Gemma 4 model processes multiple input types natively. This is not a bolted-on vision adapter; the multimodal capability is trained into the model from the start.
| Modality | E2B | E4B | 26B MoE | 31B Dense |
|---|---|---|---|---|
| Text | Yes | Yes | Yes | Yes |
| Image | Yes | Yes | Yes | Yes |
| Video | Yes | Yes | Yes | Yes |
| Audio | Yes | Yes | No | No |
| Context Length | 128K | 128K | 256K | 256K |
Audio support on the edge models is a deliberate design choice. E2B and E4B are intended for on-device applications where voice interaction is a primary interface: smart speakers, wearables, mobile assistants. The larger models focus on document understanding and video analysis workflows where audio processing is typically handled by a dedicated pipeline stage.
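The table above translates directly into a lookup you can use when shortlisting a model. This is a small sketch built only from the table's values; the `GemmaModel` class and `candidates` function are hypothetical helper names, and context length is kept in thousands of tokens as the table states it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GemmaModel:
    name: str
    modalities: frozenset
    context_k: int  # context length in thousands of tokens, as listed in the table

EDGE = frozenset({"text", "image", "video", "audio"})
LARGE = frozenset({"text", "image", "video"})

MODELS = [
    GemmaModel("E2B", EDGE, 128),
    GemmaModel("E4B", EDGE, 128),
    GemmaModel("26B MoE", LARGE, 256),
    GemmaModel("31B Dense", LARGE, 256),
]

def candidates(required_modalities, min_context_k=0):
    """Return names of models that accept every required modality and context size."""
    required = frozenset(required_modalities)
    return [
        m.name
        for m in MODELS
        if required <= m.modalities and m.context_k >= min_context_k
    ]

print(candidates({"audio"}))                      # edge models only
print(candidates({"video"}, min_context_k=200))   # larger models only
print(candidates({"audio"}, min_context_k=200))   # no single model covers both
```

The last call returns an empty list, which reflects the trade-off described above: a workload that needs both audio input and the longest context has to pair a larger model with a separate audio stage.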
Licensing & Cost
Gemma uses the Apache 2.0 license, one of the most permissive widely used open-source licenses. There are no restrictions on commercial use, no user caps, and no requirement to open-source derivative works. Redistribution does carry light obligations: you must include a copy of the license and keep existing copyright and attribution notices. Within those terms you can fine-tune Gemma, quantize it, merge it, deploy it behind a paid API, and sell products built on it.
For teams that want managed API access without running their own infrastructure, Google AI Studio offers Gemma models at competitive rates. The Gemma 3 4B model prices at $0.02 per million input tokens and $0.04 per million output tokens, making it roughly 10x cheaper than Llama 3.1 70B on comparable hosted platforms.
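To turn those per-token rates into a budget number, a small estimator is enough. The rates below are the Gemma 3 4B prices quoted above; the monthly token volumes are a hypothetical workload chosen purely for illustration.

```python
INPUT_USD_PER_M = 0.02   # USD per million input tokens, from the article
OUTPUT_USD_PER_M = 0.04  # USD per million output tokens, from the article

def hosted_cost_usd(input_tokens, output_tokens,
                    input_rate=INPUT_USD_PER_M, output_rate=OUTPUT_USD_PER_M):
    """Estimate hosted API cost in USD for a given token volume."""
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate

# Hypothetical monthly workload: 500M input tokens and 100M output tokens.
monthly = hosted_cost_usd(500_000_000, 100_000_000)
print(f"estimated monthly cost: ${monthly:.2f}")
```

Passing different `input_rate` and `output_rate` values lets you run the same workload against another provider's price sheet for a like-for-like comparison.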
Self-hosting eliminates per-token costs entirely. You pay only for the GPU compute to run inference. With quantized models and efficient serving frameworks like vLLM, a single consumer GPU can serve Gemma at meaningful throughput for batch workloads.
Gemma vs Llama
Gemma and Llama are the two dominant open model families, and they make fundamentally different architectural bets. Gemma uses a deeper, thinner architecture with more transformer layers and smaller hidden dimensions. Llama goes wider and shallower, with fewer layers but larger hidden dimensions. Neither approach is universally better, but they produce different performance profiles.
The licensing difference is equally significant. Gemma's Apache 2.0 license has no usage restrictions. Llama's community license includes a 700 million monthly active user cap, above which you need a separate commercial agreement with Meta. For most organizations this cap is irrelevant, but for platform companies or widely-distributed applications, it is a real constraint that Apache 2.0 avoids entirely.
When to Pick Gemma
- You need the smallest possible model for edge deployment (Gemma's 2B and 4B variants)
- Your application needs multimodal input including audio (edge models)
- You need Apache 2.0 licensing without any user caps
- Cost per token is a primary concern (Gemma 3 4B is priced at $0.02 per million input tokens)
- You want MoE efficiency for serving large-model intelligence on consumer hardware
When to Pick Llama
- You need models larger than 31B parameters (Llama 3.1 405B)
- Your existing infrastructure is optimized for Llama's architecture
- You need the widest community ecosystem for a specific task (Llama has more third-party tooling in some verticals)
- The 700M MAU cap does not apply to your use case
Specialized Variants
Beyond the base models, Google DeepMind has released purpose-built Gemma variants for specific domains. These are not just fine-tunes; they include architectural modifications and domain-specific training data.
Running Gemma Locally
One of Gemma's strongest selling points is how accessible it is for local deployment. The model weights are available directly from Hugging Face and through Google's Kaggle model hub.
Edge models (E2B, E4B): These run on consumer laptops, Raspberry Pi-class devices, and mobile phones. The E2B model requires roughly 4GB of RAM in quantized form. No dedicated GPU needed for inference at modest throughput.
26B MoE: The MoE architecture makes this model surprisingly accessible. With 4-bit quantization, such as the QLoRA setup in Unsloth, the 26B MoE fits in 16GB of VRAM. That puts it within reach of an NVIDIA RTX 4060 Ti or equivalent. For fine-tuning, Unsloth's memory-efficient approach means you can train on consumer hardware that would not support the full-precision model.
31B Dense: Requires 24GB+ VRAM for quantized inference (RTX 4090, A5000, or equivalent). For full-precision, plan for 48GB+ (A6000 or dual GPU setup).
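The memory figures above can be folded into a simple fit check. This sketch uses only the numbers quoted in this section; E4B is left out because no figure is given for it, the E2B entry refers to system RAM rather than VRAM, and `runnable` is a hypothetical helper name.

```python
# Minimum memory in GB as quoted in this section.
# The E2B figure is system RAM; the others are GPU VRAM.
REQUIREMENTS_GB = [
    ("E2B (quantized)", 4),
    ("26B MoE (quantized)", 16),
    ("31B Dense (quantized)", 24),
    ("31B Dense (full precision)", 48),
]

def runnable(memory_gb):
    """List the configurations whose quoted minimum fits in the given memory budget."""
    return [name for name, need in REQUIREMENTS_GB if memory_gb >= need]

for budget in (8, 16, 24, 48):
    print(f"{budget} GB -> {runnable(budget)}")
```

These are minimums for loading the model; long contexts and larger batch sizes need additional headroom on top.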
All models work with the standard open-source inference stack: Ollama, llama.cpp, vLLM, and Hugging Face Transformers. The Apache 2.0 license means no registration, no API keys, and no usage reporting required for local deployment.
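As one example of what local use looks like, the sketch below sends a prompt to a locally running Ollama server over its HTTP generate endpoint, using only the Python standard library. The model tag is a placeholder, not a confirmed name: check `ollama list` for the tag of the model you actually pulled. The code also assumes the server is listening on Ollama's default local port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "your-gemma-tag"  # hypothetical placeholder; replace with a tag from `ollama list`

def build_request(prompt, model=MODEL_TAG, url=OLLAMA_URL):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, model=MODEL_TAG, timeout=120):
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

With the server running and a model pulled, `generate("Summarize Apache 2.0 in one sentence.")` returns the completion as a string. No API key is involved because the request never leaves your machine.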
Release Timeline
Limitations & Honest Caveats