Gemma Pricing & Models: Sizes, Licensing & API Costs
Bottom line: Gemma 4 costs nothing to self-host. The Apache 2.0 license has no revenue caps, no user thresholds, and no redistribution restrictions. You pay only for hardware and electricity. Third-party API access starts at roughly $0.02 per million input tokens.
What Does Gemma Cost?
The short answer: nothing, if you run it yourself. Gemma 4 ships under the Apache 2.0 license. You download the model weights from Kaggle, Hugging Face, or Google AI Studio, load them onto your own GPU, and start generating. No license fees. No API keys. No monthly subscription.
The practical cost is hardware. A laptop with an RTX 4060 (8 GB VRAM) can run the E4B model at usable speeds. The 27B MoE variant needs a workstation-class GPU with 16 GB or more. If you already own the hardware, the incremental cost is electricity.
If you do not want to manage infrastructure, third-party providers serve Gemma models via API. Gemma 3 4B runs at approximately $0.02 per million input tokens and $0.04 per million output tokens through OpenRouter. That pricing puts Gemma among the cheapest API-accessible models available in 2026.
Google does not sell direct Gemma API access the way it sells Gemini. Gemma is a weights release, not a hosted service. You either self-host or use a third-party inference provider.
Licensing Deep Dive
Gemma's licensing history has two chapters. Earlier generations (Gemma 1, 2, and 3) shipped under a custom "Gemma Terms of Use" license. That license was source-available rather than truly open source. It allowed inspection and modification of the weights but imposed conditions on commercial redistribution that made corporate legal teams uncomfortable.
Gemma 4 changed the game. Google released the entire Gemma 4 family under Apache 2.0, which is one of the most permissive licenses in the software world. Here is what Apache 2.0 gives you:
- Commercial use with zero restrictions. Sell products, charge for services, embed Gemma in paid applications. No royalties.
- Redistribution of original or modified weights. Package Gemma inside your product and ship it to customers.
- Fine-tuning with no obligations beyond preserving the license notice. Your fine-tuned model stays yours.
- No user thresholds. Unlike some open-weight licenses, there is no cap on monthly active users or annual revenue.
- Patent grant. Apache 2.0 includes an explicit patent license from contributors, reducing legal risk for enterprise adoption.
Practical takeaway: If your legal team blocked an earlier Gemma deployment because of the Terms of Use, revisit the conversation. Apache 2.0 is pre-approved by most enterprise open-source review boards.
Model Size Guide
Gemma 4 ships in four sizes. The naming convention uses the effective parameter count (the number of parameters active during inference), not the total parameter count including embeddings. This distinction matters for memory planning.
The MoE (Mixture of Experts) architecture in the 27B variant deserves explanation. While the model contains 26 billion total parameters across 128 expert networks, only 3.8 billion activate for any given token. This means it runs faster and uses less memory than a 26B dense model would, while still benefiting from the collective knowledge stored across all experts.
AI Governance Charter
Establish your organization's AI principles in one document
Download Free →API Pricing by Provider
Google does not operate a dedicated Gemma API. Instead, third-party inference providers host Gemma models and charge per token. Pricing varies by provider, model size, and whether you commit to reserved capacity or use on-demand pricing.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Provider |
|---|---|---|---|
| Gemma 3 4B | ~$0.02 | ~$0.04 | OpenRouter |
| Llama 3.1 70B (comparison) | ~$0.20 | ~$0.20 | OpenRouter |
The cost gap is significant. Gemma 3 4B delivers roughly 10x cheaper inference than Llama 3.1 70B. The tradeoff is capability: a 4B parameter model cannot match a 70B model on complex reasoning tasks. But for classification, summarization, extraction, and straightforward generation, the smaller model often performs well enough that the cost savings justify the quality difference.
Self-Hosting Cost Calculator
Self-hosting eliminates per-token costs entirely. Your expenses are hardware (one-time or amortized) and electricity (ongoing). Here is a realistic breakdown for each model tier.
| Model | Minimum GPU | VRAM Needed | GPU Cost (New) | Electricity/Month |
|---|---|---|---|---|
| E2B | RTX 3060 / Apple M1 | ~3 GB | $300-$400 | $3-$8 |
| E4B | RTX 4060 / Apple M2 | ~6 GB | $300-$500 | $5-$12 |
| 27B MoE (4-bit) | RTX 4090 / A6000 | ~14-16 GB | $1,600-$4,500 | $15-$30 |
| 31B Dense (4-bit) | RTX 4090 / A6000 | ~16-22 GB | $1,600-$4,500 | $15-$30 |
For the E2B and E4B models, most developers already own capable hardware. The marginal cost of running Gemma locally is effectively just electricity. For the larger models, the RTX 4090 remains the best consumer-grade option. If you already own one for gaming or creative work, Gemma 4 is a free addition to your toolkit.
Cloud GPU alternative: Renting cloud GPU instances (Lambda Labs, RunPod, vast.ai) typically costs $0.50 to $2.00 per hour depending on the GPU tier. For intermittent use, this can be cheaper than buying hardware. For sustained workloads, owned hardware wins within 3 to 6 months.
Gemma vs Llama Licensing
This comparison matters because Gemma and Llama are the two most-deployed open-weight model families. Their licenses differ in one critical area.
| Clause | Gemma 4 (Apache 2.0) | Llama 3.x (Meta License) |
|---|---|---|
| Commercial use | Unrestricted | Allowed |
| User threshold | None | 700M MAU cap |
| Revenue cap | None | None stated |
| Redistribution | Allowed (preserve notice) | Allowed with restrictions |
| Fine-tuning | Unrestricted | Allowed |
| Patent grant | Explicit (Apache 2.0 Sec. 3) | Limited |
| OSI approved | Yes | No |
The 700 million monthly active user threshold in Meta's Llama license is irrelevant for most organizations. But if you are building infrastructure for a platform that could reach that scale, Gemma's Apache 2.0 removes a potential future licensing conversation entirely. More importantly, many enterprise procurement teams have pre-approved Apache 2.0 for internal use. Meta's custom license typically requires separate legal review.
Fine-Tuning Costs
Fine-tuning Gemma is where the Apache 2.0 license pays for itself. There are no API-based fine-tuning fees because there is no first-party API. You fine-tune locally using open-source tooling.
- Software cost: $0. Use Hugging Face Transformers, PEFT, or Unsloth. All open-source.
- Hardware for E2B/E4B: A single consumer GPU (RTX 3060 or better). QLoRA reduces memory requirements to a fraction of full fine-tuning.
- Hardware for 27B/31B: A single RTX 4090 can handle QLoRA fine-tuning. Full fine-tuning requires multi-GPU setups or cloud GPU rentals.
- Time: A typical QLoRA fine-tune on a domain-specific dataset (10K to 50K examples) takes 2 to 8 hours on an RTX 4090 for the E4B model.
- Electricity: At 450W GPU power draw and $0.12/kWh, an 8-hour fine-tuning run costs approximately $0.43 in electricity.
Compare this to fine-tuning a proprietary model through an API provider, which can cost hundreds to thousands of dollars per run depending on dataset size and model tier. The Gemma approach is orders of magnitude cheaper if you have the hardware.
When to Pay for Managed APIs
Self-hosting is not always the right call. Here are situations where paying for API-based Gemma access makes more sense than running it yourself.
For steady-state workloads where you process thousands of requests daily, the math almost always favors self-hosting. The crossover point depends on your volume: at Gemma 3 4B API pricing ($0.02/$0.04 per million tokens), you would need to process hundreds of millions of tokens monthly before an RTX 4090 pays for itself faster than the API route. For the larger models at higher per-token costs, the crossover arrives sooner.
Specialized Gemma Variants
Beyond the core language models, Google has released domain-specific Gemma variants. Each inherits the same licensing terms as the base model generation it was built on.
| Variant | Domain | Use Case |
|---|---|---|
| PaliGemma | Vision-language | Image captioning, visual question answering, document understanding |
| CodeGemma | Code | Code generation, completion, explanation, and refactoring |
| MedGemma | Medical | Clinical text analysis, medical Q&A (not for diagnosis) |
| ShieldGemma | Safety | Content filtering, toxicity detection, safety classification |
These variants cost the same as the base models to run: nothing for the software, your hardware for inference. They are particularly valuable because equivalent proprietary solutions (GPT-4 Vision, Claude with image inputs) charge premium per-token rates for multimodal capabilities.
Frequently Asked Questions
Is Gemma really free?
Yes. The Gemma 4 model weights are released under Apache 2.0. There is no licensing fee, no subscription, and no per-use charge from Google. Your costs are hardware (one-time) and electricity (ongoing). Third-party API providers charge for hosting, but that is their service fee, not a Google charge.
Can I use Gemma in a commercial product?
Yes, without restrictions under Gemma 4's Apache 2.0 license. You can embed Gemma in paid products, offer it as a service, or redistribute modified weights. The only requirement is preserving the Apache 2.0 license notice.
What changed from Gemma 3 to Gemma 4 licensing?
Gemma 1 through 3 used a custom "Gemma Terms of Use" license that was source-available but not OSI-approved open source. Gemma 4 switched to Apache 2.0, removing all commercial restrictions and making it compatible with enterprise open-source policies.
Which Gemma model should I start with?
Start with E4B. It runs on most modern laptops with a discrete GPU (6 GB VRAM), supports 128K context, and offers the best balance between capability and hardware requirements. Move to 27B MoE only if E4B's quality is insufficient for your task.
How does Gemma compare to GPT-4 on cost?
Self-hosted Gemma has zero marginal cost per token. GPT-4 charges $30 per million input tokens and $60 per million output tokens (as of early 2026). The quality gap exists, but for many production tasks like classification, extraction, and summarization, Gemma's smaller models deliver adequate results at a fraction of the cost.
Go Deeper
Resources from across Tech Jacks Solutions
FREEAI Governance Charter
Establish your organization's AI principles in one document
AI Career Paths
Explore roles that work with these tools daily
EU AI Act Guide
Check your compliance obligations under the EU AI Act
FREEAI Risk Management Template
Identify, assess, and mitigate AI deployment risks
FREEAI Bias Assessment
Evaluate bias risks before deploying any AI system