Gallery

Contacts

405 W. Greenlawn Ave Lansing, Michigan 48910

contact@techjacksolutions.com

+1-616-320-4064

Google Gemma

Gemma Pricing & Models: Sizes, Licensing & API Costs

Bottom line: Gemma 4 costs nothing to self-host. The Apache 2.0 license has no revenue caps, no user thresholds, and no redistribution restrictions. You pay only for hardware and electricity. Third-party API access starts at roughly $0.02 per million input tokens.

$0
License Cost
Apache 2.0
License Type
4
Model Sizes
256K
Max Context

What Does Gemma Cost?

The short answer: nothing, if you run it yourself. Gemma 4 ships under the Apache 2.0 license. You download the model weights from Kaggle, Hugging Face, or Google AI Studio, load them onto your own GPU, and start generating. No license fees. No API keys. No monthly subscription.

The practical cost is hardware. A laptop with an RTX 4060 (8 GB VRAM) can run the E4B model at usable speeds. The 27B MoE variant needs a workstation-class GPU with 16 GB or more. If you already own the hardware, the incremental cost is electricity.

10x
Gemma 3 4B costs roughly one-tenth the per-token price of Llama 3.1 70B through third-party API providers like OpenRouter.

If you do not want to manage infrastructure, third-party providers serve Gemma models via API. Gemma 3 4B runs at approximately $0.02 per million input tokens and $0.04 per million output tokens through OpenRouter. That pricing puts Gemma among the cheapest API-accessible models available in 2026.

Google does not sell direct Gemma API access the way it sells Gemini. Gemma is a weights release, not a hosted service. You either self-host or use a third-party inference provider.


Licensing Deep Dive

Gemma's licensing history has two chapters. Earlier generations (Gemma 1, 2, and 3) shipped under a custom "Gemma Terms of Use" license. That license was source-available rather than truly open source. It allowed inspection and modification of the weights but imposed conditions on commercial redistribution that made corporate legal teams uncomfortable.

Gemma 4 changed the game. Google released the entire Gemma 4 family under Apache 2.0, which is one of the most permissive licenses in the software world. Here is what Apache 2.0 gives you:

  • Commercial use with zero restrictions. Sell products, charge for services, embed Gemma in paid applications. No royalties.
  • Redistribution of original or modified weights. Package Gemma inside your product and ship it to customers.
  • Fine-tuning with no obligations beyond preserving the license notice. Your fine-tuned model stays yours.
  • No user thresholds. Unlike some open-weight licenses, there is no cap on monthly active users or annual revenue.
  • Patent grant. Apache 2.0 includes an explicit patent license from contributors, reducing legal risk for enterprise adoption.

Practical takeaway: If your legal team blocked an earlier Gemma deployment because of the Terms of Use, revisit the conversation. Apache 2.0 is pre-approved by most enterprise open-source review boards.


Model Size Guide

Gemma 4 ships in four sizes. The naming convention uses the effective parameter count (the number of parameters active during inference), not the total parameter count including embeddings. This distinction matters for memory planning.

Edge
Gemma 4 E2B
Phone and IoT deployments
Effective 2.3B
Total 5.1B
VRAM ~3 GB
Context 128K
MoE
Gemma 4 27B MoE
128 experts, 3.8B active per token
Total 26B
Active 3.8B
VRAM (4-bit) ~14-16 GB
Context 256K
Dense
Gemma 4 31B Dense
Maximum quality, highest VRAM
Effective 30.7B
Total 31B
VRAM (4-bit) ~16-22 GB
Context 256K

The MoE (Mixture of Experts) architecture in the 27B variant deserves explanation. While the model contains 26 billion total parameters across 128 expert networks, only 3.8 billion activate for any given token. This means it runs faster and uses less memory than a 26B dense model would, while still benefiting from the collective knowledge stored across all experts.


FREE TEMPLATE

AI Governance Charter

Establish your organization's AI principles in one document

Download Free →

API Pricing by Provider

Google does not operate a dedicated Gemma API. Instead, third-party inference providers host Gemma models and charge per token. Pricing varies by provider, model size, and whether you commit to reserved capacity or use on-demand pricing.

Model Input (per 1M tokens) Output (per 1M tokens) Provider
Gemma 3 4B ~$0.02 ~$0.04 OpenRouter
Llama 3.1 70B (comparison) ~$0.20 ~$0.20 OpenRouter
Pricing verified May 28, 2026 via OpenRouter. Rates fluctuate. Check provider dashboards for current pricing.

The cost gap is significant. Gemma 3 4B delivers roughly 10x cheaper inference than Llama 3.1 70B. The tradeoff is capability: a 4B parameter model cannot match a 70B model on complex reasoning tasks. But for classification, summarization, extraction, and straightforward generation, the smaller model often performs well enough that the cost savings justify the quality difference.


Self-Hosting Cost Calculator

Self-hosting eliminates per-token costs entirely. Your expenses are hardware (one-time or amortized) and electricity (ongoing). Here is a realistic breakdown for each model tier.

Model Minimum GPU VRAM Needed GPU Cost (New) Electricity/Month
E2B RTX 3060 / Apple M1 ~3 GB $300-$400 $3-$8
E4B RTX 4060 / Apple M2 ~6 GB $300-$500 $5-$12
27B MoE (4-bit) RTX 4090 / A6000 ~14-16 GB $1,600-$4,500 $15-$30
31B Dense (4-bit) RTX 4090 / A6000 ~16-22 GB $1,600-$4,500 $15-$30
GPU prices are approximate US retail as of May 2026. Electricity assumes $0.12/kWh and moderate daily usage.

For the E2B and E4B models, most developers already own capable hardware. The marginal cost of running Gemma locally is effectively just electricity. For the larger models, the RTX 4090 remains the best consumer-grade option. If you already own one for gaming or creative work, Gemma 4 is a free addition to your toolkit.

Cloud GPU alternative: Renting cloud GPU instances (Lambda Labs, RunPod, vast.ai) typically costs $0.50 to $2.00 per hour depending on the GPU tier. For intermittent use, this can be cheaper than buying hardware. For sustained workloads, owned hardware wins within 3 to 6 months.


Gemma vs Llama Licensing

This comparison matters because Gemma and Llama are the two most-deployed open-weight model families. Their licenses differ in one critical area.

Clause Gemma 4 (Apache 2.0) Llama 3.x (Meta License)
Commercial use Unrestricted Allowed
User threshold None 700M MAU cap
Revenue cap None None stated
Redistribution Allowed (preserve notice) Allowed with restrictions
Fine-tuning Unrestricted Allowed
Patent grant Explicit (Apache 2.0 Sec. 3) Limited
OSI approved Yes No

The 700 million monthly active user threshold in Meta's Llama license is irrelevant for most organizations. But if you are building infrastructure for a platform that could reach that scale, Gemma's Apache 2.0 removes a potential future licensing conversation entirely. More importantly, many enterprise procurement teams have pre-approved Apache 2.0 for internal use. Meta's custom license typically requires separate legal review.


Fine-Tuning Costs

Fine-tuning Gemma is where the Apache 2.0 license pays for itself. There are no API-based fine-tuning fees because there is no first-party API. You fine-tune locally using open-source tooling.

  • Software cost: $0. Use Hugging Face Transformers, PEFT, or Unsloth. All open-source.
  • Hardware for E2B/E4B: A single consumer GPU (RTX 3060 or better). QLoRA reduces memory requirements to a fraction of full fine-tuning.
  • Hardware for 27B/31B: A single RTX 4090 can handle QLoRA fine-tuning. Full fine-tuning requires multi-GPU setups or cloud GPU rentals.
  • Time: A typical QLoRA fine-tune on a domain-specific dataset (10K to 50K examples) takes 2 to 8 hours on an RTX 4090 for the E4B model.
  • Electricity: At 450W GPU power draw and $0.12/kWh, an 8-hour fine-tuning run costs approximately $0.43 in electricity.

Compare this to fine-tuning a proprietary model through an API provider, which can cost hundreds to thousands of dollars per run depending on dataset size and model tier. The Gemma approach is orders of magnitude cheaper if you have the hardware.


When to Pay for Managed APIs

Self-hosting is not always the right call. Here are situations where paying for API-based Gemma access makes more sense than running it yourself.

No GPU Hardware
If your team works on CPU-only machines or laptops without discrete GPUs, API access is the only practical option. Buying a dedicated GPU for occasional inference rarely makes financial sense.
Burst Traffic Patterns
Applications with unpredictable traffic spikes benefit from elastic API infrastructure. A local GPU has fixed throughput. An API provider auto-scales to handle request surges without queuing.
Prototyping and Evaluation
When testing whether Gemma is the right model for your use case, paying fractions of a cent per request through an API is faster and cheaper than setting up local infrastructure. Commit to self-hosting after you validate the approach.

For steady-state workloads where you process thousands of requests daily, the math almost always favors self-hosting. The crossover point depends on your volume: at Gemma 3 4B API pricing ($0.02/$0.04 per million tokens), you would need to process hundreds of millions of tokens monthly before an RTX 4090 pays for itself faster than the API route. For the larger models at higher per-token costs, the crossover arrives sooner.


Specialized Gemma Variants

Beyond the core language models, Google has released domain-specific Gemma variants. Each inherits the same licensing terms as the base model generation it was built on.

Variant Domain Use Case
PaliGemma Vision-language Image captioning, visual question answering, document understanding
CodeGemma Code Code generation, completion, explanation, and refactoring
MedGemma Medical Clinical text analysis, medical Q&A (not for diagnosis)
ShieldGemma Safety Content filtering, toxicity detection, safety classification

These variants cost the same as the base models to run: nothing for the software, your hardware for inference. They are particularly valuable because equivalent proprietary solutions (GPT-4 Vision, Claude with image inputs) charge premium per-token rates for multimodal capabilities.


Frequently Asked Questions

Is Gemma really free?

Yes. The Gemma 4 model weights are released under Apache 2.0. There is no licensing fee, no subscription, and no per-use charge from Google. Your costs are hardware (one-time) and electricity (ongoing). Third-party API providers charge for hosting, but that is their service fee, not a Google charge.

Can I use Gemma in a commercial product?

Yes, without restrictions under Gemma 4's Apache 2.0 license. You can embed Gemma in paid products, offer it as a service, or redistribute modified weights. The only requirement is preserving the Apache 2.0 license notice.

What changed from Gemma 3 to Gemma 4 licensing?

Gemma 1 through 3 used a custom "Gemma Terms of Use" license that was source-available but not OSI-approved open source. Gemma 4 switched to Apache 2.0, removing all commercial restrictions and making it compatible with enterprise open-source policies.

Which Gemma model should I start with?

Start with E4B. It runs on most modern laptops with a discrete GPU (6 GB VRAM), supports 128K context, and offers the best balance between capability and hardware requirements. Move to 27B MoE only if E4B's quality is insufficient for your task.

How does Gemma compare to GPT-4 on cost?

Self-hosted Gemma has zero marginal cost per token. GPT-4 charges $30 per million input tokens and $60 per million output tokens (as of early 2026). The quality gap exists, but for many production tasks like classification, extraction, and summarization, Gemma's smaller models deliver adequate results at a fraction of the cost.

Verified May 2026
Gemma is a trademark of Google LLC. Llama is a trademark of Meta Platforms, Inc. GPT is a trademark of OpenAI. Claude is a trademark of Anthropic. All other trademarks belong to their respective owners. Tech Jacks Solutions is not affiliated with Google, Meta, or any model provider mentioned in this article.
Before You Use AI
Your Privacy

Self-hosted Gemma models process all data locally. Your prompts and outputs never leave your hardware. If you use third-party API providers (OpenRouter, Together AI, etc.), your data is transmitted to and processed on their infrastructure. Review each provider's data retention and training opt-out policies before sending sensitive content. Enterprise API tiers generally do not train on customer data; free and community tiers may.

Mental Health & AI Dependency

Open-weight models that run locally can become always-available assistants that gradually replace independent judgment in routine decisions. Maintain deliberate review of AI-generated outputs, especially for consequential decisions. If you or someone you know is experiencing a mental health crisis:

  • 988 Suicide & Crisis Lifeline -- Call or text 988 (US)
  • SAMHSA Helpline -- 1-800-662-4357
  • Crisis Text Line -- Text HOME to 741741

AI systems can produce plausible-sounding but incorrect guidance. For mental health, medical, legal, or financial decisions, always consult a qualified professional.

Your Rights & Our Transparency

Under GDPR and CCPA, you have the right to access, correct, and delete your personal data held by any API provider you use with Gemma models. Tech Jacks Solutions maintains editorial independence. This article was not sponsored, reviewed, or approved by Google, OpenRouter, or any provider mentioned. We receive no affiliate commissions from any linked service. Our evaluations are based on primary documentation, verified pricing data, and hands-on testing. The EU AI Act classifies open-weight foundation models under GPAI obligations.