Google DeepMind's Gemma 4 Is Live on HuggingFace: Open Multimodal Models From 2B to 31B Parameters

July 1, 2026


Tech Jacks Solutions AI News Coverage

Google DeepMind has released the Gemma 4 open-weights model family on HuggingFace, spanning multiple sizes designed to run on consumer and edge hardware. The release signals that capable multimodal AI is moving toward local deployment, though key specifications still await verification against Google's official release materials.


Gemma 4 sizes reported: 5

Key Takeaways

Gemma 4, Google DeepMind's open-weights multimodal model family, is live on HuggingFace, confirmed via repository presence
The 12B variant is confirmed; a reported five-size lineup (E2B through 31B) awaits verification against Google's official release page
Specifications including context window, VRAM requirements, and benchmark comparisons are currently sourced from third-party repositories; treat them as reported, not confirmed
Google's official Gemma 4 announcement page was not captured in this reporting cycle; a follow-up is in progress

Model Release

Gemma 4 (family)

Organization: Google DeepMind

Type: Open Source LLM

Parameters: Reported as E2B, E4B, 12B, 26B A4B (MoE), and 31B; pending verification against official release

Benchmark: Not independently verified at time of publication

Availability: HuggingFace, open weights

Open-weights multimodal, at sizes that fit on a laptop. The Gemma 4 family is available on HuggingFace, confirmed by the presence of optimized model files including the 12B instruction-tuned variant. Google DeepMind’s release targets developers who want capable multimodal models without API dependency, a direct move toward local and on-device AI deployment that commercial API providers can’t match on privacy or latency grounds.

The model family is reported to span five sizes: E2B, E4B, 12B, 26B A4B (a mixture-of-experts configuration), and 31B. These specifications come from third-party repository context rather than Google’s official release page, which wasn’t captured in this reporting cycle, so treat the full size lineup as reported but pending verification against Google’s primary announcement. The 12B variant’s existence is confirmed by the HuggingFace repository URL itself.

Why it matters

The frontier isn’t just moving faster; it’s moving local. A multimodal model family that runs on consumer hardware changes the calculus for teams building private AI applications: healthcare systems that can’t send patient data to an external API, legal teams with privilege concerns, enterprises in regulated industries with data residency requirements. Gemma 4 at the E2B and E4B sizes is reportedly designed to run with minimal VRAM. One reported claim puts the E2B at 3GB of GPU memory via a Per-Layer Embeddings architecture, though this technical specification wasn’t confirmed in the source materials available for this report.
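A rough way to sanity-check any reported VRAM figure is to work out the weights-only footprint: parameter count times bits per weight, divided by eight. The sketch below does only that arithmetic. The parameter counts are assumptions read off the reported size labels (2e9 for E2B, 12e9 for 12B), not numbers confirmed by Google. It does not model Per-Layer Embeddings, the KV cache, activations, or runtime overhead, so it cannot confirm or refute the 3GB claim.

```python
# Weights-only memory estimate: parameters x bits per weight / 8.
# ASSUMPTION: the parameter counts below are nominal values inferred from
# the reported size labels, not figures confirmed by Google DeepMind.

BITS_PER_WEIGHT = {"fp16/bf16": 16, "int8": 8, "4-bit": 4}


def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Return decimal gigabytes needed to hold the weights alone."""
    return num_params * bits_per_weight / 8 / 1e9


def memory_table(num_params: float) -> dict[str, float]:
    """Weights-only footprint for each precision in BITS_PER_WEIGHT."""
    return {
        name: weight_memory_gb(num_params, bits)
        for name, bits in BITS_PER_WEIGHT.items()
    }


if __name__ == "__main__":
    # Hypothetical nominal sizes, used for illustration only.
    assumed_sizes = {"E2B (assumed 2e9)": 2e9, "12B (assumed 12e9)": 12e9}
    for label, params in assumed_sizes.items():
        row = ", ".join(
            f"{name}: {gb:.1f} GB" for name, gb in memory_table(params).items()
        )
        print(f"{label} -> {row}")
```

Under these assumptions, a nominal 2-billion-parameter model needs about 4.0 GB for weights at 16 bits and about 1.0 GB at 4 bits. Practical 4-bit formats also store scaling data, so real files tend to be somewhat larger than this estimate. Any single VRAM number therefore only means something alongside the precision it was measured at.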

The catch is the sourcing. Google’s official Gemma 4 announcement page wasn’t among the materials available for this report. The specifications circulating in developer communities (size lineup, context window figures, efficiency benchmarks) are coming through third-party repositories and optimization packages like Unsloth’s GGUF conversions, not directly from DeepMind. That’s common in open-source model releases where community adoption outpaces official documentation, but it means practitioners should verify specs against Google’s primary release before building on them.

Disputed Claim

Gemma 4 E2B runs on 3GB of GPU memory via Per-Layer Embeddings architecture

Technical specification sourced from third-party optimization repository (Unsloth GGUF), not Google DeepMind's official release page, which was not captured in this reporting cycle

Verify against Google's official technical documentation before using this figure in deployment planning

Context

Gemma 4 continues a trajectory Google established with Gemma 2 and Gemma 3: increasingly capable open-weights models that give developers alternatives to proprietary APIs. The multimodal emphasis is significant: earlier Gemma releases were predominantly text-focused, with image input arriving only in the larger Gemma 3 sizes. Offering vision inputs at this size range puts Gemma 4 in competition with Meta’s Llama multimodal releases and Microsoft’s Phi series for the on-device and edge deployment market. That’s a crowded space, and Google’s distribution advantage through HuggingFace and its own developer ecosystem is a real differentiator.

Don’t build your benchmark comparisons on claims that trace only to third-party repositories. An inference speed figure from a community optimization package is a performance figure for that specific quantized version, not the base model Google released.

What to watch

Watch for Google DeepMind’s official technical report or blog post on Gemma 4; that’s the source that will confirm or revise the size lineup and architectural details currently circulating from community sources. Watch also for Epoch AI or independent benchmark coverage; without it, any performance comparisons to commercial models remain unverified vendor-adjacent claims. The mixture-of-experts 26B A4B variant is worth particular attention if confirmed: MoE architectures at that active parameter count could deliver strong performance at lower inference cost than dense models of comparable capability.
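The cost argument for mixture-of-experts comes down to two different parameter counts: every weight has to be stored, but only the active subset is used for a given token. The sketch below illustrates that split. It reads the "26B A4B" label as roughly 26e9 total and 4e9 active parameters, and compares it with a hypothetical fully dense 31e9-parameter model. Both readings are assumptions, not confirmed specifications. The compute figure uses the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass, which ignores attention over long contexts and expert-routing overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Total vs. active parameters for a dense or MoE model (illustrative)."""

    name: str
    total_params: float   # every weight must be held in memory
    active_params: float  # weights actually used for each token

    def weight_memory_gb(self, bits_per_weight: int = 16) -> float:
        """Weights-only footprint in decimal GB; scales with TOTAL params."""
        return self.total_params * bits_per_weight / 8 / 1e9

    def forward_gflops_per_token(self) -> float:
        """Rule of thumb: ~2 FLOPs per ACTIVE parameter per token."""
        return 2 * self.active_params / 1e9


# ASSUMPTION: both shapes are hypothetical readings of the reported labels.
moe = ModelShape("26B A4B (assumed)", total_params=26e9, active_params=4e9)
dense = ModelShape("31B dense (assumed)", total_params=31e9, active_params=31e9)

if __name__ == "__main__":
    for m in (moe, dense):
        print(
            f"{m.name}: {m.weight_memory_gb():.0f} GB weights at 16-bit, "
            f"~{m.forward_gflops_per_token():.0f} GFLOPs per token"
        )
```

Under these assumptions the MoE shape still needs memory for all of its weights, but its per-token compute is a fraction of the dense model's. Whether that translates into comparable quality is exactly what independent benchmarks would need to show.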

Unanswered Questions

What are the confirmed context window sizes across the Gemma 4 family per Google's official release?
Does the MoE architecture in the 26B A4B variant perform comparably to dense models at higher parameter counts on practical workloads?
What are the confirmed VRAM requirements for each size at standard precision, not just quantized GGUF versions?

TJS synthesis

Hold off on architecture decisions that depend on unconfirmed Gemma 4 specs. The model family’s existence is confirmed; the details aren’t. Wait for Google’s official technical documentation before evaluating Gemma 4 against your deployment requirements. When that documentation lands, the E2B and E4B sizes are the ones to benchmark first. If the efficiency claims hold up independently, they represent a genuine option for private deployment use cases that commercial APIs can’t serve.

Sources: Google, Google, Nvidia, Hugging Face.