DiffusionGemma Generates Over 1,000 Tokens Per Second: What Developers Actually Give Up

June 12, 2026 2 min read Hugging Face, Gemma 4 Release Blog Partial Strong


Tech Jacks Solutions AI News Coverage

Google has published new performance data for DiffusionGemma, its parallel text diffusion model built on the Gemma 4 MoE architecture, reporting speeds exceeding 1,000 tokens per second on a single H100, at a measurable cost to reasoning benchmark scores. The tradeoff is real and the use-case targeting matters.


GPQA Diamond delta, 9.1 points

Key Takeaways

Google reports that DiffusionGemma generates 1,000+ tokens/sec on a single H100, which its published data describes as 4x faster than autoregressive equivalents
The speed comes at a documented cost: 77.6% on MMLU Pro and 73.2% on GPQA Diamond, vs. 82.6% and 82.3% for the standard Gemma 4 26B (Google's evaluation; Epoch AI evaluation pending)
The architecture uses discrete text diffusion on a 26B MoE with ~3.8B active parameters; weights are Apache 2.0 licensed and available on Hugging Face, Kaggle, and Vertex AI
Benchmark figures are self-reported and independent evaluation is pending; they are not a deployment signal until Epoch AI or an equivalent evaluator weighs in

Model Release

DiffusionGemma (diffusiongemma-26B-A4B-it)

Organization: Google DeepMind

Type: Open Source LLM

Parameters: 26B total / ~3.8B active (MoE)

Benchmark: [SELF-REPORTED] MMLU Pro: 77.6% | GPQA Diamond: 73.2% (Epoch AI evaluation pending)

Availability: Hugging Face, Kaggle, Google Cloud Vertex AI Model Garden (Apache 2.0)

Speed is the whole pitch. Google states DiffusionGemma generates over 1,000 tokens per second on a single NVIDIA H100 GPU, a figure the company attributes to the model’s core architectural difference: instead of producing tokens one at a time like every standard autoregressive LLM, it generates up to 256 tokens simultaneously in a single forward pass. According to the Hugging Face Gemma 4 release documentation, DiffusionGemma uses discrete diffusion, denoising blocks of text in parallel rather than predicting the next token sequentially. That’s not an optimization of the standard approach. It’s a different approach.
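To make the difference in sequential work concrete, here is a toy count of forward passes. It is a sketch, not a model implementation. The 256-token block size is Google's stated figure; the `denoise_steps` parameter is hypothetical, included only because the source does not say how many refinement passes a block needs in practice.

```python
import math

BLOCK_SIZE = 256  # tokens emitted per parallel pass, per Google's stated figure


def autoregressive_passes(num_tokens: int) -> int:
    """Autoregressive decoding: one forward pass per generated token."""
    return num_tokens


def block_parallel_passes(num_tokens: int,
                          block_size: int = BLOCK_SIZE,
                          denoise_steps: int = 1) -> int:
    """Block-parallel decoding: each pass covers a whole block of tokens.

    denoise_steps is a hypothetical knob (not from the source) for the
    number of passes spent refining each block.
    """
    blocks = math.ceil(num_tokens / block_size)
    return blocks * denoise_steps


if __name__ == "__main__":
    n = 1024
    print(autoregressive_passes(n))                    # 1024
    print(block_parallel_passes(n))                    # 4
    print(block_parallel_passes(n, denoise_steps=16))  # 64
```

A lower pass count does not imply a proportional wall-clock gain, because each parallel pass does more work per call. The end-to-end speedup Google reports is 4x.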

The headline speed claim traces back to the primary source. Google’s published performance data describes DiffusionGemma as “4x faster” than autoregressive equivalents, and the Google Blog headline uses the same figure. The 1,000+ tokens/sec absolute figure and the 256-tokens-per-pass detail are attributed to Google and haven’t been independently reproduced yet. The catch is the “5x faster” figure that has circulated in some coverage: the primary source doesn’t support it. Use the 4x figure.
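For a rough sense of what the two reported figures mean together, the arithmetic below derives the baseline throughput they imply and the wall-clock time for a batch job. This is a back-of-the-envelope sketch: it assumes the 4x multiplier and the 1,000 tokens/sec figure describe the same hardware and workload, which the source does not state, and the 10-million-token job size is a made-up example.

```python
REPORTED_TOKENS_PER_SEC = 1_000.0  # Google's stated lower bound on one H100
REPORTED_SPEEDUP = 4.0             # "4x faster", per Google


def implied_baseline_tps(fast_tps: float, speedup: float) -> float:
    """Throughput of the slower system implied by a speedup multiplier."""
    return fast_tps / speedup


def job_seconds(total_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock seconds to generate total_tokens at a steady rate."""
    return total_tokens / tokens_per_sec


if __name__ == "__main__":
    baseline = implied_baseline_tps(REPORTED_TOKENS_PER_SEC, REPORTED_SPEEDUP)
    job = 10_000_000  # hypothetical batch size, not from the source
    print(baseline)                                          # 250.0
    print(job_seconds(job, REPORTED_TOKENS_PER_SEC) / 3600)  # about 2.8 hours
    print(job_seconds(job, baseline) / 3600)                 # about 11.1 hours
```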

The benchmark cost is documented. According to Google’s evaluation, DiffusionGemma scores 77.6% on MMLU Pro and 73.2% on GPQA Diamond. The standard autoregressive Gemma 4 26B scores 82.6% on MMLU Pro and 82.3% on GPQA Diamond, roughly a 5-to-9 point gap on reasoning-heavy benchmarks. These figures are self-reported and haven’t been independently verified; an Epoch AI evaluation is pending. Until that evaluation lands, treat the specific scores as directional rather than definitive.

Benchmark Scores, DiffusionGemma vs. Gemma 4 26B (Standard), all self-reported

Benchmark       DiffusionGemma    Gemma 4 26B Standard
MMLU Pro        77.6%             82.6%
GPQA Diamond    73.2%             82.3%
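The gaps behind the "5-to-9 point" summary can be checked directly from the four self-reported scores. The short sketch below computes both the absolute gap in percentage points and the drop relative to the standard model's score; the function and variable names are illustrative.

```python
# Self-reported scores from Google's evaluation, in percent.
SCORES = {
    "MMLU Pro": {"diffusion": 77.6, "standard": 82.6},
    "GPQA Diamond": {"diffusion": 73.2, "standard": 82.3},
}


def gap(diffusion: float, standard: float) -> tuple[float, float]:
    """Return (absolute gap in points, relative drop in percent)."""
    points = standard - diffusion
    relative = 100 * points / standard
    return round(points, 1), round(relative, 1)


if __name__ == "__main__":
    for name, s in SCORES.items():
        print(name, gap(s["diffusion"], s["standard"]))
    # MMLU Pro (5.0, 6.1)
    # GPQA Diamond (9.1, 11.1)
```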

The architecture sits on the Gemma 4 26B Mixture-of-Experts foundation: 26 billion total parameters, with approximately 3.8 billion active during inference, a figure confirmed across multiple independent sources including the Hugging Face model documentation. Google states the quantized model (NVFP4 format) runs in under 18GB of VRAM, putting it within reach of RTX 4090 and 5090 hardware. That 18GB figure is vendor-stated; don’t build your infrastructure plan around it before independent confirmation. Weights are open under an Apache 2.0 license and available through Hugging Face, Kaggle, and Google Cloud Vertex AI Model Garden.
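A quick sanity check on the VRAM claim is to estimate the weight footprint alone. The sketch below is an estimate under stated assumptions, not a measurement. It assumes all 26B parameters are stored at 4 bits, that "GB" means 10^9 bytes, and, as a second scenario, an effective 4.5 bits per weight to account for per-block scale factors (one 8-bit scale per 16 weights, which is an assumption about the NVFP4 layout rather than something the source states). In a MoE model the full parameter set typically has to be resident even though only ~3.8B parameters are active per token.

```python
TOTAL_PARAMS = 26e9    # total parameters, per the source
VRAM_CLAIM_GB = 18.0   # vendor-stated ceiling for the quantized model


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (10^9 bytes) at a given bit width."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    raw = weight_gb(TOTAL_PARAMS, 4.0)          # 4-bit values only
    with_scales = weight_gb(TOTAL_PARAMS, 4.5)  # assumed overhead for block scales
    print(f"{raw:.3f}")                          # 13.000
    print(f"{with_scales:.3f}")                  # 14.625
    print(f"{VRAM_CLAIM_GB - with_scales:.3f}")  # 3.375 left for everything else
```

Under these assumptions the weights fit below the stated ceiling, with a few gigabytes left for activations, caches, and runtime overhead. Whether that headroom is enough depends on context length and batch size, which the source does not specify.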

The part nobody mentions: DiffusionGemma’s 73.2% on GPQA Diamond sits 9.1 points below the autoregressive baseline’s 82.3%. GPQA Diamond specifically tests graduate-level scientific reasoning, the kind of problem-solving that makes LLMs genuinely useful for research, complex coding, and multi-step analysis. A 9-point drop there isn’t a rounding error. For throughput-heavy workloads (synthetic data generation, real-time content pipelines, high-volume summarization), the speed advantage is real and the accuracy loss may not matter. For reasoning-intensive tasks, the math points the other way.

Verification

Partial: Google Blog (speed claim) + Hugging Face documentation (architecture) + T3 secondary sources (parameters). Benchmark scores are vendor self-reported. The 1,000+ tokens/sec absolute figure is Google-attributed only. Epoch AI independent evaluation pending.

Disputed Claim

DiffusionGemma is 4x to 5x faster than autoregressive equivalents

The 4x figure is confirmed by Google's published headline. The 5x upper bound is not supported by the primary source.

Use 'up to 4x faster, per Google'; remove the 5x figure from any production context.

Don’t expect DiffusionGemma to replace your primary inference stack. This is a specialized tool for specific throughput scenarios, not a general-purpose upgrade. The architectural novelty is genuine: discrete text diffusion is a meaningful departure from the autoregressive paradigm that has dominated LLM development. But novelty and utility aren’t the same thing.

Wait for the Epoch AI evaluation before any production migration decisions. The self-reported benchmarks are directional; they’re not a deployment green light.