Open Source AI News: Gemma 4 QAT Checkpoints and MTP Draft Models Are Live, What Developers Get and What's Still Pending

June 7, 2026 | 3 min read | Google DeepMind | Partial | Strong

Tech Jacks Solutions AI News Coverage

Google DeepMind released quantization-optimized checkpoints and multi-token prediction draft models for Gemma 4 on June 5, building on the base model launched two days earlier. For developers running Gemma 4 locally, this is a meaningful throughput upgrade, though the headline speedup figure is self-reported and still awaits independent confirmation.

open-source-ai gemma google-deepmind quantization local-inference on-device-ai ai-models

Reported throughput gain: up to 2.5x (self-reported)

Key Takeaways

Google DeepMind released Q4_0 and W4A16 QAT checkpoints plus MTP assistant draft models for Gemma 4 12B on June 5, free under the Gemma open license.
QAT’s advantage over PTQ at low bit-widths is well established in the quantization literature. The specific Gemma 4 speedup claim (up to 2.5x) is self-reported, with an Epoch evaluation pending.
MTP draft models are natively supported in Hugging Face Transformers. vLLM integration is a submitted PR that has not yet merged; check its status before building on vLLM stacks.
The sub-1GB drafter footprint figure is held pending Wire re-research to clarify the "31B" parameter label.

Model Release

Gemma 4 QAT Checkpoints & MTP Assistant Draft Models

Organization: Google DeepMind

Type: Open Source LLM

Parameters: 12B (base model; drafter parameter count under re-research)

Benchmark: [SELF-REPORTED] Up to 2.5x throughput improvement; Epoch evaluation pending

Availability: Free, Hugging Face (Gemma open license); vLLM PR submitted, not yet merged

Verification

Partial. Sources: Google DeepMind Gemma product page (inaccessible) + Hugging Face repository. The 2.5x speedup is self-reported; Epoch evaluation is pending; the sub-1GB drafter footprint claim is held pending Wire clarification on parameter count labeling.

The Gemma 4 12B base model launched June 3. Two days later, Google DeepMind shipped the deployment optimization layer: QAT checkpoints and MTP assistant draft models for developers running the model on local hardware. These aren’t architecture changes to the underlying model. They’re compression and inference acceleration formats designed to make Gemma 4 faster and lighter without rebuilding from scratch.

Two release types. First: QAT (Quantization-Aware Training) checkpoints in Q4_0 and W4A16 formats. QAT bakes quantization into the training process itself rather than applying it after the fact. That’s the difference from PTQ (Post-Training Quantization), and it matters: at aggressive bit-widths like 4-bit, PTQ typically degrades reasoning capability more than QAT does. This is established in the quantization literature, not a Google-specific claim. Q4_0 (a 4-bit block format from the GGML/llama.cpp ecosystem) and W4A16 (4-bit weights with 16-bit activations) are standard format designators used across open-source ML tooling, not proprietary naming.
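To make the PTQ-versus-QAT distinction concrete, here is a minimal sketch of the rounding step both approaches share. It is a simplified block-wise 4-bit round trip, not the exact Q4_0 or W4A16 encoding; the block size of 32, the symmetric scale, and every function name are assumptions chosen for illustration. PTQ applies a step like this once to finished weights, while QAT runs a simulated version of it during training so the weights can adapt to the rounding error.

```python
import random

BLOCK = 32  # assumed block size, for illustration only


def quantize_block(block):
    """Map floats to 4-bit signed codes in [-8, 7] with one shared scale."""
    max_abs = max(abs(x) for x in block)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the 4-bit codes."""
    return [scale * q for q in codes]


def round_trip(weights):
    """Quantize and immediately dequantize, block by block."""
    restored = []
    for i in range(0, len(weights), BLOCK):
        scale, codes = quantize_block(weights[i:i + BLOCK])
        restored.extend(dequantize_block(scale, codes))
    return restored


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4 * BLOCK)]  # synthetic weights
restored = round_trip(weights)
error = mean_abs_error(weights, restored)
print(f"mean absolute rounding error: {error:.6f}")
```

The printed error is what a post-training pass leaves in the model uncorrected. Quantization-aware training exposes the model to that same error while gradients can still compensate for it.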

Second: MTP (Multi-Token Prediction) assistant draft models. Speculative decoding is a documented technique for throughput improvement in local inference: a smaller draft model proposes several tokens ahead, and the full model then verifies them in a single forward pass, keeping the tokens it agrees with. Gemma 4’s MTP assistants implement this. Google reports up to 2.5x throughput improvement in local token generation compared to running the base model without speculative decoding. That figure comes from Google’s own evaluation, and the primary source is currently inaccessible; independent evaluation is pending. Use it as a directional reference, not a deployment guarantee.
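The draft-then-verify loop can be sketched with toy stand-ins for the two models. This is greedy speculative decoding with exact-match acceptance, written under loud assumptions: `target_next` and `draft_next` are hypothetical toy functions, not Gemma 4 or any library API, and production engines that sample tokens use a probabilistic acceptance rule instead. It says nothing about how Gemma 4’s MTP assistants are wired internally.

```python
def target_next(context):
    """Toy 'full model': deterministic next token from the context."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Toy 'draft model': agrees with the target except at every 4th position."""
    token = target_next(context)
    return (token + 1) % 50 if len(context) % 4 == 0 else token


def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq


def speculative_decode(prompt, n_new, k=4):
    """Draft proposes up to k tokens; the target keeps the matching prefix."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(seq) < goal:
        # Step 1: the cheap draft model proposes a short continuation.
        proposal = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft_next(seq + proposal))
        # Step 2: the target checks each proposed position. A real engine
        # scores all positions in one batched forward pass; this toy loops.
        verify_passes += 1
        for i, token in enumerate(proposal):
            expected = target_next(seq + proposal[:i])
            if token != expected:
                # Keep the accepted prefix plus the target's own correction.
                seq.extend(proposal[:i])
                seq.append(expected)
                break
        else:
            seq.extend(proposal)  # every proposed token was accepted
    return seq, verify_passes


prompt = [1, 2, 3]
baseline = greedy_decode(prompt, 12)
speculative, passes = speculative_decode(prompt, 12, k=4)
print(speculative == baseline, passes)
```

The output sequence matches plain greedy decoding token for token, while the target is consulted in fewer verification passes than there are generated tokens. How much of that turns into wall-clock speedup depends on how often the drafter agrees with the full model and on how cheap the drafter is to run.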

Gemma 4 Local Inference: Before and After June 5 Update

Before June 5: Gemma 4 12B base model only; full-precision weights; no speculative decoding; HF Transformers and vLLM support

After June 5: QAT checkpoints (Q4_0, W4A16) for compressed local inference; MTP draft models for speculative decoding; HF Transformers native support; vLLM support via pending PR

The part nobody mentions: vLLM support for the MTP draft models isn’t there yet. A pull request has been submitted, but as of June 5, 2026, it hasn’t merged. Hugging Face Transformers supports the MTP models natively. If your inference stack runs on vLLM, check the PR status before building on this. HF Transformers works now. vLLM is coming.

Gemma 4’s 256K context window carries over from the June 3 base model release; that’s an inherited spec, not a new announcement from this update. Models are available under the Gemma open license at no cost from Hugging Face.

One figure that won’t appear in this brief: a specific memory footprint claim for the MTP drafter model. The Wire’s package included a sub-1GB figure paired with a “31B” label, a combination that’s technically implausible as written (a 31B-parameter model in any standard quantization format requires substantially more than 1GB). The Filter has sent a re-research request to clarify whether “31B” refers to the base model the drafter serves or the drafter’s own parameter count. That figure will appear here when the clarification resolves.
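The implausibility is simple arithmetic, sketched below. The helper names are hypothetical, and the estimate counts weight storage only, in decimal gigabytes, ignoring quantization scales, activations, and runtime overhead, all of which would push the real number higher.

```python
def weight_footprint_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal GB (weights only)."""
    return n_params * bits_per_weight / 8 / 1e9


def max_params_for_budget(budget_gb, bits_per_weight):
    """Largest parameter count whose weights fit in the given budget."""
    return budget_gb * 1e9 * 8 / bits_per_weight


footprint_31b_4bit = weight_footprint_gb(31e9, 4)    # 31B params at 4 bits
footprint_31b_16bit = weight_footprint_gb(31e9, 16)  # 31B params at 16 bits
params_in_1gb_4bit = max_params_for_budget(1.0, 4)   # what 1 GB can hold at 4 bits

print(f"31B at 4-bit:  {footprint_31b_4bit:.1f} GB")
print(f"31B at 16-bit: {footprint_31b_16bit:.1f} GB")
print(f"1 GB at 4-bit holds about {params_in_1gb_4bit / 1e9:.1f}B parameters")
```

At 4 bits per weight, 31 billion parameters need about 15.5GB for weights alone, and a 1GB budget holds roughly 2 billion parameters. A sub-1GB drafter and a 31B parameter count therefore cannot describe the same model.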

What to Watch

Epoch AI evaluation of Gemma 4 QAT checkpoint performance: 4-8 weeks

vLLM PR merge for MTP draft model support: 2-4 weeks

Wire re-research resolution on MTP drafter parameter count and memory footprint: Next cycle


The Epoch AI evaluation of QAT checkpoint performance is the trigger to move from “Google reports 2.5x” to a verified throughput figure. Also watch for the vLLM PR merge, which determines when this release becomes accessible to the majority of production local inference stacks.

The TJS read: if you’re running Gemma 4 12B locally on HF Transformers, download the MTP draft models and run your own inference tests. The architectural logic behind both QAT and speculative decoding is sound. The specific speedup number is directional. Your hardware configuration will determine your actual result. Don’t wait for Epoch before testing; wait for Epoch before quoting the number to your team as a confirmed baseline.
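For those local tests, a small timing harness is enough to get a first speedup number on your own hardware. The sketch below is stack-agnostic by design: `tokens_per_second` expects a callable you supply that runs one generation and returns the number of tokens it produced. The two `fake_*` callables are hypothetical stand-ins that only sleep; replace them with wrappers around your real baseline and draft-assisted generation calls.

```python
import statistics
import time


def tokens_per_second(generate, prompt, n_new_tokens, repeats=3, warmup=1):
    """Median tokens/sec for a generate(prompt, n_new_tokens) -> int callable."""
    for _ in range(warmup):
        generate(prompt, n_new_tokens)  # discard warmup runs (caches, lazy init)
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        produced = generate(prompt, n_new_tokens)
        elapsed = time.perf_counter() - start
        rates.append(produced / elapsed)
    return statistics.median(rates)


def speedup(baseline_tps, candidate_tps):
    """Ratio of candidate throughput to baseline throughput."""
    return candidate_tps / baseline_tps


# Hypothetical stand-ins: they sleep instead of running a model.
def fake_baseline(prompt, n_new_tokens):
    time.sleep(0.02)
    return n_new_tokens


def fake_assisted(prompt, n_new_tokens):
    time.sleep(0.01)
    return n_new_tokens


base_tps = tokens_per_second(fake_baseline, "hello", 16)
assisted_tps = tokens_per_second(fake_assisted, "hello", 16)
print(f"baseline {base_tps:.0f} tok/s, assisted {assisted_tps:.0f} tok/s, "
      f"speedup {speedup(base_tps, assisted_tps):.2f}x")
```

Using the median over several repeats after a warmup run keeps one slow first call from skewing the result. Test with prompts and output lengths that resemble your actual workload, since drafter acceptance rates vary with the text being generated.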