The Gemma 4 12B base model launched June 3. Two days later, Google DeepMind shipped the deployment optimization layer: QAT checkpoints and MTP assistant draft models for developers running the model on local hardware. These aren’t architecture changes to the underlying model. They’re compression and inference acceleration formats designed to make Gemma 4 faster and lighter without rebuilding from scratch.
Two release types. First: QAT (Quantization-Aware Training) checkpoints in Q4_0 and W4A16 formats. QAT bakes quantization into the training process itself rather than applying it after the fact. That’s the difference from PTQ (Post-Training Quantization), and it matters: at aggressive bit-widths like 4-bit, PTQ typically degrades reasoning capability more than QAT does. This is established in the quantization literature, not a Google-specific claim. Q4_0 (the 4-bit block format used by llama.cpp and GGUF) and W4A16 (4-bit weights, 16-bit activations) are standard format designators used across open-source ML tooling, not proprietary naming.
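To make the 4-bit tradeoff concrete, here is a minimal sketch of symmetric block quantization in plain Python. It is a simplified illustration in the spirit of block formats like Q4_0, not the exact on-disk layout of any real format; the block size, the scale rule, and all function names are assumptions made for this example. The rounding step shown here is what PTQ applies once after training. QAT, in general terms, simulates the same rounding during training so the weights can adapt to it.

```python
import random

BLOCK = 32  # block size assumed for illustration only


def quantize_block(block):
    """Symmetric 4-bit quantization: one float scale plus integers in [-8, 7]."""
    amax = max(abs(x) for x in block)
    scale = amax / 7 if amax > 0 else 1.0
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, q


def dequantize_block(scale, q):
    """Map the 4-bit integers back to approximate float weights."""
    return [scale * v for v in q]


def roundtrip_error(weights):
    """Mean absolute error introduced by quantize -> dequantize."""
    total = 0.0
    for i in range(0, len(weights), BLOCK):
        block = weights[i:i + BLOCK]
        scale, q = quantize_block(block)
        restored = dequantize_block(scale, q)
        total += sum(abs(a - b) for a, b in zip(block, restored))
    return total / len(weights)


random.seed(0)
# Hypothetical weights: small Gaussian values standing in for a real layer.
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]
mae = roundtrip_error(weights)
print(f"mean absolute round-trip error: {mae:.6f}")
```

Every weight lands on one of 16 levels per block, so the per-weight error is bounded by half the block scale. Whether that error hurts reasoning depends on how the model was trained to tolerate it, which is the QAT argument.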
Second: MTP (Multi-Token Prediction) assistant draft models. Speculative decoding is a documented technique for improving throughput in local inference: a smaller draft model proposes several tokens ahead, and the full model verifies them in a single forward pass. Gemma 4’s MTP assistants implement this. Google reports up to 2.5x throughput improvement in local token generation compared to running the base model without speculative decoding. That figure comes from Google’s own evaluation, and the primary source is currently inaccessible; independent evaluation is pending. Use it as a directional reference, not a deployment guarantee.
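The draft-then-verify loop is easier to see in a toy. The sketch below is a minimal greedy speculative decoder in plain Python, with two made-up deterministic functions standing in for the full model and the drafter. Nothing here reflects Gemma 4’s actual MTP implementation; `target_next`, `draft_next`, and the token arithmetic are hypothetical. It only shows why the output matches ordinary greedy decoding while the full model runs fewer verification passes.

```python
def target_next(seq):
    """Hypothetical 'full model': deterministic next token from the context."""
    return (seq[-1] * 3 + len(seq)) % 7


def draft_next(seq):
    """Hypothetical 'draft model': agrees with the target except at some positions."""
    tok = target_next(seq)
    return tok if len(seq) % 4 else (tok + 1) % 7


def greedy_decode(prompt, n_new):
    """Baseline: one full-model pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq, n_new


def speculative_decode(prompt, n_new, k=4):
    """Draft k tokens, then let the target verify them in one conceptual pass."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(seq)
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target scores all k + 1 positions in a single pass.
        target_passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target_next(seq + proposal[:i])
            accepted.append(expected)  # the target's token is always kept
            if i == k or proposal[i] != expected:
                break  # first mismatch (or bonus token) ends this round
        seq.extend(accepted)
    return seq[:goal], target_passes


prompt = [1, 2, 3]
base_seq, base_passes = greedy_decode(prompt, 20)
spec_seq, spec_passes = speculative_decode(prompt, 20)
print(base_seq == spec_seq, base_passes, spec_passes)
```

The speedup depends entirely on how often the drafter agrees with the full model, which is why a vendor-reported multiplier won’t transfer cleanly to a different workload or hardware setup.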
Gemma 4 Local Inference: Before and After June 5 Update
The part nobody mentions: vLLM support for the MTP draft models isn’t there yet. A pull request has been submitted, but as of June 5, 2026, it hasn’t merged. Hugging Face Transformers supports the MTP models natively. If your inference stack runs on vLLM, check the PR status before building on this. HF Transformers works now. vLLM is coming.
Gemma 4’s 256K context window carries over from the June 3 base model release. That’s an inherited spec, not a new announcement from this update. Models are available under the open-source Gemma license at no cost from Hugging Face.
One figure that won’t appear in this brief: a specific memory footprint claim for the MTP drafter model. The Wire’s package included a sub-1GB figure paired with a “31B” label, a combination that’s technically implausible as written (a 31B-parameter model at any standard quantization format requires substantially more than 1GB). The Filter has sent a re-research request to clarify whether “31B” refers to the base model the drafter serves or the drafter’s own parameter count. That figure will appear here when the clarification resolves.
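The implausibility is simple arithmetic. A rough sketch, counting weights only and assuming exactly 4 bits per weight (real formats add some overhead for scales, so this is a lower bound):

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate storage for model weights alone, ignoring format overhead."""
    return n_params * bits_per_weight / 8


GB = 1e9  # decimal gigabytes

# A 31B-parameter model at 4 bits per weight.
footprint_gb = weight_bytes(31e9, 4) / GB

# Largest parameter count that fits in 1 GB at 4 bits per weight.
max_params_in_1gb = 1 * GB * 8 / 4

print(f"31B at 4-bit: {footprint_gb:.1f} GB")
print(f"params that fit in 1 GB at 4-bit: {max_params_in_1gb / 1e9:.0f}B")
```

Under those assumptions a 31B model needs about 15.5 GB for weights, while 1 GB holds roughly 2B parameters. A sub-1GB file is consistent with a small drafter, not with a 31B parameter count.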
What to Watch
An Epoch AI evaluation of QAT checkpoint performance is the trigger to move from “Google reports 2.5x” to a verified throughput figure. Also watch for the vLLM PR merge, which determines when this release becomes accessible to the majority of production local inference stacks.
The TJS read: if you’re running Gemma 4 12B locally on HF Transformers, download the MTP draft models and run your own inference tests. The architectural logic behind both QAT and speculative decoding is sound. The specific speedup number is directional. Your hardware configuration will determine your actual result. Don’t wait for Epoch before testing; wait for Epoch before quoting the number to your team as a confirmed baseline.
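For your own tests, a minimal timing harness along these lines may be enough. It is a sketch, not a benchmark suite: `generate_fn` is a hypothetical callable you would wrap around your own inference call (with and without the draft model), and it is assumed to return the number of new tokens it produced. The `fake_generate` stub exists only so the example runs without a model.

```python
import time


def tokens_per_second(generate_fn, prompt, runs=3):
    """Best observed tokens/sec over several runs of generate_fn(prompt)."""
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate_fn(prompt)  # assumed to return a token count
        elapsed = max(time.perf_counter() - start, 1e-9)
        best = max(best, n_tokens / elapsed)
    return best


def speedup(baseline_tps, candidate_tps):
    """Ratio of candidate throughput to baseline throughput."""
    return candidate_tps / baseline_tps


def fake_generate(prompt):
    """Stand-in for a real model call; sleeps briefly and reports 32 tokens."""
    time.sleep(0.01)
    return 32


tps = tokens_per_second(fake_generate, "hello")
print(f"stub throughput: {tps:.0f} tokens/sec")
```

Run it once against the base model alone and once with the drafter attached, using the same prompts and generation length, then compare the two with `speedup`. That ratio on your hardware is the number worth reporting internally.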