Open Source AI News: NVIDIA Releases Nemotron-Labs-Diffusion, Tri-Mode LM Claims 5.9x Throughput Gain

May 23, 2026

Tech Jacks Solutions AI News Coverage

NVIDIA has released the Nemotron-Labs-Diffusion model family as open weights on Hugging Face, in 3B, 8B, and 14B variants that generate tokens in parallel blocks rather than one at a time. According to NVIDIA's release data, the 8B model decodes 5.9x more tokens per forward pass than Qwen3-8B; no independent evaluation lab has verified the figure yet.


Throughput claim vs. Qwen3-8B: 5.9x (vendor-reported)

Key Takeaways

NVIDIA released Nemotron-Labs-Diffusion (3B, 8B, 14B) as open weights on Hugging Face, available to test now
NVIDIA reports 5.9x more tokens per forward pass vs. Qwen3-8B; the figure is vendor-reported and has been repeated by developer community sources, but not independently evaluated
Full throughput gains require GB200 or H100 hardware with SGLang integration; performance on previous-generation hardware isn't documented
No independent evaluation lab (Epoch AI, LMSYS) has assessed the model yet; wait for third-party benchmarks before production migration

Model Release

Nemotron-Labs-Diffusion (3B / 8B / 14B)

Organization: NVIDIA
Type: Open Source LLM
Parameters: 3B, 8B, 14B
Benchmark: [SELF-REPORTED] 5.9x tokens/forward pass vs. Qwen3-8B (8B model); 4x SPEED-Bench throughput on GB200 with SGLang (pending independent verification)
Availability: Open weights, huggingface.co/collections/nvidia/nemotron-labs-diffusion

Autoregressive models pick tokens one by one. That’s the bottleneck. NVIDIA’s Nemotron-Labs-Diffusion, released as open weights this week, attempts a different approach: generating multiple tokens per forward pass using a diffusion-based architecture that switches inference modes depending on the task.

The model family covers three sizes, 3B, 8B, and 14B, in both base and instruct variants. What’s unusual is that three inference modes (Autoregressive, Diffusion, and Self-Speculation) all run on the same model weights. According to community testing documentation, the architecture changes its attention pattern at inference time to switch between modes. Per NVIDIA’s release data, Self-Speculation mode drives the headline figure: 5.9x more tokens decoded per forward pass than Qwen3-8B at equivalent accuracy. The figure has been repeated across developer community sources on LinkedIn and X, though all of those sources appear to be reporting NVIDIA’s own numbers rather than running independent tests.
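To make "tokens per forward pass" concrete, here is a toy simulation of a generic draft-and-verify loop, the general idea behind speculative decoding. It is not NVIDIA's implementation: the announcement doesn't describe how Self-Speculation works internally, so the function name, the fixed per-token acceptance probability, and the choice to count only verification passes (ignoring any drafting cost) are all illustrative assumptions.

```python
import random


def simulate_draft_and_verify(total_tokens, block_size, accept_prob, seed=0):
    """Return average tokens emitted per verification pass in a toy model.

    Assumptions (hypothetical, for illustration only):
    - each pass drafts `block_size` tokens;
    - each drafted token is accepted independently with `accept_prob`,
      and acceptance stops at the first rejection;
    - every pass still emits one token from the verifier itself, so
      progress is never less than one token per pass;
    - drafting cost is ignored; only verification passes are counted.
    """
    rng = random.Random(seed)
    produced = 0
    passes = 0
    while produced < total_tokens:
        accepted = 0
        for _ in range(block_size):
            if rng.random() < accept_prob:
                accepted += 1
            else:
                break
        # Accepted drafts plus one verifier token, capped at the remaining budget.
        produced += min(accepted + 1, total_tokens - produced)
        passes += 1
    return produced / passes


# With no accepted drafts this degenerates to one token per pass,
# which is the autoregressive baseline.
baseline = simulate_draft_and_verify(100, block_size=4, accept_prob=0.0)
# With every draft accepted, each pass emits block_size + 1 tokens.
best_case = simulate_draft_and_verify(100, block_size=4, accept_prob=1.0)
print(baseline, best_case)  # 1.0 5.0
```

Note that tokens per forward pass is not the same thing as wall-clock throughput: a pass that verifies a block can cost more than a single-token pass, which is one reason to measure on your own hardware.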

The catch is the hardware dependency. The highest throughput claims, approximately 4x improvement on the SPEED-Bench benchmark, are tied to SGLang integration running on GB200 or H100 hardware, per NVIDIA’s technical documentation. If your inference stack runs on A100s or older, don’t expect those numbers. The announcement doesn’t address what throughput looks like on previous-generation hardware at scale.

Disputed Claim

4x higher throughput on SPEED-Bench with SGLang on GB200

The specific claim was not found in verified source content: the SGLang repo snippet contains navigation only, and the primary NVIDIA dev.to source link is broken. The hardware-specific result may not generalize to A100 or older configurations.

Test on your target hardware configuration before treating this as a production baseline.

Open weights are live now. The NVIDIA Nemotron-Labs-Diffusion collection on Hugging Face is confirmed active, making this immediately testable for any team with compatible hardware. The SGLang serving framework, which NVIDIA cites for its SPEED-Bench integration, is a real, actively maintained project. Whether it delivers 4x throughput in your environment is what teams will need to test independently.

No independent evaluation lab has assessed Nemotron-Labs-Diffusion yet. The SPEED-Bench throughput claim specifically hasn’t been confirmed from verified source content beyond NVIDIA’s own release. The underlying architecture paper is cited as arXiv:2512.14067 (Efficient-DLM), though that paper’s classification as vendor-authored versus independent research hasn’t been confirmed.

This isn’t the first time inference efficiency has driven a major open-weights release. NVIDIA’s move follows a pattern visible across the past two quarters: inference costs are under significant competitive pressure, and architectural differentiation (not just scale) is how vendors are competing. Diffusion LMs are one proposed path out of the autoregressive speed ceiling. Speculative decoding is another. The architectures are multiplying faster than practitioners can benchmark them.

Unanswered Questions

What's the throughput delta on A100 vs. GB200? The announcement only documents GB200 performance.
Does accuracy hold across the 5.9x throughput gain at longer context lengths (>16K tokens)?
What's the token cost comparison vs. Qwen3-8B at volume when factoring in hardware requirements?
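The cost question above reduces to simple arithmetic once you have measured throughput and a GPU price. The sketch below shows that arithmetic; the function names are hypothetical, and any prices or throughput numbers you pass in are your own inputs, not figures from NVIDIA's release.

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second):
    """Serving cost in USD per one million generated tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


def break_even_speedup(baseline_gpu_hourly_usd, candidate_gpu_hourly_usd):
    """Speedup the candidate setup needs to match the baseline's per-token cost.

    If the candidate hardware costs 3x as much per hour, it must generate
    tokens 3x as fast just to break even.
    """
    if baseline_gpu_hourly_usd <= 0:
        raise ValueError("baseline_gpu_hourly_usd must be positive")
    return candidate_gpu_hourly_usd / baseline_gpu_hourly_usd


# Placeholder inputs, chosen only so the arithmetic is easy to check.
print(cost_per_million_tokens(gpu_hourly_usd=3.6, tokens_per_second=1000))
print(break_even_speedup(baseline_gpu_hourly_usd=2.0, candidate_gpu_hourly_usd=6.0))
```

A throughput multiple only lowers cost if it exceeds the break-even ratio for the hardware it requires.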

What to watch

The first independent benchmark run on publicly available hardware. Community test results on A100 and H100 configurations, not GB200 showcase numbers, will reveal whether the throughput claims hold outside NVIDIA’s optimal conditions. Watch for Epoch AI or LMSYS coverage in the next two to four weeks.

Wait for independent benchmarks before migrating any production inference pipeline. The open weights are worth downloading and testing today, but the 5.9x claim is vendor-reported and the 4x SGLang figure requires confirmation on your specific hardware. Run your own latency and throughput tests at your target context length before committing to an infrastructure shift.
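A minimal harness for that kind of test might look like the sketch below. It assumes you wrap your serving stack in a callable that takes a prompt and a token budget and returns the generated token IDs; `benchmark` and `fake_generate` are hypothetical names, and the stub stands in for a real model call so the example runs anywhere.

```python
import statistics
import time
from typing import Callable, Sequence


def benchmark(
    generate: Callable[[str, int], Sequence[int]],
    prompt: str,
    max_new_tokens: int,
    runs: int = 5,
    warmup: int = 1,
) -> dict:
    """Time repeated generate() calls and report latency and throughput."""
    # Warm-up calls are excluded so one-time setup cost doesn't skew results.
    for _ in range(warmup):
        generate(prompt, max_new_tokens)

    latencies = []
    total_tokens = 0
    for _ in range(runs):
        start = time.perf_counter()
        output = generate(prompt, max_new_tokens)
        latencies.append(time.perf_counter() - start)
        total_tokens += len(output)

    elapsed = sum(latencies)
    return {
        "median_latency_s": statistics.median(latencies),
        "tokens_per_second": total_tokens / elapsed if elapsed > 0 else float("inf"),
        "total_tokens": total_tokens,
    }


def fake_generate(prompt: str, max_new_tokens: int) -> Sequence[int]:
    """Stand-in for a real model call; replace with your own client."""
    time.sleep(0.001)
    return list(range(max_new_tokens))


result = benchmark(fake_generate, "example prompt", max_new_tokens=8, runs=3)
print(result["total_tokens"])  # 24
```

Run it once per candidate model with prompts at your target context length, on the hardware you actually plan to deploy, and compare the resulting numbers rather than the published multiples.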