Technology Daily Brief Vendor Claim

Open Source AI News: NVIDIA's Gated DeltaNet-2 Fixes the Memory Editing Problem in Linear Attention

3 min read arXiv (NVIDIA NVlabs, Hatamizadeh, Choi, Kautz) Partial Strong
NVIDIA's NVlabs team has published Gated DeltaNet-2, an open-source linear attention architecture that decouples memory erase and write operations using channel-wise gates, a targeted fix for a known limitation in recurrent memory editing. According to NVIDIA's technical report, the architecture reduces sequence mixing to linear time and decoding to constant memory, eliminating the unbounded KV cache that makes softmax transformers expensive at inference.
Training scale: 1.3B params, 100B tokens

Key Takeaways

  • NVIDIA's Gated DeltaNet-2 decouples memory erase and write via channel-wise gates, a targeted fix for a known limitation in recurrent linear attention architectures.
  • The architecture delivers constant-memory decoding and linear-time sequence mixing, which the arXiv paper confirms as structural properties; benchmark superiority over Mamba-2, KDA, and Mamba-3 is vendor-reported only.
  • According to NVIDIA's technical report, the 1.3B parameter model trained on 100B FineWeb-Edu tokens; no independent evaluation exists yet.
  • Weights and code are expected on GitHub, but repository availability couldn't be confirmed at publication; verify before building.

Model Release

Gated DeltaNet-2
Organization: NVIDIA AI (NVlabs)
Type: Open Source LLM
Parameters: 1.3B (per NVIDIA technical report)
Benchmark: [SELF-REPORTED] Outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 on standard LM benchmarks at matched parameter scale (vendor claim; no independent evaluation)
Availability: Open-source weights via GitHub (repository not confirmed live at publication)

Verification

Partial: NVIDIA vendor technical report (arXiv 2605.22791). Architectural claims confirmed at abstract level; benchmark rankings and training-scale figures are vendor-reported, with no independent evaluation available.

Your inference pipeline’s memory bill doesn’t shrink by accident. It shrinks when someone solves a specific mathematical problem. NVIDIA’s NVlabs team may have done that.

The team, Ali Hatamizadeh, Yejin Choi, and Jan Kautz, published Gated DeltaNet-2 on arXiv on May 21, 2026. The architecture introduces a channel-wise erase gate on the key axis and a separate write gate on the value axis, replacing the tied scalar delta gate used in the original Gated DeltaNet. That single change is the whole point. Tied scalar gates force erase and write to move together, which constrains how precisely the model can update recurrent memory states. Decoupling them gives the architecture finer control over what gets overwritten and what survives across steps.
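
To make the tied-versus-decoupled distinction concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the update rule from the paper: the function names, the matrix-shaped state `S`, and the exact form of `decoupled_update` are hypothetical, chosen only to show how one scalar gate couples erase and write while separate per-channel gates do not.

```python
# Illustrative sketch only. This is NOT the update rule from the
# Gated DeltaNet-2 paper; all names and the decoupled form are assumptions.
# The memory S is a d_k x d_v matrix stored as a list of rows.
# k is a key vector (length d_k), v is a value vector (length d_v).

def read(S, k):
    """Return the value currently stored under key k, i.e. k^T S."""
    d_v = len(S[0])
    return [sum(k[i] * S[i][j] for i in range(len(k))) for j in range(d_v)]


def tied_update(S, k, v, beta):
    """Delta-rule step with one scalar gate.

    The same beta scales both the erase of the old value and the write
    of the new one: S + beta * k (v - k^T S)^T.
    """
    old = read(S, k)
    return [
        [S[i][j] + beta * k[i] * (v[j] - old[j]) for j in range(len(v))]
        for i in range(len(k))
    ]


def decoupled_update(S, k, v, erase, write):
    """Assumed decoupled step with two independent gates.

    erase has one entry per key channel (length d_k) and scales removal
    of the old value; write has one entry per value channel (length d_v)
    and scales the new value: S - (erase * k) (k^T S)^T + k (write * v)^T.
    """
    old = read(S, k)
    return [
        [
            S[i][j] - erase[i] * k[i] * old[j] + k[i] * write[j] * v[j]
            for j in range(len(v))
        ]
        for i in range(len(k))
    ]
```

In this sketch, filling `erase` and `write` with the same scalar collapses the decoupled form back to the tied one. Setting `erase` to zeros while keeping `write` at ones adds a new value without removing the old one, which a single tied gate cannot express.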

The architectural claims that matter most aren’t the benchmark numbers. They’re the structural ones: the paper confirms the architecture reduces sequence mixing to linear time and decoding to constant memory. No growing KV cache at inference. That’s the property that makes or breaks deployment economics for long-context tasks.
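
A back-of-the-envelope calculation shows why a fixed-size state matters. The sketch below compares the standard KV-cache size formula for a softmax transformer with the size of a per-head matrix state in a recurrent layer. The configuration values (24 layers, 16 heads, head dimension 128, 2 bytes per element) are hypothetical and are not taken from the paper; the function names are likewise made up for illustration.

```python
# Rough memory arithmetic per sequence. All configuration numbers below are
# hypothetical placeholders, not figures reported for Gated DeltaNet-2.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Softmax attention: one key and one value vector per token, per head,
    per layer. The leading 2 accounts for keys plus values."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


def recurrent_state_bytes(layers, heads, key_dim, value_dim, bytes_per_elem=2):
    """Linear attention: one key_dim x value_dim state matrix per head,
    per layer. Note that seq_len does not appear."""
    return layers * heads * key_dim * value_dim * bytes_per_elem


if __name__ == "__main__":
    layers, heads, dim = 24, 16, 128  # assumed config, for illustration only
    state = recurrent_state_bytes(layers, heads, dim, dim)
    for seq_len in (1_024, 32_768, 1_048_576):
        cache = kv_cache_bytes(layers, heads, dim, seq_len)
        print(f"seq_len={seq_len:>9,}  kv_cache={cache / 2**20:>10,.0f} MiB"
              f"  recurrent_state={state / 2**20:,.0f} MiB")
```

The cache term grows in proportion to sequence length while the state term stays fixed, which is the structural property the paper claims. It says nothing about speed on a given device.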

According to NVIDIA’s technical report, the model was trained at 1.3B parameters on 100 billion FineWeb-Edu tokens. NVIDIA’s paper also reports that the architecture outperforms Mamba-2, the original Gated DeltaNet, KDA, and Mamba-3 on standard language modeling benchmarks at matched parameter scale. Those are vendor-reported results; no independent evaluation exists yet. Treat the comparative rankings as a strong hypothesis, not a settled outcome.

Disputed Claim

Gated DeltaNet-2 outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 on standard language modeling benchmarks
Self-reported benchmarks only; no third-party or Epoch AI evaluation exists at time of publication
Treat comparative rankings as a research hypothesis. Wait for independent evaluation before making architecture decisions based on benchmark comparisons.

The catch is that “constant memory decoding” and “linear time mixing” are architectural guarantees, not latency guarantees. A 1.3B linear attention model doesn’t automatically beat a larger softmax transformer on wall-clock throughput for your specific hardware and batch size. The paper doesn’t disclose inference latency figures, cost per token, or hardware requirements. If you’re evaluating this for a production RAG pipeline, those numbers come from your own profiling, not from the benchmark tables.
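
For teams doing that profiling, a minimal throughput harness might look like the sketch below. It assumes a hypothetical `step_fn(batch_size)` callable that runs one decode step for the whole batch; nothing here comes from the paper or from any NVlabs code, and `dummy_step` is only a CPU stand-in so the example runs on its own.

```python
import statistics
import time


def measure_decode_throughput(step_fn, batch_size, n_steps=50, warmup=5, repeats=3):
    """Return median tokens per second over several timed runs.

    step_fn is a hypothetical callable that performs ONE decode step for
    batch_size sequences. On a GPU it must block until the device has
    finished the step, otherwise the timings are meaningless.
    """
    for _ in range(warmup):  # warm caches and lazy initialization
        step_fn(batch_size)
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(n_steps):
            step_fn(batch_size)
        elapsed = time.perf_counter() - start
        rates.append(batch_size * n_steps / elapsed)
    return statistics.median(rates)


def dummy_step(batch_size):
    """Stand-in workload so the sketch is runnable without a model."""
    return sum(i * i for i in range(1000 * batch_size))


if __name__ == "__main__":
    for bs in (1, 8, 32):
        rate = measure_decode_throughput(dummy_step, bs, n_steps=20)
        print(f"batch_size={bs:>3}  tokens/sec={rate:,.0f}")
```

Swapping `dummy_step` for a real model's decode step, and sweeping the batch sizes and context lengths the pipeline actually serves, gives the latency picture the paper leaves out.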

NVIDIA has indicated weights and code will be made available through GitHub. The repository URL couldn’t be confirmed at time of publication, so check the NVlabs GitHub directly before building anything on top of it.

The broader context: this release lands in the middle of a crowded 18-month race to build recurrent architectures that can compete with softmax transformers without their inference costs. Mamba, Mamba-2, Gated DeltaNet, KDA, and Mamba-3 each addressed part of the problem. The contribution of Gated DeltaNet-2 (GDN2) is the erase/write decoupling, a more precise memory editing mechanism that prior architectures in the family couldn’t offer.

What to watch

Watch for independent benchmark evaluation from Epoch AI or a third-party research group. The vendor benchmark comparison is plausible, but GDN2’s claims only become a basis for build decisions once someone outside NVIDIA runs the same tests. Watch also for whether the GitHub repository ships with reproducible training configs; that’s what the practitioner community will need to validate the 100B-token training claims.

Unanswered Questions

  • What are the wall-clock inference latency numbers on standard hardware configurations?
  • What is the cost per token at production batch sizes relative to comparable softmax transformer models?
  • Does the GitHub repository ship with reproducible training configs to validate the 100B-token training claims?
  • When will independent benchmark evaluation (Epoch AI or equivalent) be available?

Don’t expect production-ready adoption in the next quarter. Architectural papers need time for community stress testing, and the benchmark gap between linear attention and softmax transformers at scale is still contested. What GDN2 offers practitioners right now is a concrete research target: if the decoupled gate mechanism holds up under independent evaluation, it changes the calculus for any team building on SSM or linear attention layers.

Wait for independent benchmarks before migrating anything. If the NVlabs repository ships with training code and configs, run the reproduction yourself on a held-out task before committing. That’s what the paper warrants right now: serious attention, not immediate adoption.
