Open Source AI News: NVIDIA's Gated DeltaNet-2 Fixes the Memory Editing Problem in Linear Attention

May 24, 2026 3 min read arXiv (NVIDIA NVlabs, Hatamizadeh, Choi, Kautz) Partial Strong

Tech Jacks Solutions AI News Coverage

NVIDIA's NVlabs team has published Gated DeltaNet-2, an open-source linear attention architecture that decouples memory erase and write operations using channel-wise gates, a targeted fix for a known limitation in recurrent memory editing. According to NVIDIA's technical report, the architecture reduces sequence mixing to linear time and decoding to constant memory, eliminating the unbounded KV cache that makes softmax transformers expensive at inference.


Training scale: 1.3B parameters, 100B tokens

Key Takeaways

NVIDIA's Gated DeltaNet-2 decouples memory erase and write via channel-wise gates, a targeted fix for a known limitation in recurrent linear attention architectures.
The architecture delivers constant-memory decoding and linear-time sequence mixing, which are structural properties stated in the arXiv paper; benchmark superiority over Mamba-2, KDA, and Mamba-3 is vendor-reported only.
According to NVIDIA's technical report, the model has 1.3B parameters and was trained on 100B FineWeb-Edu tokens; no independent evaluation exists yet.
Weights and code are expected on GitHub, but repository availability couldn't be confirmed at publication; verify before building.
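
To make the constant-memory takeaway concrete, here is a back-of-envelope sketch in Python. The layer count, head count, head dimension, and 2-byte precision are hypothetical values chosen for illustration, not figures from NVIDIA's report. The assumption that each head keeps one square state matrix is a common convention for delta-rule linear attention, not a confirmed detail of Gated DeltaNet-2.

```python
# Hypothetical dimensions for illustration only; NOT taken from the paper.
NUM_LAYERS = 24
NUM_HEADS = 16
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumes fp16/bf16 storage


def kv_cache_bytes(seq_len: int) -> int:
    """Softmax attention: one key and one value vector per token, per head, per layer."""
    per_token = 2 * NUM_LAYERS * NUM_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * seq_len


def recurrent_state_bytes(seq_len: int) -> int:
    """Linear attention (assumed form): one HEAD_DIM x HEAD_DIM state per head, per layer.

    seq_len is accepted only to mirror kv_cache_bytes; the result does not depend on it.
    """
    return NUM_LAYERS * NUM_HEADS * HEAD_DIM * HEAD_DIM * BYTES_PER_VALUE


if __name__ == "__main__":
    mib = 2 ** 20
    for n in (1_024, 32_768, 1_048_576):
        print(f"{n:>9} tokens: KV cache {kv_cache_bytes(n) / mib:>10.1f} MiB, "
              f"recurrent state {recurrent_state_bytes(n) / mib:.1f} MiB")
```

Under these assumed dimensions the softmax cache costs 192 KiB per token and overtakes the 12 MiB fixed state after 64 tokens. The sketch covers memory only; it says nothing about latency or throughput, which the paper does not report.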

Model Release

Gated DeltaNet-2

Organization: NVIDIA AI (NVlabs)

Type: Open Source LLM

Parameters: 1.3B (per NVIDIA technical report)

Benchmark: [SELF-REPORTED] Outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 on standard LM benchmarks at matched parameter scale (vendor claim; no independent evaluation)

Availability: Open-source weights via GitHub (repository not confirmed live at publication)

Your inference pipeline’s memory bill doesn’t shrink by accident. It shrinks when someone solves a specific mathematical problem. NVIDIA’s NVlabs team may have done that. The team (Ali Hatamizadeh, Yejin Choi, and Jan Kautz) published Gated DeltaNet-2 (GDN2) on arXiv on May 21, 2026.

The architecture introduces a channel-wise erase gate on the key axis and a separate write gate on the value axis, replacing the tied scalar delta gate used in the original Gated DeltaNet. That single change is the whole point. A tied scalar gate forces erase and write to move together, which limits how precisely the model can update its recurrent memory state. Decoupling them gives the architecture finer control over what gets overwritten and what survives across steps.

The architectural claims that matter most aren’t the benchmark numbers. They’re the structural ones: per the paper, the architecture reduces sequence mixing to linear time and decoding to constant memory. No growing KV cache at inference. That’s the property that makes or breaks deployment economics for long-context tasks.

According to NVIDIA’s technical report, the model was trained at 1.3B parameters on 100 billion FineWeb-Edu tokens. NVIDIA’s paper also reports that the architecture outperforms Mamba-2, the original Gated DeltaNet, KDA, and Mamba-3 on standard language modeling benchmarks at matched parameter scale. Those are vendor-reported results; no independent evaluation exists yet. Treat the comparative rankings as a strong hypothesis, not a settled outcome.

The catch is that “constant memory decoding” and “linear time mixing” are architectural guarantees, not latency guarantees. A 1.3B linear attention model doesn’t automatically beat a larger softmax transformer on wall-clock throughput for your specific hardware and batch size. The paper doesn’t disclose inference latency figures, cost per token, or hardware requirements.

NVIDIA has indicated weights and code will be made available through GitHub. The repository URL couldn’t be confirmed at time of publication; check the NVlabs GitHub directly before building anything on top of it.

The broader context: this release lands in the middle of a crowded 18-month race to build recurrent architectures that can compete with softmax transformers without their inference costs. Mamba, Mamba-2, Gated DeltaNet, KDA, and Mamba-3 each addressed part of the problem. GDN2’s contribution is the erase/write decoupling, a more precise memory editing mechanism that prior architectures in the family couldn’t offer.
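
The difference between a tied scalar gate and decoupled erase/write gates can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's equations: the update formulas, the function names, and the omission of any decay gate or normalization are all assumptions made here to show the idea on a 2 x 2 state.

```python
from typing import List

Matrix = List[List[float]]  # recurrent state S, shape (d_k, d_v)
Vector = List[float]


def read(S: Matrix, k: Vector) -> Vector:
    """Return S^T k: the value currently stored under key k."""
    return [sum(S[i][j] * k[i] for i in range(len(k))) for j in range(len(S[0]))]


def tied_update(S: Matrix, k: Vector, v: Vector, beta: float) -> Matrix:
    """Delta rule with one scalar gate: erase and write strength move together."""
    old = read(S, k)
    return [[S[i][j] + beta * k[i] * (v[j] - old[j]) for j in range(len(v))]
            for i in range(len(k))]


def decoupled_update(S: Matrix, k: Vector, v: Vector,
                     erase: Vector, write: Vector) -> Matrix:
    """Assumed decoupled form: erase gate per key channel, write gate per value channel."""
    old = read(S, k)
    return [[S[i][j] - erase[i] * k[i] * old[j] + k[i] * write[j] * v[j]
             for j in range(len(v))]
            for i in range(len(k))]


if __name__ == "__main__":
    k = [1.0, 0.0]
    S = [[0.0, 0.0], [0.0, 0.0]]
    S = decoupled_update(S, k, [3.0, 4.0], [1.0, 1.0], [1.0, 1.0])
    print("after first write:   ", S)
    # Erase without writing: not expressible with one scalar gate.
    print("erase only:          ", decoupled_update(S, k, [10.0, 20.0], [1.0, 1.0], [0.0, 0.0]))
    # Write without erasing: the new value accumulates on top of the old one.
    print("write only:          ", decoupled_update(S, k, [10.0, 20.0], [0.0, 0.0], [1.0, 1.0]))
    # Tied gate at full strength: always overwrite.
    print("tied gate, beta = 1: ", tied_update(S, k, [10.0, 20.0], 1.0))
```

When every erase and write channel is set to the same value, the decoupled form in this sketch reduces to the tied update. The state keeps its (d_k, d_v) shape no matter how many updates are applied, which is the constant-memory property in miniature.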

Disputed Claim

Gated DeltaNet-2 outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 on standard language modeling benchmarks

Self-reported benchmarks only; no third-party or Epoch AI evaluation exists at time of publication

Treat comparative rankings as a research hypothesis. Wait for independent evaluation before making architecture decisions based on benchmark comparisons.

What to watch

Watch for independent benchmark evaluation from Epoch AI or a third-party research group. The vendor benchmark comparison is plausible, but GDN2’s claims only become a basis for build decisions once someone outside NVIDIA runs the same tests. Watch also for whether the GitHub repository ships with reproducible training configs; that’s what the practitioner community will need to validate the 100B-token training claims.

Don’t expect production-ready adoption in the next quarter. Architectural papers need time for community stress testing, and the benchmark gap between linear attention and softmax transformers at scale is still contested. What GDN2 offers practitioners right now is a concrete research target: if the decoupled gate mechanism holds up under independent evaluation, it changes the calculus for any team building on SSM or linear attention layers.

Wait for independent benchmarks before migrating anything. If the NVlabs repository ships with training code and configs, run the reproduction yourself on a held-out task before committing. That’s what the paper warrants right now: serious attention, not immediate adoption.

Unanswered Questions

What are the wall-clock inference latency numbers on standard hardware configurations?
What is the cost per token at production batch sizes relative to comparable softmax transformer models?
Does the GitHub repository ship with reproducible training configs to validate the 100B-token training claims?
When will independent benchmark evaluation (Epoch AI or equivalent) be available?