The 18-Month Race to Replace Softmax Attention: Where Gated DeltaNet-2 Fits and What Architects Must Know

May 24, 2026 5 min read arXiv (NVIDIA NVlabs, Hatamizadeh, Choi, Kautz) Partial Strong

Tech Jacks Solutions AI News Coverage

Every six months, a new recurrent architecture arrives claiming to solve what softmax transformers can't: constant-memory inference at long context. Mamba, Mamba-2, Gated DeltaNet, KDA, Mamba-3, and now NVIDIA's Gated DeltaNet-2 have each moved the line, but none has ended the debate. The progression reveals something architects need to understand before building on any of them: each release has solved a specific, bounded problem, while leaving adjacent problems intact.


Competitive models: Mamba-2, KDA, Mamba-3 (vendor benchmark)

Key Takeaways

The Mamba-to-GDN2 progression shows each SSM/linear attention release solving a bounded problem. GDN2's contribution is channel-wise erase/write decoupling, not a wholesale architecture redesign.
Constant-memory decoding is a confirmed structural property; vendor benchmark rankings over Mamba-2, KDA, and Mamba-3 are self-reported and require independent evaluation before driving architecture decisions.
The retrieval precision question, whether the channel-wise gate improves long-context fact recall relative to prior variants, is the most important thing independent evaluation needs to answer.
Open-source weights without reproducible training configs provide inference capability but not the validation data architects need; confirm repository contents before committing evaluation resources.

Model Release

Gated DeltaNet-2

Organization: NVIDIA AI (NVlabs)

Type: Open Source LLM

Parameters: 1.3B (per NVIDIA technical report)

Benchmark: [SELF-REPORTED] Outperforms Mamba-2, Gated DeltaNet, KDA, Mamba-3 on standard LM benchmarks (matched parameter scale); vendor claim, no independent eval

Availability: Open-source weights via GitHub (repository not confirmed live at publication)

Gate Mechanism by Architecture

Gated DeltaNet: Tied scalar delta gate (erase + write coupled)

KDA: Key-value decomposition on associative memory

Mamba-3: Refined selective state space mechanism

Gated DeltaNet-2: Channel-wise erase gate (key axis) + separate write gate (value axis)
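
The contrast between the first and last entries of this comparison can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not NVIDIA's formulation: the memory is a small matrix `S` (value rows by key columns), the tied update is the textbook delta rule with one scalar `beta`, and the decoupled update is a hypothetical form in which an `erase` vector scales key channels and a `write` vector scales value channels. The function names and the exact decoupled equation are assumptions made for illustration.

```python
def matvec(S, k):
    """Read from memory: what S currently returns for key k."""
    return [sum(s * kj for s, kj in zip(row, k)) for row in S]


def tied_delta_step(S, k, v, beta):
    """Delta-rule step with one scalar gate shared by erase and write."""
    pred = matvec(S, k)
    return [
        [s + beta * (vi - pi) * kj for s, kj in zip(row, k)]
        for row, vi, pi in zip(S, v, pred)
    ]


def decoupled_delta_step(S, k, v, erase, write):
    """Hypothetical decoupled step (an assumption, not the paper's equation):
    erase[j] gates removal along key channel j, write[i] gates insertion
    along value channel i."""
    pred = matvec(S, k)
    return [
        [s - pi * ej * kj + wi * vi * kj for s, kj, ej in zip(row, k, erase)]
        for row, vi, pi, wi in zip(S, v, pred, write)
    ]


# A 2x3 memory that already stores something along the first two key channels.
S = [[1.0, 1.0, 0.0],
     [0.0, 2.0, 0.0]]
k = [1.0, 1.0, 0.0]
v = [4.0, 0.0]

tied = tied_delta_step(S, k, v, beta=0.5)
# Erase only key channel 0, but write at full strength on both value channels.
selective = decoupled_delta_step(S, k, v, erase=[1.0, 0.0, 0.0], write=[1.0, 1.0])
print(tied)
print(selective)
```

In this sketch, setting every entry of `erase` and `write` to the same constant reproduces the tied step, which is the sense in which a tied gate is a special case of a decoupled one. Both variants leave `S` the same size after every step.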

Six weeks before NVIDIA’s Gated DeltaNet-2 appeared on arXiv, the same competitive landscape looked like this: Mamba-3 had extended state space model capabilities, KDA had introduced key-value decomposition for more expressive associative memory, and the original Gated DeltaNet had added recurrent gating on top of the Delta rule. Each was a genuine improvement. Each left something unsolved. Gated DeltaNet-2 picks up one specific thread.

The memory editing problem the architecture actually solves

The Delta rule, the learning rule underlying DeltaNet and its descendants, works by adjusting a recurrent memory state on each new input. It computes how much to change the memory based on the error between what the memory predicts and what it observes. The original Gated DeltaNet added a scalar gate to control how aggressively that update happens. That gate is tied: a single scalar value governs both erasing old information and writing new information.

Tied gates work. They’re not broken. But they constrain the architecture’s expressiveness in a specific way: the model can’t selectively erase one feature dimension while writing another. It updates everything at the same rate on every step.

According to NVIDIA’s technical report, Gated DeltaNet-2 replaces the tied scalar delta gate with a channel-wise erase gate on the key axis and a separate write gate on the value axis. The paper states that the design keeps sequence mixing linear in time and decoding constant in memory, the core architectural properties that make the entire SSM/linear attention category attractive relative to softmax transformers.

That’s a meaningful separation. Erase operates on key channels. Write operates on value channels. The model can now decide, per feature dimension, what survives and what’s overwritten. In theory, that allows more precise information routing through long sequences.

The competitive arc: what each release actually addressed

Disputed Claim

Gated DeltaNet-2 outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 on standard language modeling benchmarks

Self-reported benchmarks only; no independent evaluation at publication. Community discussion (LinkedIn, X.com) corroborates the comparison framing but not the specific numerical results.

Treat as a strong research hypothesis. The architecture's structural contribution (channel-wise decoupling) is confirmed; the performance claims require independent validation before informing build decisions.

What to Watch

Epoch AI evaluation of GDN2: Unknown; monitor Epoch AI database

NVlabs GitHub repository confirmed live with training configs: Days to weeks

Third-party reproduction of benchmark results on held-out tasks: 4-8 weeks post-repository release

Community eval on retrieval-heavy long-context tasks vs. prior Gated DeltaNet: 6-12 weeks

Mamba introduced selective state spaces, the ability to filter information based on input content, which vanilla SSMs couldn’t do. That was the breakthrough that made the category serious.

Mamba-2 restructured the architecture to use structured state space duality (SSD), connecting SSMs to linear attention more explicitly and enabling more efficient hardware-parallel implementations. It traded some of Mamba’s flexibility for better throughput on accelerators.

Gated DeltaNet brought the Delta rule into the gated recurrent framework, combining associative memory with gating. It was the architecture that first showed the Delta rule could compete on standard language modeling benchmarks at matched scale.

KDA extended this with key-value decomposition, a way to make the associative memory more expressive by factoring the key and value computations. It addressed the representational capacity concern that tied gates introduced.

Mamba-3 pushed the SSM line further with refinements to the selective mechanism and expanded training scale.

Gated DeltaNet-2 sits in a different part of the design space. It’s not primarily about representational capacity or hardware efficiency. It’s about the precision of the memory update mechanism. The channel-wise gate decoupling is a targeted intervention on a specific limitation, not a wholesale architecture redesign.

According to NVIDIA’s technical report, the architecture outperforms Mamba-2, the original Gated DeltaNet, KDA, and Mamba-3 on standard language modeling benchmarks at matched parameter scale (1.3B parameters, 100 billion FineWeb-Edu training tokens). Those are vendor-reported results with no independent evaluation at time of publication. The comparison models are named correctly in community discussion, but the specific benchmark rankings require independent confirmation before they should drive architectural decisions.

What constant-memory decoding actually means for infrastructure
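
A back-of-the-envelope sketch helps size the two memory regimes at stake in constant-memory decoding. The model shape below (80 layers, 8 key-value heads, head dimension 128, 16-bit values) is an assumption chosen to resemble a 70B-class transformer with grouped-query attention, and the recurrent-state shape is purely hypothetical. Neither comes from NVIDIA's report.

```python
GIB = 1024 ** 3


def kv_cache_bytes(seq_len, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Softmax attention: one key and one value vector stored per token,
    per KV head, per layer, so the cache grows linearly with seq_len."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


def recurrent_state_bytes(layers=80, heads=8, key_dim=128, value_dim=128, bytes_per_value=2):
    """Linear attention (hypothetical shape): one key_dim x value_dim matrix
    per head per layer, independent of how many tokens have been read."""
    return layers * heads * key_dim * value_dim * bytes_per_value


for seq_len in (8_192, 131_072):
    kv = kv_cache_bytes(seq_len) / GIB
    state = recurrent_state_bytes() / GIB
    print(f"{seq_len:>7} tokens: KV cache {kv:.2f} GiB, recurrent state {state:.4f} GiB")
```

Under these assumptions a single 128K-token sequence needs about 40 GiB of KV cache, while the recurrent state stays at 20 MiB regardless of context length. Real deployments differ with head counts, precision, and batching, so treat the figures as an order-of-magnitude guide only.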

Softmax attention’s compute cost grows quadratically with sequence length, and its inference memory grows too: the KV cache expands linearly with every token in the sequence. A 128K-token context window on a 70B model requires a KV cache that can consume tens of gigabytes of accelerator memory at inference. Linear attention and SSM architectures replace this with a fixed-size recurrent state. The memory footprint at decoding doesn’t grow with the context. It stays constant. That’s the property GDN2’s arXiv paper confirms directly.

Don’t read that as a free lunch. Constant memory at decoding comes with a different cost: the fixed-size recurrent state is a lossy compression of the full context. Softmax attention keeps every token in its window and can attend to any of them directly. Linear attention must learn to prioritize what to keep. On tasks that require precise retrieval of specific early-context facts, SSMs and linear attention architectures have historically underperformed transformers at matched scale. Whether GDN2’s channel-wise gate improves that retrieval precision is exactly what independent benchmarking needs to test.

The paper doesn’t disclose inference latency figures, cost per token, or hardware requirements in the abstract excerpt available. Those numbers matter as much as benchmark accuracy for any production decision. A linear attention layer that’s faster in theory but requires custom CUDA kernels for efficient execution isn’t immediately deployable for most teams.

The verification gap practitioners can’t skip

NVIDIA’s paper is an arXiv vendor technical report that has not been evaluated by third parties. The authors are NVIDIA employees. The benchmarks are self-reported. That’s normal for an initial research release, and it doesn’t diminish the architectural contribution. But it does define what practitioners can and can’t conclude right now.

What the paper establishes: the channel-wise gate decoupling is a real architectural change, confirmed at the abstract level. The constant-memory and linear-time properties are stated claims consistent with the broader linear attention framework.

What independent evaluation would establish: whether the benchmark rankings hold across diverse tasks and hardware configurations, whether the memory advantage translates to wall-clock latency improvements in real deployment scenarios, and whether the architecture’s retrieval precision has improved meaningfully over prior Gated DeltaNet variants.

Epoch AI’s model evaluation database is the clearest signal to watch. When (and whether) GDN2 appears there determines when the vendor claims become actionable evidence.

The architect decision guide

Unanswered Questions

Does the channel-wise gate improve long-context retrieval precision over the original Gated DeltaNet? This is the key quality question independent evaluation needs to answer.
What are wall-clock inference latency numbers on standard hardware at production batch sizes?
Does the GitHub repository ship with reproducible training configurations to validate the 100B FineWeb-Edu token claim?
How does GDN2's memory-quality tradeoff compare to sliding window attention variants on retrieval-heavy benchmarks?
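
The retrieval questions above can be probed with a simple "needle in a haystack" check. The sketch below is one minimal way to set that up, not a standard benchmark. `generate` is a hypothetical callable (prompt in, text out) standing in for whatever inference interface the released weights expose, and the filler text, question wording, and function names are all assumptions for illustration.

```python
import random


def build_probe(num_filler, needle_position, seed=0):
    """Return (prompt, expected_answer) with one fact buried among filler sentences."""
    rng = random.Random(seed)
    code = str(rng.randint(100000, 999999))
    sentences = [f"Log entry {i}: nothing unusual was recorded." for i in range(num_filler)]
    sentences.insert(needle_position, f"The access code is {code}.")
    prompt = " ".join(sentences) + " Question: what is the access code? Answer:"
    return prompt, code


def recall_by_depth(generate, num_filler=200, depths=(0.0, 0.5, 1.0), trials=5):
    """Fraction of probes answered correctly at each relative needle depth
    (0.0 = fact at the very start of the context, 1.0 = at the very end)."""
    scores = {}
    for depth in depths:
        hits = 0
        for trial in range(trials):
            prompt, answer = build_probe(num_filler, int(depth * num_filler), seed=trial)
            hits += answer in generate(prompt)
        scores[depth] = hits / trials
    return scores


# Stand-in "model" that only sees the last 120 characters of the prompt,
# mimicking a memory that has lost early context. Replace with a real model call.
def tail_only(prompt):
    return prompt[-120:]


print(recall_by_depth(tail_only))
```

Running the same probe against a prior Gated DeltaNet checkpoint and a GDN2 checkpoint, at matched context lengths, gives a first-order comparison of early-context recall before committing to a larger evaluation.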

Analysis

The 18-month SSM/linear attention arc, Mamba through GDN2, shows a pattern: each release solves a bounded problem while the most critical open question (retrieval precision at matched scale vs. softmax transformers) remains empirically contested. GDN2's channel-wise decoupling is the most targeted intervention yet on the memory editing mechanism. Whether it's enough to close the retrieval gap is the central question for this architecture class in 2026.

Right now, GDN2 warrants a specific kind of attention: it’s a research target, not a production decision. Here’s how to frame the evaluation depending on where you are in the stack.

The relevant comparison is GDN2 vs. sliding window attention variants and retrieval-augmented context management. Those are your actual alternatives, and they have more deployment data.

If you’re already building on a linear attention or SSM backbone, the channel-wise gate decoupling is a meaningful contribution to evaluate. The question is whether it improves performance on your specific retrieval-heavy tasks, which requires running your own evals on the open-source weights, not trusting the vendor benchmark tables.

Any architecture you commit to today should be one you can swap out of as the next iteration arrives. The winning architecture in this race probably isn’t the one being announced right now. It’s the one with enough community adoption to generate the real-world evaluation data the vendor papers haven’t produced yet.

NVIDIA has indicated weights and code will be made available through GitHub. The repository URL couldn’t be confirmed at publication time, so that’s the first practical checkpoint. Verify the NVlabs repository is live and ships with reproducible training configs before committing evaluation resources. If the configs aren’t there, the 100B-token training claim can’t be independently validated.

The part nobody mentions in these release cycles: open-source weights without reproducible training configurations are less useful than they appear. You can run inference on them. You can’t confirm how they were built or whether the benchmark conditions match what you’d encounter in your environment.

Run the reproduction. Wait for the independent evaluation. Then decide.
