The numbers keep going up. That’s the story AI coverage usually tells. Bigger models, longer context windows, higher benchmark scores. What that framing misses is the more interesting architectural question underneath: where is the computation actually going?
NVIDIA’s Nemotron 3 Super, announced March 10, has 120 billion parameters. That headline figure is nearly meaningless on its own. The number that matters is 12 billion, the active parameters at inference. The model activates roughly 10% of its total parameters for any given forward pass. The other 90% sit idle until the routing logic decides they’re relevant.
That’s mixture-of-experts (MoE) architecture in practice.
What MoE Actually Does
In a dense model, every parameter participates in every inference. A 70-billion-parameter dense model costs you 70 billion parameters' worth of computation per token, every time. MoE breaks the model into specialized subnetworks ("experts") and routes each input to a small subset of them. The result is a model with the capacity of a large architecture but the inference cost of a much smaller one.
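The routing idea can be sketched in a few lines. This is a toy illustration of top-k gating, not Nemotron's actual router (whose expert count, k, and gating details NVIDIA hasn't fully specified in the material covered here): a small learned router scores every expert for each token, and only the top-k experts actually run.

```python
# Toy top-k MoE routing sketch (illustrative only; all sizes are made up).
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k, d_model = 8, 2, 16

# Each "expert" is just a weight matrix here; the router scores them per token.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(num_experts)]
router = rng.standard_normal((d_model, num_experts)) * 0.02

def moe_forward(x):
    """Route one token vector x to its top-k experts; the rest stay idle."""
    logits = x @ router                    # one score per expert
    chosen = np.argsort(logits)[-top_k:]   # indices of the top-k experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                   # softmax over the chosen experts only
    # Only top_k of num_experts expert matrices are touched for this token.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

out = moe_forward(rng.standard_normal(d_model))
print(out.shape)  # (16,)
```

The key property is in the last line of `moe_forward`: compute scales with `top_k`, while capacity scales with `num_experts`.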
Nemotron 3 Super’s 120B-total / 12B-active design means you get the representational range of a 120-billion-parameter model at roughly the compute cost of a 12-billion-parameter one. According to NVIDIA’s internal evaluation, the model delivers up to 5x higher throughput than the previous Nemotron Super. That figure is vendor-reported, and independent benchmarks haven’t confirmed it yet, but the architectural reason such gains are plausible is the MoE routing itself, not marketing.
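The compute saving follows from simple arithmetic. As a rough rule of thumb (an approximation, ignoring attention and routing overhead), a forward pass costs about 2 FLOPs per active parameter per token:

```python
# Back-of-envelope: per-token forward-pass FLOPs scale with *active*
# parameters, at roughly 2 FLOPs per active parameter per token.
total_params = 120e9   # Nemotron 3 Super, total
active_params = 12e9   # active per forward pass

dense_equiv_flops = 2 * total_params   # if every parameter participated
moe_flops = 2 * active_params          # with sparse MoE routing

print(f"dense-equivalent: {dense_equiv_flops:.1e} FLOPs/token")
print(f"MoE:              {moe_flops:.1e} FLOPs/token")
print(f"ratio:            {dense_equiv_flops / moe_flops:.0f}x")  # 10x
```

A 10x reduction in per-token FLOPs doesn't translate one-to-one into throughput (memory bandwidth and expert load balancing intervene), which is why the 5x vendor figure is plausible but still needs independent measurement.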
Where Mamba Comes In
The second architectural choice in Nemotron 3 Super is the Mamba component. Standard Transformer attention has a well-documented problem at long context: computational cost scales quadratically with sequence length. Double the context, quadruple the attention cost (roughly). For most tasks, this is manageable. For agentic AI systems, it becomes a serious constraint.
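The scaling gap is easy to see numerically. Ignoring constant factors, attention does pairwise work over the sequence while a linear-state model does one update per token:

```python
# Rough scaling comparison (constants omitted): attention's pairwise token
# interactions grow quadratically with context length; a linear-state
# model's updates grow linearly.
for n in (8_000, 128_000, 1_000_000):
    attention_ops = n * n   # O(n^2) pairwise interactions
    linear_ops = n          # O(n) state updates
    print(f"{n:>9} tokens: attention ~{attention_ops:.1e}, "
          f"linear ~{linear_ops:.1e}, gap {attention_ops // linear_ops:,}x")
```

At 128K tokens the gap is a factor of 128,000; at 1M tokens it is a factor of a million. That is the pressure driving hybrid designs.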
Agentic systems accumulate context fast. A multi-step reasoning loop, a tool-use chain, a retrieval-augmented workflow with large retrieved passages: none of these fits comfortably in a 128K-token window, and all of them stress the attention mechanism hard. NVIDIA’s native 1-million-token context window for Nemotron 3 Super doesn’t come for free. It requires a different computational approach.

Mamba is a state-space model (SSM). Instead of attending to the full sequence at every layer, Mamba maintains a compressed state that updates as new tokens arrive. The computational cost scales linearly with sequence length, not quadratically. At very long contexts, that difference compounds. The Mamba-Transformer hybrid in Nemotron 3 Super uses each component where it performs best: attention for the local context relationships where it excels, Mamba’s linear recurrence for the long-range dependencies where attention becomes expensive.
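The constant-cost-per-token property is the whole point, and a toy recurrence shows it. This sketch keeps the state matrices fixed; real Mamba makes them input-dependent ("selective"), but the linear scan structure is the same:

```python
# Toy linear state-space recurrence: the state h is a fixed-size summary
# updated once per token, so per-token cost is constant no matter how
# long the sequence already is. (Real Mamba uses input-dependent,
# "selective" parameters; A, B, C are fixed here for simplicity.)
import numpy as np

rng = np.random.default_rng(1)
d_state, d_in = 4, 3
A = np.eye(d_state) * 0.9                        # state decay
B = rng.standard_normal((d_state, d_in)) * 0.1   # input projection
C = rng.standard_normal((d_in, d_state)) * 0.1   # output projection

def ssm_scan(xs):
    """Process tokens one at a time with a constant-size state."""
    h = np.zeros(d_state)
    ys = []
    for x in xs:              # one O(1) update per token: linear overall
        h = A @ h + B @ x     # compress all history into fixed-size h
        ys.append(C @ h)
    return np.stack(ys)

out = ssm_scan(rng.standard_normal((1000, d_in)))
print(out.shape)  # (1000, 3)
```

Unlike attention, the loop body never looks back at earlier tokens; everything it needs from the past lives in `h`, which is why memory and compute stay flat as the sequence grows.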
Why This Matters for Agentic Workloads Specifically
Three characteristics of agentic AI systems make this architecture particularly relevant.
First, context accumulation. An orchestration loop running multiple tool calls builds a long context fast. The 1M-token native window isn’t useful unless you can process it at practical latency. The hybrid SSM approach is what makes that window usable, not just listed on a spec sheet.
Second, throughput under parallelism. Production agentic deployments rarely run one agent. They run dozens or hundreds simultaneously. Throughput per dollar matters enormously. The MoE routing that drives NVIDIA’s reported throughput gains directly addresses this constraint, though developers should treat vendor throughput figures as a starting point for their own benchmarking, not a deployment guarantee.
Third, the open model factor. Nemotron 3 Super is released as an open model. Developers can run it on their own infrastructure, fine-tune it for domain-specific agentic tasks, and audit its behavior in ways closed API models don’t permit. For organizations with data sensitivity requirements or compliance obligations around where their inference happens, that matters.
What Practitioners Should Evaluate
This architectural shift (sparse activation via MoE, linear-cost long context via SSMs, open weights) isn’t isolated to NVIDIA. It’s the same pattern driving the efficiency focus across multiple labs this cycle. The AMD CPU orchestration brief in this hub covers the related infrastructure story: the efficiency-over-scale shift is showing up in silicon decisions, not just model design.
Before committing to Nemotron 3 Super for production agentic workloads, practitioners should test on their actual workload distribution, not synthetic benchmarks. NVIDIA’s throughput figures are from internal evaluation. Real throughput depends on batch size, sequence length distribution, hardware configuration, and the specific mix of expert activations your use case drives. The architectural advantages are real. The specific numbers need your validation.
The pattern, though, is clear. 2026 AI architecture is converging on sparse, hybrid, efficient designs. The relevant question for infrastructure decisions isn’t which model has the highest parameter count. It’s which architecture matches your inference cost structure, context requirements, and deployment constraints. Nemotron 3 Super is the latest data point in that answer, and it’s worth understanding why it’s built the way it is.
Related: NVIDIA’s March 10 announcement also included the Vera Rubin platform; the strategic infrastructure angle is covered in NVIDIA Releases Nemotron 3 Super and Announces Vera Rubin, and the investor perspective in NVIDIA Is Building Both the Model and the Machine.