Training a model and running an agent are not the same problem.
Training is batch-oriented, latency-tolerant, and measured in tokens processed per hour across massive parallel runs. Inference for agentic workloads is interactive, latency-sensitive, and measured in tokens per second per user across concurrent long-context sessions, each one maintaining a context window that can stretch to hundreds of thousands of tokens, each one calling external tools, reading memory, and looping back on its own outputs. The hardware that’s efficient for the first task is poorly suited to the second. This isn’t a new observation in AI infrastructure circles, but nobody has built rack-scale hardware explicitly for the agentic pattern until now, or at least, nobody has announced it at the scale NVIDIA announced on May 15.
The Vera Rubin Platform pairs the NVL72 compute engine with the Groq 3 LPX inference accelerator in a co-designed rack-scale system. The NVL72 handles compute density. The Groq 3 LPX handles inference latency: NVIDIA’s developer documentation describes it as engineered for “deterministic execution” in intelligent agent systems, meaning predictable, bounded latency under variable load. The platform is designed to run trillion-parameter mixture-of-experts models with context windows of up to 400,000 tokens. These are real architectural choices in response to real workload requirements. The co-design rationale is that building the inference accelerator and the compute engine as a single rack-scale unit, rather than assembling the system from discrete components, reduces the memory bandwidth bottleneck that typically limits long-context inference on standard GPU clusters.
The 35x Claim: What to Verify Before You Build On It
NVIDIA reports the Vera Rubin NVL72 delivers 35x higher inference throughput per megawatt compared to the GB200 NVL72 for agentic workloads, according to the company’s internal evaluation. NVIDIA states the platform sustains 400 tokens per second per user on trillion-parameter models. These are significant figures if they hold.
They haven’t been independently verified.
The SemiAnalysis newsletter and HashRateIndex have both reported on the claims, and both treat the figures as NVIDIA’s characterization rather than established benchmarks. The cross-reference data available at time of publication confirms NVIDIA makes these claims consistently across multiple pages and contexts. It doesn’t confirm the claims are accurate. The 400 tokens per second figure is particularly unresolved. Adjacent TPS data for trillion-parameter inference exists from other infrastructure contexts (Clarifai has reported 414 TPS on Kimi K2.5 on separate infrastructure), but those numbers don’t validate NVIDIA’s specific platform claim.
Epoch AI hasn’t evaluated Vera Rubin. That’s not unusual for a platform this new, but it’s the gap that matters most. Epoch’s benchmark and compute tracking carries independent methodological weight. When that evaluation does come, the throughput-per-megawatt figure and the tokens-per-second claim are the two numbers worth scrutinizing most carefully, because they’re the ones most likely to be test-condition-specific rather than production-load-representative. Controlled benchmarks typically use fixed context lengths, clean prompt structures, and steady-state load. Real agentic workloads don’t behave that way. Variable context, tool-call overhead, error recovery loops, and concurrent session management all introduce latency variance that vendor benchmarks routinely undercount.
The recommendation: don’t use 35x as a planning assumption. Use it as a due diligence target.
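One way to make that due diligence concrete is to compare tail latency, not averages, when replaying your own agent traffic. The sketch below is a minimal illustration rather than a benchmark harness. The `percentile` and `summarize` helpers are hypothetical names, the percentile uses the nearest-rank method, and the latency samples are made-up placeholders, not measurements from Vera Rubin or any other system.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(label, latencies_ms):
    """Report the mean next to the median and the tail for one test condition."""
    return {
        "label": label,
        "mean_ms": mean(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
    }


# Placeholder data (assumption): a steady-state run versus a run in which
# 5% of requests hit long contexts, tool calls, or error-recovery loops.
steady_state = [100.0] * 100
mixed_load = [100.0] * 95 + [900.0] * 5

for row in (summarize("steady", steady_state), summarize("mixed", mixed_load)):
    print(row)
```

In this toy data the median is identical across both runs while the P99 is not, which is the kind of gap a steady-state benchmark can hide.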
The Pattern: Compute Concentration and Infrastructure Purpose-Building
Vera Rubin doesn’t arrive in isolation. It’s the latest data point in a trend that’s been accumulating in the AI infrastructure story for the past several months.
Prior coverage on why frontier labs are building their own compute stacks captured the upstream dynamic: AI labs have been moving from hyperscaler dependency toward owned or co-owned infrastructure. The analysis on five-year compute contracts documented how that shift is being locked in contractually through long-term commitments that shape which organizations control the compute layer for the next generation. The coverage on hyperscalers as capital infrastructure described the broader financial architecture of that concentration.
Vera Rubin is NVIDIA’s answer to what that compute layer looks like when it’s purpose-built for the agent era rather than retrofitted from training infrastructure. The co-design approach, building inference acceleration into the rack architecture rather than treating it as a bolt-on, is a structural commitment to agentic workload patterns, not a marginal spec improvement.
The pattern also connects to the energy economics story. Throughput per megawatt is the metric NVIDIA chose to headline the 35x claim. That’s not an accident. At rack scale, with 24/7 interactive inference load rather than batch training runs, power efficiency becomes a primary cost driver. The AI infrastructure investment context from markets coverage shows hyperscalers are under pressure on energy costs as AI workloads consume an increasing share of data center capacity. A genuine 35x improvement in inference efficiency per megawatt would be a real cost argument at scale. The question is whether it survives contact with production workloads.
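The arithmetic behind that cost argument is simple enough to sketch. The function below converts a throughput-per-megawatt figure into an energy cost per million tokens. The function name, the baseline throughput, and the electricity price are all hypothetical placeholders, and the calculation deliberately ignores cooling overhead and hardware cost.

```python
import math

SECONDS_PER_HOUR = 3600


def energy_cost_per_million_tokens(tokens_per_sec_per_mw, usd_per_mwh):
    """Energy cost (USD) to generate one million tokens at a given efficiency."""
    if tokens_per_sec_per_mw <= 0:
        raise ValueError("tokens_per_sec_per_mw must be positive")
    # One megawatt sustained for one hour consumes one megawatt-hour.
    tokens_per_mwh = tokens_per_sec_per_mw * SECONDS_PER_HOUR
    return usd_per_mwh / tokens_per_mwh * 1_000_000


# Hypothetical inputs (assumptions, not vendor or market data).
baseline_throughput = 1_000_000.0  # tokens per second per megawatt
electricity_price = 80.0           # USD per megawatt-hour
claimed_gain = 35                  # the vendor-reported multiplier

baseline_cost = energy_cost_per_million_tokens(baseline_throughput, electricity_price)
improved_cost = energy_cost_per_million_tokens(
    baseline_throughput * claimed_gain, electricity_price
)

print(f"baseline: ${baseline_cost:.6f} per million tokens")
print(f"improved: ${improved_cost:.6f} per million tokens")
```

Whatever the real inputs turn out to be, a k-fold gain in throughput per megawatt divides the energy cost per token by k. If the gain only holds under benchmark conditions, the multiplier that belongs in this calculation is the one you measure under your own load.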
Procurement Implications: Questions to Ask Before Committing
The platform is enterprise and OEM only. Pricing is undisclosed. That combination shapes who can actually evaluate this in a procurement cycle and on what timeline.
For organizations at hyperscaler scale, or with existing OEM relationships, Vera Rubin enters the evaluation pipeline now. For mid-market AI teams, the realistic access path is either a hyperscaler instance type built on this infrastructure or an OEM server product, both of which typically lag the underlying platform announcement by 12 to 18 months. Treat this as infrastructure to plan for, not infrastructure to procure this quarter.
The procurement questions that matter most, in priority order:
First, what workload specifically? The platform is designed around trillion-parameter MoE models with 400K context windows. If your production agents are running on 7B- or 70B-parameter models with 8K to 32K context, the efficiency gains may not translate, and smaller models on standard GPU clusters may still be more cost-effective. The 35x figure is for the specific model class NVIDIA targeted. Confirm your workload matches that profile before the number means anything.
Second, what does “deterministic execution” actually guarantee? The Groq 3 LPX latency claim is a design intent, not a contracted SLA. Before committing to rack-scale infrastructure on that basis, get specific commitments on P99 latency under production load conditions: concurrent sessions, variable context lengths, and tool-call overhead included.
Third, what’s the total cost of ownership over 36 months? The throughput-per-megawatt figure addresses energy efficiency. It doesn’t address amortized hardware cost, cooling infrastructure requirements, or the operational complexity of managing a new rack-scale architecture. The full TCO picture won’t be clear until independent buyers run operational deployments.
Don’t expect NVIDIA to answer these questions for you.
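For the third question, a rough model helps show which inputs you will need to collect yourself. The sketch below is a simplified 36-month TCO calculation under stated assumptions: every field name and value is a hypothetical placeholder, cooling is approximated with a single PUE multiplier (facility power divided by IT power), and financing, depreciation schedules, and staffing detail are left out.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # 8,760 hours per year divided by 12


@dataclass
class TcoInputs:
    hardware_usd: float       # upfront hardware cost (pricing is undisclosed)
    avg_power_mw: float       # average IT power draw in megawatts
    usd_per_mwh: float        # electricity price
    pue: float                # facility power / IT power, a proxy for cooling overhead
    ops_usd_per_month: float  # staffing, support, and maintenance
    months: int = 36


def total_cost_of_ownership(x):
    """Break total cost into hardware, energy (including PUE overhead), and operations."""
    energy_mwh = x.avg_power_mw * x.pue * HOURS_PER_MONTH * x.months
    energy_usd = energy_mwh * x.usd_per_mwh
    ops_usd = x.ops_usd_per_month * x.months
    return {
        "hardware_usd": x.hardware_usd,
        "energy_usd": energy_usd,
        "ops_usd": ops_usd,
        "total_usd": x.hardware_usd + energy_usd + ops_usd,
    }


# Placeholder values (assumptions only), chosen to show the shape of the calculation.
example = TcoInputs(
    hardware_usd=5_000_000.0,
    avg_power_mw=0.15,
    usd_per_mwh=80.0,
    pue=1.3,
    ops_usd_per_month=20_000.0,
)
for item, usd in total_cost_of_ownership(example).items():
    print(f"{item}: {usd:,.0f}")
```

Throughput per megawatt only affects the energy line. The hardware and operations lines are the ones that cannot be filled in until pricing is disclosed and independent buyers report on operational deployments.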
TJS Synthesis
Vera Rubin is the most architecturally coherent infrastructure announcement for the agent era to date. The co-design rationale is sound, the workload targeting is specific, and the energy efficiency framing reflects where data center cost pressure is actually heading. None of that means the performance numbers are accurate.
Wait for Epoch AI or equivalent third-party evaluation before treating any Vera Rubin benchmark as a planning input. Specifically: if the 35x throughput-per-megawatt figure holds under real production agentic workload conditions (variable context, concurrent sessions, tool-call overhead), that’s a significant infrastructure inflection. If it only holds under controlled benchmark conditions, it’s a strong marketing number and a weak procurement basis. You won’t know which it is until Q3 2026 at the earliest. Plan accordingly.