Nvidia has unveiled its Vera Rubin compute platform with an architecture designed to power agentic artificial intelligence (AI) systems that think and reason rather than simply retrieve information.

The announcement marks a move by Nvidia to address the exponential rise in AI compute requirements driven by the three laws of scaling: pre-training, post-training and test-time scaling. In test-time scaling, AI models generate better results by spending more compute cycles “thinking” during the inference stage.

Speaking at a virtual media briefing ahead of CES 2026, Dion Harris, Nvidia’s senior director of high-performance computing and AI hyperscale infrastructure, detailed the Vera Rubin NVL72, a fully liquid-cooled rack-scale system that integrates six distinct chips, including the new Vera CPU and Rubin graphics processing unit (GPU).

“Over the last year, we’ve seen an incredible leap in the intelligence of language models,” said Harris. “Top models like Kimi K2 Thinking employ reasoning during inference, generating more tokens for better answers. This increase in tokens requires an increase in compute.”

The Vera Rubin platform succeeds the current-generation Blackwell architecture. The new Rubin GPU features high-bandwidth memory delivering up to 22 terabytes per second and a third-generation transformer engine.

Compared with Blackwell, the Rubin GPU is five times faster at inference and 3.5 times faster at training, according to Nvidia. The system is built to handle mixture-of-experts (MoE) models, which require massive all-to-all communication between GPUs.
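To illustrate why MoE models generate all-to-all traffic, the following minimal sketch routes each token to its top-scoring experts and groups the work by the GPU that hosts each expert. It is an illustration only, not Nvidia’s implementation: the function name `route_tokens`, the router scores and the layout of two experts per GPU are all assumptions made for the example.

```python
def route_tokens(scores, top_k, experts_per_gpu):
    """Group (token, expert) assignments by the GPU hosting each expert.

    scores[t][e] is a hypothetical router score for token t and expert e.
    Experts are assumed to be laid out contiguously across GPUs.
    """
    dispatch = {}
    for token, row in enumerate(scores):
        # Pick the top_k highest-scoring experts for this token.
        ranked = sorted(range(len(row)), key=lambda e: row[e], reverse=True)
        for expert in ranked[:top_k]:
            gpu = expert // experts_per_gpu
            dispatch.setdefault(gpu, []).append((token, expert))
    return dispatch


# Illustrative scores: 3 tokens, 4 experts, 2 experts per GPU (2 GPUs).
scores = [
    [0.1, 0.7, 0.0, 0.2],
    [0.6, 0.1, 0.1, 0.2],
    [0.0, 0.1, 0.5, 0.4],
]
dispatch = route_tokens(scores, top_k=2, experts_per_gpu=2)
for gpu, work in sorted(dispatch.items()):
    print(f"GPU {gpu}: {work}")
```

In this toy example two of the three tokens need experts on both GPUs, so their activations must be sent across the interconnect wherever they started. Scaled to many experts and many GPUs, that pattern is the all-to-all exchange described above.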

“Rubin provides the performance necessary for the most demanding MoE models,” Harris said. “With the Vera Rubin architecture, we’re helping our partners and customers build the world’s largest, most advanced AI systems at the lowest cost.”

On the CPU side, Harris said Vera is built for data movement and agentic processing with 88 custom Olympus Arm cores. “Vera doubles data processing, compression and code compilation performance versus our prior-generation Grace CPU across MoE training and inference,” he added.

A key technical hurdle being addressed by Vera Rubin is the management of the key-value (KV) cache, the context memory required for long-running AI interactions. As AI agents maintain state over time, GPU memory becomes a scarce resource.
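A rough back-of-the-envelope calculation shows how quickly context memory grows. The sketch below uses the standard estimate for a transformer’s KV cache (one key and one value vector per layer, per KV head, per token). The model dimensions are hypothetical and do not describe any specific model or Nvidia product.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit (FP16/BF16) storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model: 64 layers, 8 KV heads, head dimension 128.
per_token = kv_cache_bytes(64, 8, 128, seq_len=1)
long_context = kv_cache_bytes(64, 8, 128, seq_len=128_000)

print(f"Per token: {per_token / 2**10:.0f} KiB")
print(f"128,000-token context: {long_context / 2**30:.2f} GiB")
```

Under these assumptions a single 128,000-token session needs roughly 31 GiB of cache, and that figure multiplies with every concurrent session an agent keeps alive.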

To that end, Nvidia announced its inference context memory storage platform, which creates a tier of memory specifically for inference. Sitting between the GPU and traditional storage, it is powered by Nvidia’s BlueField-4 data processing unit (DPU) and Spectrum-X Ethernet networking.
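The general idea of an intermediate memory tier can be sketched as a two-level cache: when fast memory fills up, the least recently used session’s context is moved to a larger, slower tier rather than discarded, and is promoted back when that session resumes. This is a conceptual sketch only. The class `TieredKVCache`, its methods and the least-recently-used eviction policy are assumptions for illustration and do not reflect how Nvidia’s platform works internally.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier cache: a small 'gpu' tier backed by a larger 'context_tier'."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity  # max sessions held in the fast tier
        self.gpu = OrderedDict()          # fast tier, ordered oldest to newest use
        self.context_tier = {}            # slower tier for evicted context

    def put(self, session_id, kv_blocks):
        # Store in the fast tier, demoting least recently used sessions if full.
        self.context_tier.pop(session_id, None)
        self.gpu[session_id] = kv_blocks
        self.gpu.move_to_end(session_id)
        while len(self.gpu) > self.gpu_capacity:
            victim, blocks = self.gpu.popitem(last=False)
            self.context_tier[victim] = blocks

    def get(self, session_id):
        # Return (kv_blocks, tier_name); promote on a context-tier hit.
        if session_id in self.gpu:
            self.gpu.move_to_end(session_id)
            return self.gpu[session_id], "gpu"
        if session_id in self.context_tier:
            blocks = self.context_tier.pop(session_id)
            self.put(session_id, blocks)
            return blocks, "context_tier"
        return None, "miss"


cache = TieredKVCache(gpu_capacity=2)
for session in ("a", "b", "c"):
    cache.put(session, f"kv-for-{session}")

print(cache.get("c"))  # still in the fast tier
print(cache.get("a"))  # was demoted, now promoted back
print(cache.get("z"))  # never stored
```

Without the second tier, the evicted session’s context would have to be recomputed from scratch when the agent returned to it, which is the cost an intermediate tier is meant to avoid.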

“Compared to traditional network storage used in inference contexts, this platform delivers up to five times more tokens per second, five times better performance per TCO [total cost of ownership] dollar, and five times better power efficiency, which translates directly into higher throughput, lower latency, and more predictable behaviour,” Harris said.

Nvidia confirmed that Vera Rubin-based products will be available from partners in the second half of 2026, with Microsoft Azure and CoreWeave among the first cloud service providers to deploy instances.