www.eetimes.com, Sept. 10, 2025 –
SANTA CLARA, Calif. – At the AI Infra Summit, Nvidia VP of HPC and Hyperscale Ian Buck announced that the next generation of Nvidia GPUs will have a specialized family member designed specifically for the initial part of transformer inference workloads.
“In inference, especially, performance is important,” Buck said. “Performance really is paramount to generating tokens… and [inference] is performance to revenue, where training tends to be about capability per cost.”
This new direction reflects widely accepted practice in neoclouds and other deployments, where smaller, less expensive GPUs are used for the prefill stage of transformer workloads, reserving bigger, more expensive GPUs for the decode stage. (Nvidia calls these stages the context phase and the generation phase.)
The prefill stage ingests and analyses the incoming context data in a massively parallel, compute-bound pass. The decode stage is where tokens are generated; it is autoregressive – output tokens are produced one at a time – and the process is memory-bandwidth-bound. Using heterogeneous hardware for the two stages can maximise tokens-per-dollar for token factories, but there are subtleties – the KV cache has to be transferred between the two types of hardware quickly, for example.
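To make the distinction concrete, here is a minimal, illustrative sketch (toy numpy code, not Nvidia's implementation; the head dimension, prompt length and weights are made up) of why the two phases stress hardware differently: prefill attends over the whole prompt in one large parallel matrix multiply and hands off a KV cache, while decode extends and re-reads that cache one token at a time.

```python
# Toy single-head attention: prefill (context phase) vs. decode (generation phase).
# Prefill is one big batched matmul over all prompt tokens (compute-bound);
# decode re-reads the growing KV cache on every step (memory-bandwidth-bound).
import numpy as np

D = 64                                    # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((D, D)) * D**-0.5 for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prefill(prompt_embeddings):
    """Context phase: one parallel pass over the entire prompt."""
    q, k, v = prompt_embeddings @ Wq, prompt_embeddings @ Wk, prompt_embeddings @ Wv
    scores = q @ k.T / np.sqrt(D)
    mask = np.triu(np.full(scores.shape, -np.inf), k=1)   # causal mask
    out = softmax(scores + mask) @ v
    return out, (k, v)                                     # KV cache handed to decode

def decode_step(token_embedding, kv_cache):
    """Generation phase: one new token, but the whole cache is re-read."""
    k_cache, v_cache = kv_cache
    q = token_embedding @ Wq
    k_cache = np.vstack([k_cache, token_embedding @ Wk])
    v_cache = np.vstack([v_cache, token_embedding @ Wv])
    scores = q @ k_cache.T / np.sqrt(D)
    out = softmax(scores) @ v_cache
    return out, (k_cache, v_cache)

prompt = rng.standard_normal((1024, D))   # a 1,024-token "context"
_, cache = prefill(prompt)                # the work a prefill-optimised GPU would do
x = rng.standard_normal(D)
for _ in range(8):                        # autoregressive loop, bandwidth-hungry
    x, cache = decode_step(x, cache)
```

In disaggregated inference, the `prefill` and `decode_step` work above would run on different hardware, with the returned KV cache shipped between them – which is why the speed of that transfer matters.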
This technique, called “disaggregated inference,” can produce two to four times the token throughput for workloads like Llama on Blackwell-generation GPUs, Buck said.
The new Rubin-generation GPU designed for prefill will be called Rubin-CPX. It will offer 30 PFLOPS of NVFP4 compute and 128 GB of GDDR7 memory (not the faster, more expensive HBM), and it will triple attention performance compared with the previous-generation GB300 NVL72 thanks to new attention acceleration cores. It will also offer high-speed video codec acceleration. The GPU is a single large die.
A Vera Rubin NVL144 CPX rack will contain 144 Rubin-CPX GPUs, 144 Rubin GPUs and 36 Vera CPUs.
Buck said that $100 million of capex in CPX racks running Nvidia’s Dynamo inference cluster orchestration software could deliver as much as $5 billion in revenue for token factories – a 30-50x return on investment (ROI), depending on workload context lengths and other factors.
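As a back-of-envelope check of those headline figures (an illustrative calculation only; the 30-50x range Buck quoted varies with context length and other workload factors):

```python
# Quick sanity check on the quoted return multiple.
capex = 100e6        # $100 million invested in CPX racks
revenue = 5e9        # up to $5 billion in token revenue
print(f"return multiple: {revenue / capex:.0f}x")   # -> 50x, the top of the quoted range
```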
The biggest impact on ROI will be felt on workloads that need extremely large context lengths, Buck said, noting that many of today’s models can accept context lengths up to 256,000 tokens.
Two examples of high-value applications that need large context lengths are advanced coding chatbots, where the user enters the entire program being worked on, and video generation for entertainment, media and marketing.
Rubin-CPX will be available by the end of 2026.