Nvidia Launches First GPU Purpose-Built for Million-Token AI Inference

The Rubin CPX GPU brings million-token context windows to coding and generative video, with 8 exaflops of AI performance in a single rack.

News September 18, 2025 by Joshua Tidwell

Artificial intelligence has reached a point where models are expected to not only generate text or images, but also reason over entire codebases, lengthy documents, or hours of video. That kind of workload doesn’t just test an algorithm’s smarts. Standard GPUs, designed for shorter sequences, choke when asked to keep track of millions of tokens at once. Developers working with advanced coding assistants or long-form video generation have hit the ceiling of what traditional systems can handle.

Nvidia’s Rubin CPX

AI workloads are outgrowing traditional GPUs. Nvidia’s Rubin CPX is built to handle million-token inference for codebases and long-form video.

At its AI Infra Summit this week, Nvidia unveiled the Rubin CPX GPU, the first processor built specifically to solve this scaling problem. Rather than tweaking existing architectures, Rubin CPX is designed from the ground up for what Nvidia calls “massive-context inference.” In practice, this means AI systems powered by Rubin CPX can parse a million tokens or more in real time—enough to capture the detail of an entire software repository or a full-length film without losing coherence.

What Sets Rubin CPX Apart

Rubin CPX sits at the heart of Nvidia's new Vera Rubin NVL144 CPX platform. A single rack combines 144 of these GPUs with Rubin GPUs and Vera CPUs, delivering an eye-catching 8 exaflops of AI compute—about 7.5 times more performance than the previous-generation GB300 NVL72. With 100 terabytes of memory and bandwidth peaking at 1.7 petabytes per second, the platform is built to feed large models without bottlenecks.

Rubin CPX

Rubin CPX accelerates the compute-heavy context stage in disaggregated inference.

At the chip level, Rubin CPX delivers up to 30 petaflops of compute using NVFP4 precision, a format Nvidia optimized for efficient inference. It pairs that with 128 GB of GDDR7 memory, tripling attention speeds compared to GB300-based systems. Hardware support for both video decoding and encoding means developers working on generative video don’t need extra accelerators.

Who’s Using It First

Several early adopters are already putting Rubin CPX to work. Cursor, a startup developing an AI-driven code editor, expects the GPU to shift software engineering from basic code suggestions to true project-wide reasoning. By accelerating code generation and analysis, the platform aims to boost developer productivity and streamline the entire creation process.

In the creative space, Runway is targeting longer-form generative video projects. The company views Rubin CPX as a performance leap that enables faster rendering, more realistic effects, and new degrees of flexibility in video production. For individual artists, this could mean shorter turnaround times, while large studios may gain the ability to build entirely new visual effects workflows.

AI research company Magic is looking at the software engineering side. CEO Eric Steinberger explained that Rubin CPX allows their models to load a hundred million tokens of code, history, and documentation in one go, which allows fine-tuning. Engineers can then guide the model through direct interaction.

Economics and Infrastructure

Nvidia presents Rubin CPX as both a technical advance and a financial opportunity, estimating that every $100 million invested in the platform could yield up to $5 billion in token revenue. That kind of return, if borne out in real deployments, would reset the benchmarks for AI investment.

Rubin CPX combines specialized hardware, full-stack software, and ecosystem support

Rubin CPX combines specialized hardware, full-stack software, and ecosystem support to make long-context AI practical and economically viable.

Instead of handling everything on a single chip, Nvidia's Rubin CPX is designed for a split approach: Rubin CPX tackles the context-heavy front end while Rubin GPUs and Vera CPUs manage the generation phase. Think of it as dividing up the work so each processor plays to its strengths. Nvidia's Dynamo software keeps the handoff smooth, moving data quickly between memory and compute so the system doesn’t stumble.

Built for Nvidia's AI Ecosystem

Rubin CPX arrives as part of a broader rollout and is packaged with Nvidia's complete AI stack. That includes the Dynamo platform to coordinate workloads and the Nemotron family of multimodal models designed for advanced reasoning.. Enterprises can deploy through Nvidia AI Enterprise, which bundles frameworks, NIM microservices, and support tools. For developers, the Cuda-X ecosystem spans six million coders and thousands of applications, allowing the GPU to be slotted into existing pipelines without a steep learning curve.

The challenge of long-context AI has been obvious for years: models couldn’t hold onto enough information to feel truly intelligent across extended tasks. With Rubin CPX, Nvidia is making a bold attempt to fix that bottleneck at the hardware level. If it succeeds, software engineering, video production, and multimodal AI could all shift gears, moving from clever assistants to systems that really understand the full scope of a project.

All images used courtesy of Nvidia.