CUDA at 20: From billion-dollar gamble to agentic AI
As Nvidia marks two decades of CUDA, its head of high-performance computing and hyperscale reflects on the platform’s journey, the power of software optimisation, and how the fusion of GPUs and LPUs will shape the future of AI
When Nvidia first showed off its Compute Unified Device Architecture (CUDA) parallel computing platform in 2006, it was a multi-billion-dollar bet that failed to turn a profit for a decade. Today, it is the dominant software stack, widely credited with much of the company’s success.
Speaking to Asia-Pacific media on the sidelines of GTC 2026 in San Jose, Ian Buck, Nvidia’s vice-president of hyperscale and high-performance computing – and the man who essentially built CUDA – reflected on the platform’s 20-year journey, its importance to Nvidia, and the innovations required to support agentic AI workloads.
Reflecting on the platform’s origins, Buck noted that achieving mass adoption required meeting developers where they already were, rather than forcing them to learn an entirely new paradigm for parallel computing.
“What made CUDA successful was that we didn’t try to invent a whole new programming language; that would have been the academic thing,” Buck explained. “The most important thing about CUDA was the C programming language. Can I take C and change it as little as possible, but let the program run on 10,000 cores for just the part where it really mattered?”
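That design philosophy is easiest to see side by side. The sketch below (illustrative only, not code from the interview) shows a standard C loop and its CUDA counterpart: the loop body is untouched, and CUDA simply swaps the loop index for a thread index so the same work can be spread across thousands of cores.

```cuda
// Plain C: one core walks the whole array.
void saxpy_c(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

// CUDA: the same body, run once per thread.
// The only change is computing "i" from the thread's position.
__global__ void saxpy_cuda(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Launch enough 256-thread blocks to cover all n elements:
// saxpy_cuda<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
```

The `__global__` qualifier and the `<<<blocks, threads>>>` launch syntax are essentially the only additions to standard C, which is the “change it as little as possible” point Buck makes above.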
Nvidia also ensured that CUDA remained compatible across different generations of graphics processing units (GPUs). According to Buck, CUDA 1.0 code written for an early GeForce GPU can run on Nvidia’s latest Vera Rubin architecture “a million times faster”.
The financial risks of marrying CUDA with Nvidia hardware were high, but the company’s leadership was determined to put CUDA in every GPU. “It cost the company billions,” Buck said. “We didn’t make money for 10 years, and we never gave up on it.”
Today, CUDA can be programmed from languages beyond C, such as Python, Fortran, and Java. The CUDA ecosystem also includes more than 1,000 CUDA-X software libraries that power a diverse range of applications, from processing data and images to predicting protein structures.
While some in the industry have questioned whether AI code generation will eventually weaken CUDA’s moat, Buck argued it is having the exact opposite effect.
“It’s actually accelerating CUDA adoption,” Buck said, noting that AI agents are increasingly being used to write and optimise CUDA code, including for kernels that run models like DeepSeek and OpenAI’s GPT-OSS, as well as CUDA-X software libraries.
“We have researchers at Nvidia who are working on Gordon Bell prizes, among other things, and they’re using Claude and Nvidia Warp,” Buck added, referring to Nvidia’s Python framework for writing high-performance simulation and graphics code.
“And their productivity has gone through the roof because the agents now have access to different libraries that they can use to solve [problems in] a particular domain. Agentic coding is a rising tide for all use cases and certainly for the adoption of accelerated computing,” he added.
Inferencing demands
As the industry ramps up on agentic AI – characterised by trillion-parameter models that process hundreds of thousands of tokens of context – Nvidia is doubling down on AI inferencing capabilities following its licensing of Groq’s language processing unit (LPU) technology in late 2025.
Buck described the LPU as a “booster pack” to Vera Rubin, leveraging extremely fast on-chip SRAM for matrix math. However, LPUs, each of which has just 500MB of SRAM, cannot operate efficiently on their own for massive models due to memory constraints.
“Trying to run a trillion-parameter model with just an LPU would take dozens of racks, and it’s simply not economical to bring to scale,” Buck explained. “By combining an LPX rack [Nvidia’s LPU-based system] with a Vera Rubin rack, all the attention calculations for every token can happen on the GPU while the matrix math can happen on the LPU, on every layer of the model for every token.”
But unlike GPUs, which rely on massive parallel bandwidth and rich pipelines to keep compute flowing and hide latency, Groq’s LPUs rely on strict scheduling.
“Today, Groq has an amazing compiler that can schedule and program the compute units inside of an LPU chip,” Buck said. Operating at 1,000 tokens per second requires a scheduled architecture with precise timing to ensure every piece of data and compute is ready at exactly the right nanosecond, he explained.
Nvidia’s ultimate goal is to make all its platforms broadly programmable. “It is our intention to open up the programming environment of the LPU. On how we do that in CUDA, or in general, we’ll talk about it in the future,” Buck said.
Programmability over custom silicon
Despite the rise of specialised AI chips, Buck defended Nvidia’s commitment to the general-purpose programmability of its chips, pointing to significant performance gains achievable through software optimisation alone.
Buck revealed that a team of 400 Nvidia software engineers recently spent four months optimising the open-weight DeepSeek-R1 model on the GB200 Grace Blackwell system. By implementing 38 major software optimisations – including kernel fusions and tensor parallelism – and using the NVFP4 (four-bit floating point) format, they drastically improved efficiency.
“We increased the performance of DeepSeek-R1 by four times on the same GPU infrastructure. We just increased the revenue of every GB200 by four times, all in software,” Buck said, noting that performance improvements translate directly into revenue for enterprises.
“We can specialise, we can tape out a chip and hard-bake [a model] in,” he added. “But you’re going to miss that opportunity – and the world’s opportunity – to innovate and figure out those new algorithms and techniques. By the way, 95% of the optimisations and things we figured out apply to every model in the ecosystem. And we’ll help define the next model to be even smarter and give it a new starting point.”
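Kernel fusion, one of the 38 optimisations mentioned above, is worth unpacking. The hedged sketch below (an illustration of the general technique, not Nvidia’s actual DeepSeek-R1 code) shows why fusing two kernels into one helps: each separate kernel reads and writes every element through GPU memory, so merging them halves the memory traffic for this bandwidth-bound pattern.

```cuda
// Unfused: two kernel launches, two full passes over the data.
__global__ void scale(int n, float a, float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;           // read x, write x
}
__global__ void add_bias(int n, float b, float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b;           // read x again, write x again
}

// Fused: one kernel, one read and one write per element.
__global__ void scale_add_bias(int n, float a, float b, float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i] + b; // same result, half the memory traffic
}
```

Real inference stacks apply the same idea at much larger scale, fusing chains of operations such as matrix multiplies, activations, and normalisation into single kernels.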
Although Nvidia is often seen as a chipmaker, it is equally a software company, with its hardware and software deeply co-designed. For every iteration of its AI technology stack, a single Nvidia architecture team works not just on the GPU, but also on the optimisations, ecosystem software, and frameworks such as PyTorch and SGLang.
“The benefit of having thousands of software and kernel engineers reporting to the same team that builds the chip means they don’t just go away after they’re done,” he added. “They’ll continue working with the likes of OpenAI, Anthropic, and Microsoft to continuously improve kernel performance.”
During his GTC 2026 keynote address, Nvidia CEO Jensen Huang echoed this sentiment, noting that CUDA is more than just a programming platform; it is the engine of a self-sustaining ecosystem in what he dubbed the “CUDA flywheel”.
“It’s taken us 20 years to build up hundreds of millions of GPUs and computing systems around the world that run CUDA,” Huang said, noting that this installed base attracts developers, drives breakthroughs like deep learning, and opens up new markets.
Because the software is continuously updated and backward compatible, the useful life of an Nvidia GPU is extended, driving down the cost of computing over time, he added. “This combination of dynamics is what helps the Nvidia architecture expand its reach and accelerate new growth.”
Read more about AI in APAC
- Following the viral success of OpenClaw and product launches from Nvidia and Tencent, Alibaba has unveiled an agentic AI platform that integrates with DingTalk to orchestrate business workflows.
- Moving AI from experiment to production requires high-quality, real-time data streaming. Australian tech leaders from Confluent, Bendigo Bank, Telstra and Coles share how they are turning systems of record into systems of action.
- While global memory shortages will pose a threat to the broader PC market, AI PCs are gaining momentum across Asia as companies look to cut cloud costs, boost productivity and secure sensitive information.
- While firms in mature markets are using AI agents to automate routine tasks, those in emerging markets, where the cost of the technology is higher than that of human labour, are favouring revenue-generating use cases.
