As AI agents begin to move faster than software built for human users, both digital tooling and silicon architecture need to be redesigned to reduce latency and power bottlenecks, according to the chief scientists of Nvidia and Google.
Current software tools – from compilers to productivity applications – will need a fundamental redesign to keep pace with artificial intelligence (AI) agents working at machine speed, according to the chief scientists of Google and Nvidia.
Speaking at the recent Nvidia GTC 2026 conference in San Jose, Google chief scientist Jeff Dean noted that while human developers rarely stress over the startup time of a C compiler, traditional tooling will introduce significant latency in a world where AI agents operate far faster than humans.
Coding tools are already undergoing this shift, Dean said, adding that business applications must follow suit. This will allow agents to manipulate spreadsheets and documents to extract information programmatically, he explained during a wide-ranging conversation with Nvidia chief scientist Bill Dally on advancing the next frontier of AI.
The discussion provided rare insights into the roadmaps of Google and Nvidia as prompt-and-wait AI progresses to agentic systems capable of course-correcting, negotiating, and even designing their own successors.
Autonomous R&D
For Google, advancing to AI’s next frontier means empowering models to act as autonomous R&D laboratories.
Dally pressed Dean on how close the industry is to an AI model that can experiment, curate data, and train the next version of itself. While Dean conceded the ability to do so is “not quite there yet”, he pointed to the emergence of neural architecture search, which allows users to automate the design of neural networks.
“You can specify research spaces in natural language, like ‘please explore interesting new distillation algorithms and try to use information we’re not currently using,’” Dean said. “And it will go off and do those experiments. It’s basically a super-powerful multiplier for research and productivity.”
Achieving this will require models to break free of the limits of static pre-training. Instead of learning from the entirety of the internet’s data all at once, a model could take actions or predict answers in some environment before returning to learning, dramatically improving learning efficiency, Dean explained.
‘Speed of light’ inference
With inference expected to account for most AI workloads, Nvidia is aggressively targeting communication latency to give AI agents the ability to “think” without pausing.
“As you get down to the right side of that curve, where you’re really optimising for latency, it turns out that the bulk of the delay is communication,” Dally said. “At Nvidia, we always refer to the speed of light.”
To alleviate the need for digital signal processing and error correction, Dally revealed that Nvidia is experimenting with simplified router architectures that sacrifice bandwidth – dropping from 400 to 200 gigabits per second – for latency improvements. The aim is to drive router latency down to under 50 nanoseconds.
I’d love to simply say, ‘Design me a new GPU,’ and I’ll go out skiing for a couple of days, and when I come back it’s done
Bill Dally, Nvidia
“By doing that, I can see us running relatively big models at 10,000 to 20,000 tokens per second,” he said.
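The rough arithmetic behind those figures is worth spelling out: when a single stream of tokens is generated serially, the per-token latency budget is simply the reciprocal of the throughput, which is why shaving tens of nanoseconds off each network hop matters. This is an illustrative calculation, not Nvidia's published numbers:

```python
# At N tokens per second generated serially, each token must complete
# in 1/N seconds - every nanosecond of router latency eats into that budget.
def per_token_budget_us(tokens_per_second: float) -> float:
    """Per-token latency budget in microseconds for a serial token stream."""
    return 1e6 / tokens_per_second

print(per_token_budget_us(10_000))  # 100.0 microseconds per token
print(per_token_budget_us(20_000))  # 50.0 microseconds per token
```

At 20,000 tokens per second, a token's entire forward pass, including every inter-chip hop, must fit inside 50 microseconds, so a sub-50-nanosecond router leaves room for hundreds of hops per token.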
‘Don’t move the data’
As AI consumes vast amounts of energy, Dally offered a blunt solution for reducing energy consumption: “Don’t move the data. People are laughing, but I’m serious. That’s absolutely what you have to do.”
Dally explained that doing a multiply-add calculation for a low-precision NVFP4 operation uses only 10 femtojoules of energy. However, pulling the necessary data from external memory consumes about 1,000 times that amount.
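A back-of-the-envelope sketch shows why that ratio makes memory traffic, not arithmetic, the energy bottleneck. The constants below are the rough figures from Dally's talk, not exact hardware specifications:

```python
# Rough per-operation energy figures quoted by Dally (illustrative only):
# an NVFP4 multiply-add costs ~10 fJ, while fetching its operands from
# external DRAM costs roughly 1,000x more.
FJ_PER_NVFP4_MAC = 10.0                        # femtojoules per multiply-add
FJ_PER_DRAM_FETCH = FJ_PER_NVFP4_MAC * 1_000   # ~10,000 fJ per fetch

def energy_fj(macs: int, dram_fetches: int) -> float:
    """Total energy in femtojoules for a mix of compute and memory traffic."""
    return macs * FJ_PER_NVFP4_MAC + dram_fetches * FJ_PER_DRAM_FETCH

# If every multiply-add pulled one operand from DRAM, memory would
# dominate the energy budget by three orders of magnitude:
compute_only = energy_fj(macs=1_000_000, dram_fetches=0)
memory_bound = energy_fj(macs=1_000_000, dram_fetches=1_000_000)
print(memory_bound / compute_only)  # 1001.0
```

Under these assumptions the maths itself is almost free; nearly all the energy goes into moving bits, which is exactly the case Dally makes for not moving the data.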
To solve this, Nvidia is exploring advanced 3D stacking technologies that physically fuse memory and compute. “Most of the energy used in reading DRAM isn’t actually reading the DRAM – it’s moving the bit from where you read the DRAM over to where the pins to the GPU are,” Dally said.
“By stacking the DRAM directly on top of the GPU doing the computation, we can get an order of magnitude more bandwidth with less energy per bit. That winds up being the same power, but with way more performance.”
Beyond hardware innovations, taming the AI power crisis will require massive improvements in algorithmic efficiency. “If you can get the same accuracy with less work, that reduces energy also,” he added.
Dally highlighted sparsity – the technique of skipping calculations for parameters that don’t significantly affect a model’s final output – as a massive opportunity for power reduction. Nvidia introduced 2:4 structured sparsity, which zeroes two values in every block of four weights, with its Ampere architecture, and current mixture-of-experts models use a coarse form of sparsity to save compute.
However, Dally warned that pushing for higher levels of sparsity destroys the highly regular, predictable computation patterns that make GPUs so efficient. “When you disrupt that, you need to have much more control and data routing to deal with the irregular nature,” he said.
AI building its own infrastructure
This agentic future is already taking shape within Nvidia’s and Google’s own engineering teams, where AI is designing the next generation of silicon.
Dean pointed to Google’s success with using AI for placement and routing in chip design – citing its acclaimed AlphaChip research – while Dally elaborated on Nvidia’s use of AI across its design pipeline.
One of Nvidia’s most successful internal tools is NVCell, a reinforcement learning programme. Every time the company moves to a new semiconductor process, engineers must port a standard cell library of up to 3,000 cells. “It used to take a team of eight people about 10 months,” Dally said. “We developed a programme based on reinforcement learning, and the results are actually better than human designs.”
Beyond physical chip layouts, Nvidia has deployed a custom large language model dubbed ChipNeMo to boost engineering productivity.
Trained on Nvidia’s proprietary hardware design documents, ChipNeMo acts as a mentor for junior engineers, saving senior designers from having to explain the basic functions of specific chip components. It can also summarise bug reports and automatically route them to the right designers for resolution.
Dally hopes AI can eventually automate the most time-consuming parts of chip development. “I’d love to simply say, ‘Design me a new GPU,’ and I’ll go out skiing for a couple of days, and when I come back it’s done,” he said, though he admitted that reality is still a long way off.
Even when that day arrives, Dally expects AI chip designers to rely on a master agent orchestrating specialised sub-agents that negotiate with one another to work out the architecture, replicating the very meetings human engineers hold today.
Read more about AI in APAC
DayOne and Cortical Labs are bringing ‘wetware’ computing to Singapore, using living neurons grown from stem cells to support the demand for AI while addressing sustainability concerns.
Singtel and Nvidia have teamed up on a multimillion-dollar facility to help organisations scale enterprise AI deployments, tackle extreme datacentre power densities, and prepare for the era of embodied AI.
Following the viral success of OpenClaw and product launches from Nvidia and Tencent, Alibaba has unveiled an agentic AI platform that integrates with DingTalk to orchestrate business workflows.