
Forget training, find your killer apps during AI inference

Pure Storage executives talk about why most artificial intelligence projects are about inference in production, and why that means storage must respond to capacity demands and help optimise data management

Most organisations will never train their own AI models. Instead, for most customers the key challenge in AI lies in applying it to production applications and inference, with fine-tuning and data curation the core tasks.

Key here are the use of retrieval augmented generation (RAG) and vector databases, the ability to reuse AI prompts, and co-pilot capabilities that allow users to query corporate information in natural language.

Those are the views of Pure Storage execs who spoke to Computerweekly.com this week at the company’s Accelerate event in London.

Naturally, the key tasks identified fit well with areas of functionality added recently to Pure’s storage hardware offer – including its recently launched Key Value Accelerator – and also with its ability to provide capacity on demand.

But they also illustrate the key challenges for organisations tackling AI at this stage in its maturity, which has been called a “post-training phase”.

In this article, we look at what customers need from storage when AI is in production, with ongoing data ingestion and inference taking place.

Don’t buy GPUs; they’re changing too quickly

Most organisations won’t train their own AI models because it’s simply too expensive at the moment. That’s because GPU hardware is incredibly costly to buy and also because it is evolving at such a rapid pace that obsolescence comes very soon.

So, most organisations now tend to buy GPU capacity in the cloud for training phases.

It’s pointless trying to build in-house AI training farms when GPU hardware can become obsolete within a generation or two.

That’s the view of Pure Storage founder and chief visionary officer John “Coz” Colgrove.

“Most organisations say, ‘Oh, I want to buy this equipment, I’ll get five years of use out of it, and I’ll depreciate it over five or seven years,’” he said. “But you can’t do that with the GPUs right now.

“I think when things improve at a fantastic rate, you’re better off leasing instead of buying. It’s just like buying a car,” said Colgrove. “If you’re going to keep it for six, seven, eight years or more, you buy it, but if you’re going to keep it for two years and change to a newer one, you lease it.”

Find your AI killer app

For most organisations, practical exploitation of AI won’t happen in the modelling phase. Instead, it’s going to come where they can use it to build a killer app for their business.

Colgrove gives the example of a bank. “With a bank we know the killer app is going to be something customer facing,” he said. “But how does AI work right now? I take all my data out of whatever databases I have for interacting with the customer. I suck it into some other system. I transform it like an old ETL batch process, spend weeks training on it and then I get a result.

“That is never going to be the killer app,” said Colgrove. “The killer app will involve some kind of inferencing I can do. But that inferencing is going to have to be applied in the regular systems if it’s customer facing.

“That means when you actually apply the AI to get value out of it, you’ll want to apply it to the data you already have, the things you’re already doing with your customers.”

In other words, for most customers the challenges of AI lie in the production phase: more precisely, the ability to rapidly curate and add data, and to run inference on it to fine-tune existing AI models, and then to do it all again when the next idea for improvement comes along.

Pure Storage EMEA field chief technology officer Fred Lherault summed it up thus: “So it’s really about how do I connect models to my data? Which first of all means, have I done the right level of finding what my data is, curating my data, making it AI ready, and putting it into an architecture where it can be accessed by a model?”

Key tech underpinnings of agile AI

So, the inference phase has emerged as the key focus for most AI customers. Here, the challenge is to curate and manage the data needed to build and iterate on AI models throughout their production lifetime. That means customers connecting to their own data in an agile fashion.

This means the use of technologies that include vector databases, RAG pipelines, co-pilot capability, and prompt caching and reuse.
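To make that concrete, the sketch below shows what a minimal RAG retrieval step can look like in Python. Everything in it is illustrative rather than Pure-specific: the embed() function is a stand-in for a real embedding model, the “vector database” is just an in-memory index, and the document chunks are invented.

import hashlib
import numpy as np

EMBED_DIM = 768  # typical embedding width; an illustrative assumption

def embed(text: str) -> np.ndarray:
    # Stand-in embedding: a deterministic pseudo-random unit vector per text.
    # A real pipeline would call an embedding model here.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")
    vec = np.random.default_rng(seed).standard_normal(EMBED_DIM)
    return vec / np.linalg.norm(vec)

# Curated, "AI-ready" document chunks (invented examples)
chunks = [
    "Refund policy: customers can return goods within 30 days.",
    "Branch opening hours are 9am to 5pm on weekdays.",
    "Mortgage applications require proof of income and ID.",
]
index = np.stack([embed(c) for c in chunks])  # the vector index

def retrieve(question: str, k: int = 2) -> list[str]:
    # Return the k chunks most similar to the question (cosine similarity)
    scores = index @ embed(question)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# Retrieved chunks are stitched into the prompt sent to the model
context = "\n".join(retrieve("How long do I have to return a purchase?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."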

The key challenges for storage here are twofold. It means being able to connect to RAG data sources and vector databases, for example. It also means being able to handle big jumps in storage capacity, and to reduce the need for them. The two are often connected.

“An interesting thing happens when you put your data into vector databases,” said Lherault. “There’s some computation required, but then the data gets augmented with vectors that can then be searched. That’s the whole goal of the vector database, and that augmentation can sometimes result in a 10x amplification of data.

“If you’ve got a terabyte of source data you want to use with an AI model, it means you’ll need a 10TB database to run it,” he said. “There’s all of that process that is new for many organisations when they want to use their data with AI models.”
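As a back-of-the-envelope illustration of that amplification, the short calculation below uses assumed figures for chunk size, embedding width and indexing overhead; they are not numbers from Pure.

# Rough estimate of how source data grows once chunked, embedded and indexed
# in a vector database. All parameters are illustrative assumptions.
source_bytes = 1 * 10**12        # 1TB of source text
chunk_bytes = 1_000              # roughly 1KB of text per chunk
embed_dim = 1536                 # embedding width of a typical large model
bytes_per_float = 4              # float32 vectors
overhead = 1.5                   # stored text copies, metadata, index structures

chunks = source_bytes // chunk_bytes
vector_bytes = chunks * embed_dim * bytes_per_float
total_bytes = (source_bytes + vector_bytes) * overhead

print(f"{chunks:,} chunks")
print(f"raw vectors: {vector_bytes / 10**12:.1f}TB")
print(f"estimated database size: {total_bytes / 10**12:.1f}TB "
      f"(~{total_bytes / source_bytes:.0f}x the source data)")

On these assumptions, 1TB of source data lands at roughly 10TB once indexed, which is the order of magnitude Lherault describes.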

Deal with demands on storage capacity

Such capacity jumps can also occur during tasks such as checkpointing, which can see huge volumes of data created as snapshot-like points to roll back to in AI processing.
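Checkpointing here follows the pattern familiar from model training: periodically persisting the full state of a job so it can be resumed or rolled back. A minimal PyTorch-style sketch, with a placeholder model and loop, shows why the written volumes mount up.

# Minimal checkpointing sketch (placeholder model and loop). Every checkpoint
# persists the full model and optimiser state, so for large models each one
# can run to hundreds of gigabytes, and a long run writes many of them.
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096)                # stand-in for a much larger model
optimizer = torch.optim.AdamW(model.parameters())
CHECKPOINT_EVERY = 100                       # steps between checkpoints

for step in range(1, 1001):
    # ... forward pass, loss, backward pass and optimizer.step() go here ...
    if step % CHECKPOINT_EVERY == 0:
        torch.save(
            {
                "step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
            },
            f"checkpoint_{step:06d}.pt",
        )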

Pure aims to tackle these with its Evergreen as-a-service model, which allows customers to add capacity rapidly.

The company also suggests ways to keep storage volumes from rising too rapidly, as well as to improve performance.

Its recently introduced Key Value Accelerator allows customers to store AI prompts so they can be reused. Ordinarily, an LLM would access cached tokens representing previous responses, but GPU cache is limited, so answers often need to be recalculated. Pure’s KV Accelerator allows tokens to be held in its storage in file or object format.

That can speed responses by up to 20x, said Lherault. “The more you start having users asking different questions, the faster you run out of cache,” he added. “If you’ve got two users asking the same question at the same time and do that on two GPUs, they both have to do the same computation. It’s not very efficient.

“We’re allowing it to actually store those pre-computed key values on our storage so the next time someone asks a question that’s already been asked or requires the same token, if we’ve got it on our side, the GPU doesn’t need to do the computation,” said Lherault.

“It helps to reduce the number of GPUs you need, but also on some complex questions that generate thousands of tokens, we’ve seen sometimes the answer coming 20 times faster.”
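Conceptually, what Lherault describes is a cache keyed on the prompt, held on shared storage rather than in GPU memory. The sketch below illustrates that pattern only: it caches whole responses rather than attention key-value tensors, the file layout and cache directory are invented, and it is not the KV Accelerator’s actual interface.

# Conceptual sketch of prompt-keyed caching on shared storage (illustrative
# only, not Pure's KV Accelerator interface). Results are stored as files
# keyed by a hash of the prompt, so a repeated question skips the GPU work.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("kv-cache")  # in practice this would be a shared flash-backed mount

def cached_generate(prompt: str, generate_fn) -> str:
    # Return a cached result if one exists; otherwise compute and store it.
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["result"]   # cache hit: no GPU work
    result = generate_fn(prompt)                         # cache miss: run the model
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"prompt": prompt, "result": result}))
    return result

# Two users asking the same question trigger only one computation
answer1 = cached_generate("What is our refund policy?", lambda p: "Returns within 30 days.")
answer2 = cached_generate("What is our refund policy?", lambda p: "Returns within 30 days.")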
