What are the storage requirements for AI training and inference?

Storage for AI must cope with huge volumes of data that can multiply rapidly as vector data is created, plus lightning-fast I/O requirements and the needs of agentic AI

Despite ongoing speculation around an investment bubble that may be set to burst, artificial intelligence (AI) technology is here to stay. And while an over-inflated market may exist at the level of the suppliers, AI is well-developed and has a firm foothold among organisations of all sizes.

But AI workloads place specific demands on IT infrastructure and on storage in particular. Data volumes can start big and then balloon, in particular during training phases as data is vectorised and checkpoints are created. Meanwhile, data must be curated, gathered and managed throughout its lifecycle.

In this article, we look at the key characteristics of AI workloads, the particular demands of training and inference on storage I/O, throughput and capacity, whether to choose object or file storage, and the storage requirements of agentic AI.

What are the key characteristics of AI workloads?

AI workloads can be broadly categorised into two key stages: training and inference.

During training, processing focuses on what is effectively pattern recognition. Large volumes of data are examined by an algorithm – likely part of a deep learning framework like TensorFlow or PyTorch – that aims to recognise features within the data.

This could be visual elements in an image or particular words or patterns of words within documents. These features, which might fall under the broad categories of “a cat” or “litigation”, for example, are given values and stored in a vector database.

The assigned values provide further detail. So, for example, “a tortoiseshell cat” would comprise discrete values for “cat” and “tortoiseshell” that together make up the whole concept and allow comparison and calculation between images.
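
To make the idea concrete, the sketch below uses made-up three-dimensional vectors to show how separate feature values might be combined into one stored representation. In practice, embeddings are produced by a trained model and run to hundreds or thousands of dimensions; the numbers here are purely illustrative.

```python
import numpy as np

# Illustrative sketch only: real embeddings come from a trained model,
# not hand-assigned numbers, and have far more dimensions.
# Here, each dimension stands for a learned feature such as "cat-ness"
# or "tortoiseshell colouring".
cat = np.array([0.92, 0.05, 0.10])            # strong "cat" signal
tortoiseshell = np.array([0.01, 0.88, 0.07])  # strong "tortoiseshell" signal

# The concept "a tortoiseshell cat" combines both feature values into a
# single vector that could then be stored in a vector database.
tortoiseshell_cat = cat + tortoiseshell
print(tortoiseshell_cat)  # approximately [0.93 0.93 0.17]
```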

Once the AI system is trained on its data, it can then be used for inference – literally, to infer a result from production data that can be put to use for the organisation.

So, for example, we may have an animal-tracking camera and we want it to alert us when a tortoiseshell cat crosses our garden. To do that, it would infer whether or not a cat is present, and whether it is tortoiseshell, by reference to the dataset built during the training described above.
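
One common way that comparison is made is to embed the new image and measure how close it sits to the vectors stored during training, for example with cosine similarity. The snippet below is a minimal sketch of that step, again using invented numbers rather than a real model or vector database; the similarity threshold is a hypothetical choice.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means the vectors point the same way,
    # 0.0 means they are unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vector produced from the camera image by the (hypothetical) trained model.
new_image = np.array([0.90, 0.85, 0.12])

# Reference vector built during training for "tortoiseshell cat".
tortoiseshell_cat = np.array([0.93, 0.93, 0.17])

if cosine_similarity(new_image, tortoiseshell_cat) > 0.9:  # illustrative threshold
    print("Alert: tortoiseshell cat in the garden")
```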

But, while AI processing falls into these two broad categories, it is not necessarily so clear cut in real life. Training will always be carried out on an initial dataset, but after that, while inference is an ongoing process, training is also likely to become perpetual as new data is ingested and new inference results from it.

So, to labour the example, our cat-garden-camera system may record new cats of unknown types and begin to categorise their features and add them to the model.

What are the key impacts on data storage of AI processing?

At the heart of AI hardware are specialised chips called graphics processing units (GPUs). These do the grunt processing work of training and are incredibly powerful, costly and often difficult to procure. For these reasons, their utilisation rates are a major operational IT consideration: storage must be able to handle their I/O demands so they are optimally used.

Therefore, data storage that feeds GPUs during training must be fast, so it’s almost certainly going to be built with flash storage arrays.

Another key consideration is capacity. That’s because AI datasets can start big and get much bigger. As datasets undergo training, the conversion of raw information into vector data can see data volumes expand by up to 10 times.
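
A rough, back-of-envelope calculation shows why. Suppose documents are split into chunks and each chunk is given a 1,536-dimension embedding stored as 32-bit floats; the corpus size, chunk size and dimensions below are assumptions chosen purely for illustration, not benchmarks.

```python
# Back-of-envelope capacity estimate - all figures are illustrative assumptions.
raw_corpus_tb = 2.0          # size of the raw text corpus in TB
avg_chunk_bytes = 2_000      # average size of one text chunk
embedding_dims = 1_536       # dimensions per embedding vector
bytes_per_dim = 4            # 32-bit floats

chunks = raw_corpus_tb * 1e12 / avg_chunk_bytes
vector_tb = chunks * embedding_dims * bytes_per_dim / 1e12

print(f"{chunks:.2e} chunks -> {vector_tb:.1f} TB of vectors")
# ~1e9 chunks -> roughly 6 TB of vector data, before indexes, metadata,
# duplicated copies and checkpoints are added on top.
```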

Also, during training, checkpointing is carried out at regular intervals, often after every “epoch” or pass through the training data, or after changes are made to parameters.

Checkpoints are similar to snapshots, and allow training to be rolled back to a point in time if something goes wrong so that existing processing does not go to waste. Checkpointing can add significant data volume to storage requirements.
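
As a concrete illustration, a typical PyTorch training loop writes a checkpoint along the following lines at the end of each epoch; the model, optimiser and output path here are hypothetical placeholders rather than any particular deployment.

```python
import torch

# Minimal sketch of per-epoch checkpointing in PyTorch.
# "model", "optimizer" and the output path are hypothetical placeholders.
def save_checkpoint(model, optimizer, epoch, loss, path):
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),          # learned weights
        "optimizer_state_dict": optimizer.state_dict(),  # optimiser state, e.g. Adam moments
        "loss": loss,
    }, path)

# In the training loop:
# for epoch in range(num_epochs):
#     train_one_epoch(...)
#     save_checkpoint(model, optimizer, epoch, loss, f"/checkpoints/epoch_{epoch}.pt")
```

Each such file holds the full set of model weights plus optimiser state, which is why checkpoint volumes mount up quickly over a long training run.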

So, sufficient storage capacity must be available, and will often need to scale rapidly.

What are the key impacts of AI processing on I/O and capacity in data storage?

The I/O demands of AI processing on storage are huge. It is often the case that model data in use will simply not fit into a single GPU’s memory and so is parallelised across many GPUs.

Also, AI workloads and I/O differ significantly between training and inference. As we’ve seen, the massive parallel processing involved in training requires low latency and high throughput.

While low latency is a universal requirement during training, throughput demands may differ depending on the deep learning framework used. PyTorch, for example, stores model data as a large number of small files while TensorFlow uses a smaller number of large model files.

The framework used can also affect capacity requirements. TensorFlow checkpointing tends towards larger file sizes, plus dependent data states and metadata, while PyTorch checkpointing can be more lightweight. TensorFlow deployments tend to have a larger storage footprint generally.

If the model is parallelised across numerous GPUs, this affects checkpoint writes and restores, which means storage I/O must be up to the job.
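
A rough calculation shows the scale involved. For a model with, say, seven billion parameters trained in 32-bit precision with the Adam optimiser, which keeps two extra values per parameter, a single full checkpoint is already substantial; the model size and precision below are assumptions chosen for illustration.

```python
# Illustrative checkpoint-size estimate - model size and precision are assumptions.
params = 7e9            # 7 billion parameters
bytes_per_param = 4     # 32-bit floats
optimizer_factor = 2    # Adam keeps two extra states per parameter

weights_gb = params * bytes_per_param / 1e9
checkpoint_gb = weights_gb * (1 + optimizer_factor)

print(f"weights: {weights_gb:.0f} GB, full checkpoint: {checkpoint_gb:.0f} GB")
# weights: 28 GB, full checkpoint: 84 GB - written repeatedly, every epoch
# or more often, and read back in full when a run has to be restored.
```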

Does AI processing prefer file or object storage?

While AI infrastructure isn’t necessarily tied to one or other storage access method, object storage has a lot going for it.

Most enterprise data is unstructured and exists at scale, and it is often what AI has to work with. Object storage is supremely well suited to unstructured data because of its ability to scale. It also comes with rich metadata capabilities that can help data discovery and classification before AI processing begins in earnest.

File storage stores data in a tree-like hierarchy of files and folders. That can become unwieldy to access at scale. Object storage, by contrast, stores data in a “flat” structure, by unique identifier, with rich metadata. It can mimic file and folder-like structures by addition of metadata labels, which many will be familiar with in cloud-based systems such as Google Drive, Microsoft OneDrive and so on.
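
As an illustration of that flat, metadata-rich model, the sketch below writes an object to an S3-compatible store using the boto3 library; the bucket name, key and metadata labels are hypothetical, and the slash-separated key only mimics a folder path rather than creating one.

```python
import boto3

# Minimal sketch of storing an object with descriptive metadata in an
# S3-compatible object store. Bucket, key and labels are hypothetical.
s3 = boto3.client("s3")

s3.put_object(
    Bucket="ai-training-data",            # hypothetical bucket
    Key="images/cats/cat_0001.jpg",       # flat key that mimics a folder path
    Body=open("cat_0001.jpg", "rb"),
    Metadata={                            # rich, user-defined metadata
        "label": "cat",
        "coat": "tortoiseshell",
        "source": "garden-camera",
    },
)
```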

Object storage can, however, be slow to access and lacks file-locking capability, though this is likely to be of less concern for AI workloads.

What impact will agentic AI have on storage infrastructure?

Agentic AI uses autonomous AI agents that can carry out specific tasks without human oversight. They are tasked with autonomous decision-making within specific, predetermined boundaries.

Examples would include the use of agents in IT security to scan for threats and take action without human involvement, to spot issues and initiate actions in a supply chain, or in a call centre to analyse customer sentiment, review order history and respond to customer needs.

Agentic AI is largely an inference-phase phenomenon, so compute infrastructure will not need to be up to training-type workloads. Having said that, AI agents will potentially access multiple data sources across on-premises systems and the cloud. That will cover the range of potential types of storage in terms of performance.

But, to work at its best, agentic AI will need high-performance, enterprise-class storage that can handle a wide variety of data types with low latency and with the ability to scale rapidly. That’s not to say datasets in less performant storage cannot form part of the agentic infrastructure. But if you want your agents to work at their best you’ll need to provide the best storage you can.
