Date: 2024-06-28

In this podcast, we look at artificial intelligence (AI) and data storage with Grant Caley, UK and Ireland solutions director for NetApp.

He talks about the need for storage scalability and performance, as well as hybrid cloud, access to all three hyperscalers, and the ability to move, copy and clone data for wrangling prior to inference runs.

Caley also talks about the importance of application programming interface (API) integration, a standardised data layer that can connect into Kubernetes, integration with Python, workflow platforms such as Kafka, and Nvidia microservices and frameworks such as NIM and NeMo.

Antony Adshead: From the point of view of storage, what’s different about AI workloads?

Grant Caley: Traditional enterprise workloads are fairly well-defined in terms of their characteristics and their requirements.

With AI, it’s completely different. AI starts off being very small in terms of development, but it can rapidly scale to multi-petabyte production installations that span not just on-premise but the cloud as well.

When you’re looking at it from an AI workload perspective, it’s almost completely different from a siloed, focused enterprise application. That means you’re having to cope with different performance requirements. The data capacities you have to host for AI range from gigabytes to petabytes, which brings its own challenges.

From an AI workload perspective, you’re often having to wrangle large datasets, move them around, clone them, copy them, get them ready for cleaning and inputting, and then use them for inferencing.

There’s a lot of maintenance work that goes with the requirements of AI as well. Another interesting fact is that AI is now not just an on-premise play. It’s an AWS [Amazon Web Services], Azure and Google play as well.

Customers are developing and leveraging all of those environments as well as their datacentres to deliver AI. And from what we’ve seen recently, AI is becoming the IP [intellectual property] of the company: the data it leverages and the output it produces. Security of that data is critical, as is being able to evidence the data, checkpoint it and version it, because of some of the laws that are coming in around AI.

All of that makes a massive difference to how we have to treat it. And then ultimately, if you look at AI in general compared with any enterprise workload, the actual workflow is really complex and you have to kind of factor that into how you deliver for AI. So, there’s a lot going on that’s different about workloads in an AI context.

Adshead: What does storage need to cope with AI workloads?

Caley: It kind of builds on the last answer I gave. As customers start developing AI, they often start off in the cloud because the tool sets are there – the platforms – they don’t have to spend a lot of money building environments. So, you have to be able to leverage the cloud.

But equally, a lot of customers are doing it on-premise. They’re building small GPU [graphics processing unit] platforms in servers, and they’re growing into bigger Nvidia DGX systems or SuperPods and those types of configurations.

What’s key underneath all of that from a storage perspective is the data that drives the outcomes they’re trying to achieve, whether that’s in the early development stages in the cloud, in the move to first-step production on-premise, or in how they push out data for inferencing where it’s actually needed.

That could be small factories, remote sites, whatever that happens to be. So, data mobility from the storage layer is actually key, and that means you have to not build storage silos for each of those use cases.

You have to really try to straddle those use cases and provide something that delivers data mobility. We used to talk about delivering a data fabric, but it’s that kind of interconnectivity that’s really important.

I think the other thing for AI is that it starts off low-performance when you’re doing your first early stages of training, but that can rapidly scale.

So, performance is a big factor. You need to know that the storage can deliver from the small requirements through to the productionised, at-scale requirements. And a lot of companies forget about that when they go to production. They have created these silos of different types of storage, not realising that ultimately at some point they’re going to have to scale those significantly.

And scale is another factor the storage has to deliver. As I said, it could be gigabytes in the early days, but rapidly that can become petabytes, particularly as companies bring datasets together to try and maximise the training value and the outcomes they can deliver.

But, of course, the data is the IP of the company.

You have to put that into a storage infrastructure that delivers zero-trust administration and encryption of the data, and that can make results immutable or indelible, if you’re doing versioning and evidence-based work, so that you can potentially prove the data as it was and the stages it went through.
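As a rough illustration of what being able to "prove the data as it was" can mean at the file level, the sketch below records a SHA-256 digest for every file in a dataset directory and derives one version identifier from the result. It is a minimal example under stated assumptions, not how any storage platform implements immutable snapshots: the function names (`build_manifest`, `manifest_id`, `changed_files`) are hypothetical, and a manifest like this only detects change, it does not prevent it.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(dataset_dir: Path) -> dict:
    """Map each file's relative path to its SHA-256 digest (hypothetical helper)."""
    files = sorted(p for p in dataset_dir.rglob("*") if p.is_file())
    return {p.relative_to(dataset_dir).as_posix(): file_sha256(p) for p in files}


def manifest_id(manifest: dict) -> str:
    """Derive a single version identifier from the whole manifest."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def changed_files(old: dict, new: dict) -> list:
    """List paths that were added, removed or modified between two manifests."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))
```

Storing the manifest and its identifier alongside each training run gives a record of which data a result came from; comparing two manifests with `changed_files` shows what moved between versions.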

There’s a lot of things you need to do. And I think the final thing on what data storage needs to deliver is you need to be able to deliver integration into all the tools the customer is looking to use.

They’re looking at Kubernetes workloads, delivering it through Kubernetes. They’re looking at using different frameworks on-premise and in the cloud. Your storage layer, if it’s going to deliver real value, has to integrate through APIs into all those different environments to maximise the capabilities that can be delivered from the storage layer itself.
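To make the Kubernetes integration point concrete, the sketch below builds a PersistentVolumeClaim manifest in Python, optionally asking for the new volume to be cloned from an existing claim. It is an illustration only. The helper `pvc_manifest`, the claim names and the storage class `example-ai-storage` are hypothetical, and both the `ReadWriteMany` access mode and cloning through `dataSource` work only if the cluster's storage driver supports them.

```python
import json
from typing import Optional


def pvc_manifest(name: str, size: str, storage_class: str,
                 clone_from: Optional[str] = None) -> dict:
    """Build a Kubernetes PersistentVolumeClaim manifest as a plain dict."""
    spec = {
        # Assumption: the storage driver supports shared read-write access.
        "accessModes": ["ReadWriteMany"],
        "storageClassName": storage_class,
        "resources": {"requests": {"storage": size}},
    }
    if clone_from is not None:
        # Volume cloning: request a new volume pre-populated from an
        # existing claim in the same namespace.
        spec["dataSource"] = {"kind": "PersistentVolumeClaim", "name": clone_from}
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": spec,
    }


if __name__ == "__main__":
    # Hypothetical names: clone a training dataset volume for experimentation.
    claim = pvc_manifest("training-data-clone", "500Gi",
                         "example-ai-storage", clone_from="training-data")
    print(json.dumps(claim, indent=2))
```

The printed JSON could be passed to `kubectl apply -f -`. Generating the claim in code rather than by hand is what lets a data science workflow request, clone and discard datasets without a storage administrator in the loop.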