Computational storage: OctoML - A tale of two workloads

Software runs on data and data is often regarded as the new oil. So it makes sense to put data as close to where it is being processed as possible, in order to reduce latency for performance-hungry processing tasks.

Some architectures call for big chunks of memory-like storage located near the compute function, while in other cases it makes more sense to move the compute nearer to the bulk storage.

In this series of articles we explore the architectural decisions driving modern data processing… and, specifically, we look at computational storage.

The Storage Networking Industry Association (SNIA) defines computational storage as follows:

“Computational storage is defined as architectures that provide Computational Storage Functions (CSF) coupled to storage, offloading host processing or reducing data movement. These architectures enable improvements in application performance and/or infrastructure efficiency through the integration of compute resources (outside of the traditional compute & memory architecture) either directly with storage or between the host and the storage. The goal of these architectures is to enable parallel computation and/or to alleviate constraints on existing compute, memory, storage and I/O.”

This piece is written by Jason Knight in his capacity as chief product officer (CPO) at machine learning model deployment specialist OctoML — a company known for its acceleration platform that helps software engineering teams deploy machine learning models on any hardware, cloud provider or edge device.

Knight’s full title for this piece reads: A Tale of Two Machine Learning Workloads: How Some ML can Benefit From Computational Storage — and he writes as follows…

Modern machine learning (ML) workloads are driving enormous rises in the use of data and compute. A recent analysis by OpenAI showed that compute requirements in deep learning (a popular subset of machine learning) double every 3.4 months. Utilising computational storage has the potential to reduce cost and power for some portions of the machine learning lifecycle, whereas others are actually travelling in the opposite direction towards highly specialised and centralised compute.
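To put the 3.4-month doubling figure in perspective, the short Python sketch below converts a doubling period into a growth factor over an arbitrary time span. The function name and the chosen time spans are illustrative assumptions, not part of the OpenAI analysis.

```python
def growth_factor(months: float, doubling_period_months: float = 3.4) -> float:
    """Return the multiplicative growth over `months`, assuming steady
    exponential growth with the given doubling period (3.4 months is the
    figure quoted in the text)."""
    return 2.0 ** (months / doubling_period_months)


# Illustrative time spans only; the growth is assumed to stay exponential.
for span in (3.4, 12, 24):
    print(f"{span:>5} months -> x{growth_factor(span):,.1f}")
```

Under that assumption, a single year corresponds to roughly an elevenfold increase, which helps explain why training is moving towards dense, specialised compute.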

ML compute centralising & specialising

In the beginning stages of building a machine learning model, large datasets need to be collected, cleaned and annotated and then processed in a step commonly known as ‘training’. This training procedure is a computationally demanding process. These training runs require exaflops of compute on data… and the compute density and communication overheads required make this process not particularly well suited for computational storage.

Instead, machine learning model training at scale uses increasingly specialised accelerators at the node, rack and multi-rack level. This can be seen most clearly with Google’s recent unveiling of their fourth generation TPU ‘pod’ system, which is entirely specialised for training.

In these architectures, data is streamed from general purpose storage into the computationally dense ‘pod’ just in time before it is needed.

Batch ML inference

Batch ML inference, by contrast, is amenable to computational storage.

While training is not a good fit for computational storage approaches due to the high compute density requirements, machine learning inference often can be. Once a model is trained, applying that model to large datasets requires significantly less compute density and often involves a larger total volume of data than at training time. This difference makes pushing the model closer to the data in a computational storage fashion significantly more appealing.

OctoML’s Knight: Split the streams, but don’t ever cross the beams.

One particularly common and appealing example of this is video analysis. Applying deep learning models to offline video streams in a classical split storage and compute model involves moving a large amount of video data from storage to compute, whereas significant bandwidth can be saved by doing video decode and executing the deep learning model next to or inside the storage device.

Depending on the video compression rates and the ratio of embedded compute available to the density of the storage, distributing inference across many storage devices in this fashion can lead to a high degree of parallelism. In batch analysis scenarios, where latency is not a concern, this can translate into significant cost savings.
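The trade-off described above can be sketched with a back-of-the-envelope model in Python. Everything here is a simplifying assumption for illustration: the function names, the single shared network link, the per-device processing rate and the example figures are hypothetical, and the centralised case assumes the link (not the accelerators) is the bottleneck.

```python
def centralised_hours(num_devices: int, tb_per_device: float, link_gbit_s: float) -> float:
    """Hours to move all stored data over one shared link to central compute.
    Assumes the link is the only bottleneck (hypothetical simplification)."""
    total_bits = num_devices * tb_per_device * 1e12 * 8
    return total_bits / (link_gbit_s * 1e9) / 3600


def in_storage_hours(tb_per_device: float, device_mb_s: float) -> float:
    """Hours for every device to process its own data locally, in parallel.
    Independent of the device count, because each device only reads itself."""
    bytes_per_device = tb_per_device * 1e12
    return bytes_per_device / (device_mb_s * 1e6) / 3600


# Hypothetical figures: 4 TB per device, 100 Gbit/s link, 50 MB/s per-device inference.
for n in (100, 1000):
    print(n, round(centralised_hours(n, 4, 100), 1), round(in_storage_hours(4, 50), 1))
```

In this toy model the in-storage time stays flat as devices are added while the centralised time grows linearly, so which approach wins depends on the ratio of embedded compute to storage density, as the text notes.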

Another aspect of offline, batched machine learning inference that fits well with the computational storage paradigm is that the output of machine learning models is usually significantly smaller (storage-wise) than the input. To use the video analysis example again: while compressed video is typically on the order of 3-5 Mbit/s, the output of deep learning models is usually either a single byte to hundreds of bytes in the case of classification, or tens to hundreds of bytes per second in the case of rolling window activity classification, object detection or segmentation. With this, computational storage can act as an initial processing layer and communicate the resulting, much smaller representation of that video to another process for analytics, filtering or further processing.
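As a rough worked example of that size difference, the Python sketch below compares the byte rate of compressed video with the byte rate of model output. The function name is hypothetical, and the sample values are simply the endpoints of the ranges quoted above (taking 1 Mbit as 1,000,000 bits).

```python
def reduction_ratio(video_mbit_s: float, output_bytes_s: float) -> float:
    """How many times smaller the model output stream is than the video stream."""
    input_bytes_s = video_mbit_s * 1_000_000 / 8  # Mbit/s -> bytes/s
    return input_bytes_s / output_bytes_s


# Endpoints of the ranges quoted in the text: 3-5 Mbit/s in, 10-100 bytes/s out.
for mbit in (3, 5):
    for out_bytes in (10, 100):
        ratio = reduction_ratio(mbit, out_bytes)
        print(f"{mbit} Mbit/s video, {out_bytes} B/s output -> {ratio:,.0f}x smaller")
```

Even at the least favourable end of those ranges, the data leaving the storage device is thousands of times smaller than the video that would otherwise have to be moved.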

One aspect of computational storage and ML that remains to be seen is how much acceleration of ML workloads storage vendors will provide.

General purpose CPUs might be enough to unlock some savings and use cases for ML. But unless acceleration such as Arm’s new Ethos IP also makes its way into computational storage implementations, I expect there will still be a large set of use cases, even in batch inference, where it is worth shuttling the data from storage to dedicated ML acceleration.