This is a guest post for the Computer Weekly Developer Network written by Curtis Anderson in his capacity as senior software architect at Panasas — a company known for its PanFS parallel file system on the ActiveStor ‘ultra turnkey’ appliance for HPC workloads.
Anderson is responsible for advanced projects and future planning at Panasas, concentrating on OSD (Object Storage Device) architecture and client strategy… and his comments come in the wake of Computer Weekly’s series and feature devoted to examining computational storage.
Anderson writes as follows…
As we have learned from discussions already here, computational storage alleviates some challenges, but it brings others. What we can say is that good architecture for storage subsystems will balance the gains against the losses.
For example, the processing capability inside a Computational Storage (“CS”) device will have higher bandwidth and IOPs to the storage media than the CPU in the compute node holding the CS device. That’s great for media-bottlenecked tasks, but since the media within a CS device only contains what it contains, if the data you want is stored on some other CS device, this one will sit idle.
Unless you’ve planned for a good technique to distribute both the data and the workload, you will have bottlenecks that limit aggregate performance… and so, wasted capital in terms of the purchase cost of the CS devices sitting idle for some part of their day.
Another challenge is [the job of] writes.
A CS device can very likely do a great job of committing complex updates to its local storage. But, if that CS device or the computer that is hosting it goes down, that data and those updates are inaccessible.
Replicating the data and applying identical updates in parallel on multiple CS devices is an obvious answer, but that comes with the computational cost of 3x more transactions overall and potentially significant slowdowns if synchronous replication of the results of the transaction was chosen in the architecture instead of totally parallel transactions.
If the problem that you’re trying to architect for is a bandwidth problem (eg: searching for a needle in a haystack) versus it being a latency problem (ie: what ad should I put up next to the page the user is seeing?), the tradeoffs will be potentially quite different.
Application designer responsibilities
The whole point of CS devices is that they have higher performance access to their internal storage than the main CPU does. An application designer needs to identify a subset problem of their overall problem that has high data access requirements and very little cross-talk to other places where data is stored.
When it comes to refactoring for a CS-compliant future, the biggest single challenge for nearly all new technologies in the high-tech space is how do I help and/or convince the customer to move to my shiny new tech?
Clearly, lowering the barriers to customer adoption is a huge help. No one wants to take on actually refactoring a large legacy application, then debugging it, unless they’re forced to by some external requirement. It is usually much easier to let sleeping dogs lie and start building a new application with a new architecture.
But, having said that, if there are APIs inside the application where the server of that API can be ported to a CS device, that will dramatically reduce the refactoring required.
Haystack needles to big hammers
It’s important to remember that the benefits of CS can be much more than latency. In one product, the bandwidth between the internal processors and the flash itself is an order of magnitude higher than the interface to the host CPU. Searching for needles in haystacks is best done by finessing interesting algorithms, but sometimes you just want a big hammer.
The HPC industry knows how to distribute computation across multiple ‘computers’, so a single compute node with several CS devices can be seen as a cluster-in-a-chassis, to positive effect.
Choose your poison
But, the challenges with cluster-in-a-chassis solutions are two-fold.
First, if that one chassis goes down, your entire cluster has gone down. That is an example of a ‘coordinated failure mode’ that essentially means you need two such chassis’ at each edge site if you care about availability.
With that, you’ll either have only 50% utilisation on each chassis or suffer a 50% drop in total performance when one chassis goes down.
So you need to choose your poison, basically.
A more distributed architecture can sustain the loss of any one node and the performance hit on the cluster as a whole after such a failure will be 1/N rather than N/2.
The second issue is the risk of overprovisioning at each edge site. IoT devices tend not to produce vast quantities of data, they’re much more likely to drip out very small packets of data on a continuous basis. One CS device may have more than enough power for all of the IoT devices that will be at or connected to, that site.
The CS device can then be seen as a single-board computer rather than a storage device. That’s good, but a CS device (in general) hasn’t been built to be standalone, it needs an Ethernet port rather than an NVMe port, it needs a H/A strategy, etc, etc.