For more than 30 years the storage industry has relied on the small computer system interface (SCSI) protocol for communication between servers and storage, and within the storage array itself. And although the physical connection has evolved, the protocol has remained relatively constant for many years.
But with the advent of NAND flash storage, issues have started to arise. Flash is orders of magnitude faster than spinning disk and can handle many requests in parallel. As suppliers push the limits of scalability to drives with tens of terabytes of flash, SCSI has increasingly become the bottleneck to making full use of flash storage.
NVMe and the potential for flash
NVMe is a protocol, not a form factor or type of media. Physical NVMe-enabled devices come in a range of form factors, such as AIC (Add-in Card, or what is traditionally known as a PCIe card), U.2 (similar to a traditional hard disk drive) and M.2 (a memory stick). All use PCIe as the interface bus.
NVMe reduces the software latency involved in talking to flash, improves hardware interrupt times (processor to device performance) and increases the parallel processing of requests with more and deeper input/output (I/O) queues than SCSI (65,535 queues, and a queue depth of 65,535). The result is much higher throughput (IOPS and data volumes) and lower I/O latency.
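To put those queue figures in perspective, the short sketch below multiplies queues by queue depth to get an upper bound on commands that can be outstanding at once. The NVMe numbers are the ones quoted above; the single queue of depth 32 used for comparison is an assumption (it is the figure commonly cited for AHCI/SATA) and does not come from this article.

```python
# Figures quoted in the article for NVMe.
NVME_QUEUES = 65_535
NVME_QUEUE_DEPTH = 65_535

# Assumption (not from the article): a legacy single-queue interface
# with a depth of 32, as is commonly cited for AHCI/SATA.
LEGACY_QUEUES = 1
LEGACY_QUEUE_DEPTH = 32


def max_outstanding(queues: int, depth: int) -> int:
    """Upper bound on commands that can be queued at the same time."""
    return queues * depth


nvme = max_outstanding(NVME_QUEUES, NVME_QUEUE_DEPTH)
legacy = max_outstanding(LEGACY_QUEUES, LEGACY_QUEUE_DEPTH)

print(f"NVMe:   {nvme:,} outstanding commands")
print(f"Legacy: {legacy:,} outstanding commands")
```

This is a theoretical ceiling rather than something a real device will reach, but it shows why the protocol stops being the limiting factor for parallel requests.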
Suppliers have started to adopt NVMe, using it internally to build new and faster storage arrays. There are also newer architectures based on NVMe over fabrics (NVMf), which can take advantage of Fibre Channel and Ethernet networks.
Here, we look at supplier solutions and how they are using NVMe within existing and future products.
NVMe in storage products
X-IO is well known for its ISE range of sealed disk units. It recently emerged from financial difficulties with the new Axellio platform, a 2U server-based dual-controller architecture with up to four Intel Xeon E5-2699v4 processors (88 cores in total), up to 2TB of DRAM and from one to six FlashPacs, each of which can hold up to 12 dual-ported NVMe solid-state drives (SSDs). Total capacity with 6.4TB drives is currently 460TB per system.
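The capacity figure follows directly from the component counts. As a quick check, the sketch below multiplies FlashPacs by drives per FlashPac by drive size, working in whole gigabytes to avoid rounding noise. It assumes decimal units (1TB = 1,000GB), which the article does not state.

```python
FLASHPACS = 6            # maximum per system, from the article
DRIVES_PER_FLASHPAC = 12  # maximum per FlashPac, from the article
DRIVE_GB = 6_400         # 6.4TB drive; assumes 1TB = 1,000GB


def raw_capacity_gb(packs: int, drives_per_pack: int, drive_gb: int) -> int:
    """Raw (unprotected, unreduced) capacity in gigabytes."""
    return packs * drives_per_pack * drive_gb


total_drives = FLASHPACS * DRIVES_PER_FLASHPAC
total_gb = raw_capacity_gb(FLASHPACS, DRIVES_PER_FLASHPAC, DRIVE_GB)

print(f"{total_drives} drives, {total_gb / 1000:.1f}TB raw")
```

That gives 72 drives and 460.8TB of raw capacity, in line with the rounded 460TB quoted.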
The core of the system is a PCIe fabric, called FabricXpress, that connects both controllers to each dual-ported drive. This allows X-IO to claim up to 12 million IOPS (4KB) and 60GBps sustained throughput at 35µs latency. Axellio is, at its core, a dual-controller architecture, but with 88 cores available the platform can be used as the basis of a traditional storage device or a scale-out hyper-converged platform. Support for additional plug-in modules makes it possible to run analytics or other process-intensive workloads, and this is where the NVMe architecture offers real value, by putting compute as close as possible to storage.
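As a rough cross-check, the bandwidth implied by an IOPS figure is simply IOPS multiplied by I/O size. The sketch below applies that to the 12 million 4KB IOPS quoted for Axellio. It assumes 4KB means 4,096 bytes and that every I/O is exactly that size, neither of which the supplier figures spell out.

```python
def implied_bandwidth_gb_per_s(iops: int, io_size_bytes: int) -> float:
    """Bandwidth in decimal gigabytes per second (1GB = 10**9 bytes)."""
    return iops * io_size_bytes / 10**9


AXELLIO_IOPS = 12_000_000  # quoted in the article
IO_SIZE_BYTES = 4 * 1024   # assumption: 4KB = 4,096 bytes

gb_per_s = implied_bandwidth_gb_per_s(AXELLIO_IOPS, IO_SIZE_BYTES)
gbit_per_s = gb_per_s * 8  # same rate expressed in gigabits per second

print(f"{gb_per_s:.1f} GB/s ({gbit_per_s:.0f} Gbit/s)")
```

That comes to roughly 49 gigabytes per second of small-block traffic, a useful reminder to check whether supplier throughput figures are quoted in bits or bytes.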
Pure Storage upgraded its existing FlashArray architecture with the release of FlashArray//X in April 2017. In fact, the FlashArray product was already capable of supporting NVMe; the capability is now delivered through a new //X70 controller and DirectFlash NVMe drive modules. The result is a halving of latency compared to FlashArray//M, with twice the throughput and a fourfold increase in what Pure calls performance density. The performance improvements from FlashArray//X may not seem that great (remember, customers still have to use standard Fibre Channel and iSCSI protocols at the front end), but the performance density improvements allow for a much more compact footprint. Pure achieves almost a petabyte of capacity per 3U of rack space in a single chassis (assuming 5:1 space reduction), with greater performance than the previous FlashArray//M.
The more interesting story for Pure is the future development of NVMe-over-Fabrics, which should increase front-end performance and allow a single controller head to address more flash capacity. DirectFlash shelves aren’t yet available, but Pure promises to extend system capacity with up to 512TB of additional flash using 50Gbps Ethernet and Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE).
Staying with the NVMe over fabrics theme, Excelero is a startup looking to use NVMf to develop a scale-out node-based architecture called NVMesh. An NVMesh system has multiple controllers connected through Converged Ethernet and RoCE using a proprietary technology called Remote Direct Drive Access (RDDA). This allows any node to access any drive with little or no processor overhead in the system where the drive resides. Like Axellio, NVMesh can be deployed in a hyper-converged arrangement where each node provides compute and storage, or as a dedicated storage platform, with compute nodes running a client block driver.
Unlike Axellio, NVMesh is sold as a software product, which allows customers to use their own hardware or to buy hardware from partners such as Micron, which incorporates NVMesh into its SolidScale product. Excelero claims NVMesh can exploit nearly 100% of the NVMe capacity for host I/O, and that certainly makes sense with this kind of disaggregated architecture. But there are compromises. Currently, data protection is limited to RAID-0, RAID-1 and RAID-10, with no support for space reduction technology unless it is implemented by the client. Space reduction is, however, promised on the roadmap.
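To see what those protection options mean for usable capacity, the sketch below applies the standard overheads for striping and mirroring. It assumes two-way mirrors for RAID-1 and RAID-10, and uses a hypothetical pool of 24 drives of 6.4TB each; neither assumption comes from Excelero.

```python
# Number of copies of each block kept under each scheme.
# Assumption: RAID-1 and RAID-10 use two-way mirroring.
COPIES = {
    "RAID-0": 1,   # striping only, no redundancy
    "RAID-1": 2,   # mirroring
    "RAID-10": 2,  # striping across mirrored pairs
}

# Hypothetical pool, for illustration only: 24 drives of 6,400GB each.
RAW_GB = 24 * 6_400


def usable_gb(raw_gb: int, level: str) -> int:
    """Usable capacity in gigabytes before any space reduction."""
    return raw_gb // COPIES[level]


usable = {level: usable_gb(RAW_GB, level) for level in COPIES}

for level, gb in usable.items():
    print(f"{level}: {gb / 1000:.1f}TB usable of {RAW_GB / 1000:.1f}TB raw")
```

With no parity-based RAID and no space reduction, any protected configuration in this sketch gives up half the raw flash, which is why the roadmap items matter for cost per usable terabyte.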
E8 Storage is another startup that uses a fabric-based approach to connect clients and storage. The E8-S24 and E8-D24 series appliances split the I/O path and the control plane across separate hardware. E8 disk shelves house 24 NVMe SSDs and offer four or eight 100GbE ports respectively. A single shelf can connect to up to 96 client servers using RDMA NICs. Data services (availability and management) are handled through a pair of E8 controllers that don’t sit in the data path.
Like Excelero, E8 Storage disaggregates NVMe capacity from system management and takes the controller out of the I/O path. This provides for much greater system scalability without the need to deploy large numbers of Xeon processors in each controller, but does introduce client-side complexity through the use of additional drivers. In return, E8 claims latencies of 100µs (read) and 40µs (write), with 10 million read IOPS and one million write IOPS, and throughput of 40GBps and 20GBps respectively.
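One way to read figures like these is through Little's law, which relates throughput and latency to the number of requests in flight: concurrency equals throughput multiplied by latency. The sketch below applies it to E8's claims, on the assumption (not stated by the supplier) that the quoted IOPS and latency figures were achieved at the same time.

```python
def requests_in_flight(iops: int, latency_us: int) -> float:
    """Little's law: average outstanding requests = rate x time in system.

    latency_us is in microseconds, so divide by one million to get seconds.
    """
    return iops * latency_us / 1_000_000


# Figures quoted in the article for E8 Storage.
reads_in_flight = requests_in_flight(iops=10_000_000, latency_us=100)
writes_in_flight = requests_in_flight(iops=1_000_000, latency_us=40)

print(f"Reads in flight:  {reads_in_flight:.0f}")
print(f"Writes in flight: {writes_in_flight:.0f}")
```

On those assumptions, around 1,000 reads would need to be outstanding across the shelf at any moment to sustain the quoted rate, spread over its client servers.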
Apeiron Data Systems is another startup that runs NVMe over a network and takes the controller out of the data path. In this case, Apeiron’s ADS1000 platform uses a protocol called NVMe-over-Ethernet, which requires custom host bus adapters (HBAs) in each client. The HBAs use Intel Altera FPGAs to package up NVMe requests and send them over layer 2 Ethernet, enabling a scale-out architecture that promises latency as low as 100µs, with each enclosure providing a maximum of 384TB of capacity (24 × 16TB drives). It’s interesting that Apeiron has also quoted performance figures with Intel Optane that claim 12µs read/write latency. This points to where architectures that remove the controller bottleneck are headed in terms of performance.
Finally, we should also mention Kaminario, which recently announced its K2.N platform. This is a composable storage infrastructure that allows independent scaling of storage shelves and controllers. At the back end, Kaminario controllers (c.nodes) are connected to storage capacity (m.nodes) using NVMe over Fabrics. At the front end, host connectivity is extended to include NVMf in addition to Fibre Channel and iSCSI.
Where to for NVMe flash architectures?
We can see three distinct patterns emerging:
- The use of NVMe-over-fabrics for host connectivity in place of Fibre Channel or iSCSI. This requires no new hardware because existing HBAs can support NVMf.
- Disaggregated architectures that take the controller out of the data path. Most of these need custom or (likely) more expensive HBAs and host drivers.
- The trend to using NVMe-over-fabrics at the back end of the system.
I think we can expect NVMe to replace SAS and SATA in all performance-oriented products before long. The challenge for customers could be how to implement shared storage that doesn’t use the traditional shared storage array. This means thinking in a more integrated fashion about the whole stack, perhaps along the lines of how hyper-converged infrastructure has developed.