How will NVMe flash affect storage area networking, in particular with the introduction of NVMe-over-fabrics?
The transport protocol deployed has been predominantly Fibre Channel, with iSCSI also evident for smaller, less performance-hungry use cases.
But now the performance limits of long-standing storage protocols such as Serial-attached SCSI (SAS) and Serial ATA (SATA) are being challenged by NVMe.
So, with NVMe being carried over Fibre Channel, Ethernet, InfiniBand and others in NVMe over Fabrics, how has the market responded so far?
Storage protocol primer
Fibre Channel (FC) is a transport layer for the SCSI storage protocol, and FC devices – host bus adaptors (HBAs) and switches – provide the mechanism to physically connect hosts to shared storage. The underlying SCSI storage protocol is one we’ve used for nearly 40 years.
The iSCSI protocol is a little more self-explanatory. In this instance, the transport layer for SCSI is TCP/IP over Ethernet.
FC and iSCSI have been great at delivering low latency for shared storage in a world built on disk-based and hybrid arrays. But, as we move to solid-state media (flash and storage-class memory), the SCSI protocol stack has become a relatively inefficient source of latency.
So, the industry developed NVMe as a direct-access protocol that connects fast storage via the PCI Express bus of the server, with much less overhead than the SCSI-based SAS or the ATA-based SATA interfaces.
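One reason NVMe carries so much less overhead is its queue model. As an illustration (the figures below come from the AHCI and NVMe specifications, not from the article), a single AHCI-attached SATA device exposes one command queue of 32 entries, whereas NVMe allows up to roughly 64K queues of 64K commands each, letting multi-core hosts keep far more I/O in flight:

```python
# Illustrative comparison of command queue models.
# Figures are spec-level maximums (AHCI and NVMe), not vendor claims.
QUEUE_MODELS = {
    "SATA/AHCI": {"queues": 1, "depth": 32},
    "NVMe": {"queues": 65_535, "depth": 65_536},
}

def max_outstanding_commands(model: str) -> int:
    """Upper bound on commands a host can keep in flight per device."""
    m = QUEUE_MODELS[model]
    return m["queues"] * m["depth"]

for name in QUEUE_MODELS:
    print(f"{name}: up to {max_outstanding_commands(name):,} outstanding commands")
```

The per-core, lock-free queue pairs are what let NVMe avoid the serialisation that the single AHCI queue imposes.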
NVMe will be the future protocol of choice for solid-state media in the server.
NVMe over Fabrics
The next logical step for NVMe is to emulate what was done for SCSI with an implementation that can be used over a fabric, such as Fibre Channel, Ethernet or InfiniBand. This is exactly what NVMe over Fabrics delivers, with high-speed, low-latency NVMe over a network.
NVMe over Fabrics describes solutions that connect hosts to storage across a network fabric using the NVMe protocol. There are currently two main implementations: NVMe over Fibre Channel (FC-NVMe), which maps NVMe commands onto an existing Fibre Channel fabric, and NVMe over RDMA (remote direct memory access).
The RDMA implementation can use a range of transports that include InfiniBand, RDMA over Converged Ethernet (RoCE) and Internet Wide Area RDMA Protocol (iWARP).
NVMe over Fabrics implementations
Having the ability to use multiple physical transports provides for a wide range of implementation scenarios.
IT environments that already implement Gen5 (16Gbps) or Gen6 (32Gbps) Fibre Channel will be able to take advantage of FC-NVMe without hardware replacement.
SCSI-based Fibre Channel and NVMe-based Fibre Channel can exist on the same fabric at the same time.
This means IT organisations that replace existing hardware with solutions that support FC-NVMe at the front end of an array won’t need to rip and replace to take advantage of the technology and should see an instant improvement in performance.
But, other implementations of NVMe over Fabrics, including those that work over Ethernet and InfiniBand, will require new hardware.
This means deploying dedicated switch equipment, such as 40GbE or 100GbE Ethernet switches. At the host, dedicated RDMA network interface controllers (RNICs) that support RoCE are required.
RoCE v1 is an Ethernet layer protocol, which limits the connection of hosts to a single Ethernet broadcast domain. Meanwhile, RoCEv2 is implemented at the internet layer with User Datagram Protocol (UDP) and so is routable. iWARP is also routable – it is implemented at the TCP layer – although to date this doesn’t appear to be widely implemented by suppliers.
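The routability constraints described above can be summarised in a short sketch (an illustrative model, not a real API; the layer and routability values are taken from the text):

```python
# Summary of the RDMA transport options described above.
# "routable" means the transport can cross IP subnet boundaries.
RDMA_TRANSPORTS = {
    "RoCE v1": {"layer": "Ethernet", "routable": False},
    "RoCE v2": {"layer": "UDP/IP", "routable": True},
    "iWARP": {"layer": "TCP/IP", "routable": True},
}

def reachable(transport: str, same_broadcast_domain: bool) -> bool:
    """Can a host reach a target over this transport, given the topology?"""
    t = RDMA_TRANSPORTS[transport]
    return t["routable"] or same_broadcast_domain

# RoCE v1 hosts must share a broadcast domain; RoCE v2 and iWARP need not.
print(reachable("RoCE v1", same_broadcast_domain=False))  # False
print(reachable("RoCE v2", same_broadcast_domain=False))  # True
```

In practice this is why RoCE v1 deployments tend to be confined to a single rack or pod, while RoCE v2 and iWARP can span a routed datacentre network.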
In practice, unless ultra-low-latency performance is required across all applications in the datacentre, it’s likely we will see NVMe over Fabrics Ethernet/InfiniBand solutions deployed tactically for only those applications that will really benefit.
These will be, for example, financial trading applications, real-time analytics (fraud detection and security) and machine learning/AI applications where ultra-low latency reduces training times.
NVMe over Fabrics in products
There is a range of implementations from suppliers that include support for NVMe over Fabrics on existing products and solutions that implement disaggregated, hyper-converged or rack-scale storage.
NetApp has two solutions that use NVMe over Fabrics, based on its E-series and Ontap respectively. The EF570 (all-flash) and E5700 (hybrid flash) arrays support NVMe over Fabrics using 100Gbps InfiniBand, with sub-100µs latencies. The recently announced AFF A800 offers FC-NVMe support, with latencies of 200µs or better.
Kaminario has a scale-out solution called K2.N that separates control and data into separate nodes. Front-end connectivity in the Kaminario product supports NVMe over Fabrics, Fibre Channel and iSCSI, with claimed latency of around 100µs.
E8 Storage implements a slightly different design that disaggregates a standard shared appliance architecture and devolves some functions of shared storage to hosts that connect to a shared storage appliance. The aim is to alleviate bottlenecks typically experienced with dual-controller architectures. Hosts connect via RDMA-enabled adaptors over Ethernet or InfiniBand, from 10Gbps up to 100GbE or 100Gbps InfiniBand. Latencies are claimed to be as low as 100µs (read) and 40µs (write).
Apeiron Data also offers a disaggregated solution. The ADS1000 shared storage appliance connects to hosts using a custom HBA that supports 40Gbps Ethernet and implements a custom protocol called NVMe over Ethernet. Apeiron claims its solution introduces around only 2.7µs of overhead into host I/O, with Nand-based storage delivering 100µs latency and Optane-based solutions around 12µs read and write performance.
Pavilion Data Systems has developed a rack-scale NVMe over Fabrics solution that deploys as a 4U chassis with up to 72 NVMe SSDs. Front-end connectivity is established using between two and 10 I/O line cards, each of which supports two storage controllers and four 100GbE Ethernet network ports. Each host requires RNICs that support standard RDMA. The architecture differs from other solutions in that each NVMe drive is dedicated to a single host. Pavilion Data claims 120Gbps of throughput with 100µs of latency.
Excelero has a software-defined scale-out architecture that can be used to build a dedicated storage appliance or be implemented as hyper-converged infrastructure (HCI). NVMesh connects multiple servers together, each of which can be consumers (clients) and providers (targets) of storage. Each node in the architecture uses RNICs and NVMe drives to virtually eliminate any CPU overhead on targets. Communication is through a custom protocol called Remote Direct Drive Access (RDDA).
NVMe over Fabrics’ future promise
Suppliers have started to offer NVMe over Fabrics-enabled products, with many more saying they will bring NVMe over Fabrics to market over the coming weeks and months. All these solutions use NVMe as a back-end protocol with either Nand or Optane media, without which the front-end performance could not be delivered.
We can see the emergence of a number of different designs. There are “retrofitted” traditional arrays, new architectures that disaggregate or otherwise look to remove the bottleneck of the controller, and rack-scale solutions that consolidate connectivity into a single chassis.
Another point of interest is to examine the input/output (I/O) overhead generated by the platform and media.
Where the I/O path has been heavily optimised, the latency of solutions built on Nand flash or Intel Optane is not much greater than that of the media itself.
As a result, we are likely to see products that use storage-class memory such as Optane where I/O performance justifies the cost.
Not all suppliers will be able to do this, however, as the benefit will not always be realised; in those cases, hybrid solid-state solutions will be the way forward.
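The latency figures Apeiron claims, quoted earlier, illustrate why the platform overhead matters more as media gets faster: a fixed fabric overhead that is negligible against Nand latency becomes a substantial share of every I/O against Optane. A rough back-of-envelope sketch using those claimed numbers:

```python
# Back-of-envelope calculation using latency figures claimed earlier in
# the article: ~2.7us of fabric overhead, ~100us Nand, ~12us Optane.
FABRIC_OVERHEAD_US = 2.7

def overhead_share(media_latency_us: float) -> float:
    """Fraction of total I/O latency spent in the fabric, not the media."""
    return FABRIC_OVERHEAD_US / (media_latency_us + FABRIC_OVERHEAD_US)

print(f"Nand:   {overhead_share(100):.1%}")  # a few percent of total latency
print(f"Optane: {overhead_share(12):.1%}")   # a much larger share
```

This is why storage-class memory only pays off on platforms whose I/O path has been aggressively optimised: otherwise the fabric, not the media, dominates the latency the application sees.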
Read more about NVMe flash
- New media provide a range of options to speed workloads, from “old-fashioned” flash to storage-class and persistent memory. We help you exploit the storage performance hierarchy.
- NVMe could boost flash storage performance, but controller-based storage architectures are a bottleneck. Does hyper-converged infrastructure give a clue to the solution?