NVMe (non-volatile memory express) is a new flash storage protocol that eliminates much of the overhead of legacy protocols such as SAS and SATA. The result is much lower latency and greater throughput for solid-state drives (SSDs) and storage-class memory.
Native NVMe is connected via the PCIe bus and in the enterprise it is being adopted as the standard for local persistent media. Suppliers are also starting to develop products that allow NVMe to operate over a network, where it is known as NVMe-over-Fabrics or NVMf.
To date, we have seen NVMf implemented on Fibre Channel (FC-NVMe), Ethernet (using RoCE) and InfiniBand. FC-NVMe operates over existing Fibre Channel equipment, albeit at the latest hardware revisions and with new firmware and drivers.
Ethernet requires specific hardware adaptors that support RDMA-over-Ethernet, although NVMe-over-TCP is emerging as a practical way to use NVMe over standard Ethernet-based networks and NICs.
We have looked at the big five storage array makers’ efforts in NVMe, but it is a fertile ground for storage startups, too.
As with any storage technology, the ability to create something that goes faster than existing systems is highly prized. Today’s modern applications in the area of finance and data analytics, to name but two, require scalable low-latency storage.
Startup solutions that can give the customer a business advantage will always be appealing. This aligns with the day-to-day need to reduce costs and improve the performance of existing applications.
Of course, to adopt any new technology requires overcoming some of the technical challenges already being experienced in the market.
It would be relatively simple to drop NVMe into existing products and we have already seen suppliers do that.
Meanwhile, NVMe used with storage-class memory (SCM) such as Intel Optane provides the ability to address storage by the byte, rather than in blocks.
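As a sketch of what byte-addressability means for the programming model, the following Python snippet memory-maps a file and updates a single byte in place. The temporary file here is only a stand-in: on real SCM exposed through a DAX-mounted filesystem, the same pattern would result in stores going directly to the media, with no block rewrite.

```python
import mmap
import os
import tempfile

# Create a small file standing in for a DAX-mapped SCM region.
# (On real storage-class memory this would be a file on a
# DAX-mounted filesystem; here it is just a temporary file.)
path = os.path.join(tempfile.mkdtemp(), "pmem.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

# Block-style access would read and rewrite a whole 512B or 4KB
# block; with byte-addressable media we map it and touch one byte.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 4096) as m:
        m[100] = 0xFF   # single-byte store, no block rewrite
        m.flush()       # persist (a CPU cache flush on real SCM)

with open(path, "rb") as f:
    data = f.read()
print(data[100])  # -> 255
```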
What kind of challenges exist with current storage architectures? The most apparent is the need to channel I/O through shared controllers.
The controller provides the ability to map logical storage assignment (a LUN or file system, for example) onto physical media. It holds the metadata that implements data services such as deduplication, compression and snapshots.
But controllers also add overhead and act as a bottleneck, channelling I/O through a double-ended funnel.
Another challenge is to address many NVMe drives in parallel. Legacy storage architectures generally use SAS as a back-end storage protocol. While this offers great scalability – many drives can be connected together in a single system – the protocol still runs a single queue per drive, and that limits overall system performance.
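The effect of queue parallelism on that ceiling can be illustrated with Little's Law: achievable IOPS is roughly the number of outstanding commands divided by per-command latency. The latency figure below is illustrative, not a vendor measurement:

```python
# Little's Law: achievable IOPS ~ outstanding commands / latency.
# Assume an illustrative 100 microsecond NAND read latency.
latency_s = 100e-6

def iops(outstanding: int) -> float:
    """Commands completed per second with this many in flight."""
    return outstanding / latency_s

# A SAS-style single queue per drive at depth 32:
print(f"{iops(32):,.0f}")        # 320,000 IOPS ceiling

# NVMe allows many queues per drive; even a modest 8 queues
# at depth 32 raises the ceiling substantially:
print(f"{iops(8 * 32):,.0f}")    # 2,560,000 IOPS ceiling
```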
But simply swapping SAS for NVMe is not practical either. NVMe drives sit on the PCIe bus, which supports far fewer devices per server than a SAS backplane. To counter that limitation, NVMe SSDs are starting to appear in multi-terabyte capacities per drive, which will help.
So how have architectures developed to make full use of NVMe?
There are a range of techniques in use. We will look at a few of them and how suppliers have translated these ideas into products.
NVMe option 1: Eliminate the controller
If the controller represents a bottleneck, then the answer for some has been to remove it from the equation. When there is no funnel to constrict I/O, servers can access much more of the I/O available from each drive by reading and writing to the drive directly.
In these products, the host application server talks directly to the media and bypasses the need for a central set of controllers. Of course, there is a trade-off here because the controller does add value in shared storage.
E8 Storage takes this approach. Metadata is stored on a separate server available to all of the hosts, and a small amount of compute resource is required on each host to manage metadata locally. In future, this could be replaced by SmartNICs that offer embedded processing and can offload metadata management to the RDMA NIC.
Apeiron’s ADS1000 storage appliance allows host servers fitted with custom NICs to talk directly to drives in the 2U chassis. As a result, data services have to be implemented at the host layer, but that adds an overhead of only 2μs to 3μs.
NVMe option 2: Scale out
Vexata implements an architecture that scales front-end connectivity and back-end storage independently. Front-end I/O controllers (IOCs) connect to back-end enterprise storage modules (ESMs) using an Ethernet midplane. Metadata that describes the use of storage media is retained on the ESMs and in DRAM on the IOCs. Processing capacity can be increased by adding more ESMs or IOCs as required.
Pavilion Data has designed hardware that resembles a network switch architecture. A single 4U chassis can accommodate between two and 20 controllers and up to 72 NVMe SSDs. Each controller provides four 100GbE host connections. Performance can be scaled at the front end by adding more controllers and at the back end by adding more drives. Metadata is managed on two redundant management cards.
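Using only the figures quoted above, a quick back-of-envelope calculation shows the aggregate front-end line rate of a fully populated chassis. This is raw link speed, not measured throughput:

```python
# Fully populated Pavilion chassis, per the figures above.
controllers = 20
ports_per_controller = 4
gbit_per_port = 100  # 100GbE

total_gbit = controllers * ports_per_controller * gbit_per_port
print(total_gbit)        # 8000 Gbit/s aggregate line rate
print(total_gbit // 8)   # 1000 GB/s, i.e. ~1 TB/s raw front-end
```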
NVMe option 3: Software-defined
All the startup solutions presented so far are based on new hardware designs, but another route being taken by startups such as Excelero and WekaIO is to eliminate the storage hardware and go completely software-defined.
Of course, there has to be some storage hardware somewhere, but the benefit of the software-defined architecture is that solutions can be implemented either as a hyper-converged infrastructure (HCI), dedicated storage or even in public cloud.
Excelero has created a storage solution called NVMesh that implements NVMe-over-Fabrics through a proprietary protocol called RDDA.
Where RDMA connects multiple servers together and provides direct memory access, RDDA takes that a step further and makes NVMe storage devices accessible across the network. This is achieved without involving the processor of the target server, delivering a highly scalable solution that can be deployed in multiple configurations, including HCI.
WekaIO Matrix is a scale-out file system that is deployed across multiple servers in a cluster. Matrix is capable of scaling to thousands of nodes and supporting billions of files. The Matrix architecture allows direct communication with NVMe media and NICs, bypassing much of the Linux I/O stack.
Applications see what looks like a local file system, although data is distributed and protected across many nodes using a proprietary erasure-coding scheme called DDP (distributed data protection).
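DDP itself is proprietary, but the underlying idea of erasure coding can be sketched with the simplest possible scheme: single-parity XOR striping across nodes. This is a generic illustration only, not WekaIO's actual algorithm, which is more sophisticated:

```python
# Generic single-parity erasure coding sketch: data striped
# across nodes plus a parity stripe that can rebuild a lost one.
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

stripes = [b"node0dat", b"node1dat", b"node2dat"]  # data on three nodes
parity = xor_blocks(stripes)                       # stored on a fourth

# Lose node 1: XOR the surviving stripes with the parity to rebuild it.
recovered = xor_blocks([stripes[0], stripes[2], parity])
print(recovered == stripes[1])  # -> True
```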
Matrix can also be run in public cloud (in Amazon Web Services today) on virtual instances that support local NVMe or SSD storage.
NVMe future developments
The systems we have discussed here use Ethernet/RDMA, InfiniBand or Fibre Channel to network storage and application servers together.
NVMe/TCP is emerging as the next evolution of NVMe-over-Fabrics and should provide the ability to implement high-speed storage with more commodity hardware.
As a result, we may see more startups in this space, as the technology required to build solutions becomes more mainstream.