Enterprise NAS performance challenges and how to resolve them

NAS storage can be used for mission-critical workloads, but NFS-based storage comes with performance challenges. Jim Bahn of Virtual Instruments outlines the main ones.

Jim Bahn, Virtual Instruments

Published: 06 Oct 2016

Business critical and network-attached storage (NAS) are terms rarely seen together.

Storage area network (SAN)-based storage built on Fibre Channel infrastructure has dominated enterprise datacentres for nearly two decades. This is chiefly due to its performance and reliability for the most mission-critical workloads.

Meanwhile, VMware, NetApp, Dell EMC, Microsoft and others are all proponents of using network file system (NFS)-based storage systems to support mission-critical workloads such as databases, virtual desktop infrastructure (VDI) and other performance-sensitive applications.

NAS users, however, face issues specific to file access storage via NFS that hinder its take-up in some critical use cases. These include:

Metadata bottlenecks

Unique to NFS are concerns around metadata. In a scale-out NAS environment, access to any file has to be handled via metadata commands, which can create communication bottlenecks between nodes in the cluster. NFS, like any file system, uses metadata to track information about each file it stores. This metadata can be as basic as creation and modification dates, or as detailed as access permissions and the serial number of the device that created the data. In a production NFS environment, metadata can account for as much as 90% of the input/output (I/O) command stream. Detecting metadata issues early is critical to scaling NAS to its specified capabilities.
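To make the metadata share concrete, the sketch below computes the fraction of metadata operations in one sampling window. The operation names are NFS version 3 procedures, but which of them count as metadata, the function name and the sample counts are all assumptions made for illustration, not measurements or a supplier's method.

```python
# Assumption: these NFSv3 procedures are treated as "metadata" operations.
# The grouping is illustrative; a real monitoring tool may classify differently.
METADATA_OPS = {
    "GETATTR", "SETATTR", "LOOKUP", "ACCESS",
    "READDIR", "READDIRPLUS", "FSSTAT",
}


def metadata_share(op_counts):
    """Return the fraction of operations in op_counts that are metadata operations."""
    total = sum(op_counts.values())
    if total == 0:
        return 0.0
    meta = sum(count for op, count in op_counts.items() if op in METADATA_OPS)
    return meta / total


# Hypothetical counts for one sampling window (invented numbers).
sample = {"GETATTR": 5200, "LOOKUP": 2100, "ACCESS": 1700, "READ": 600, "WRITE": 400}

print(f"metadata share: {metadata_share(sample):.0%}")
```

Tracking this ratio over time, rather than as a single figure, is one way to spot a metadata-heavy workload before it becomes a bottleneck.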

Rogue clients and noisy neighbour issues

A common performance complaint involves file locks left hanging by network lock managers, which can cause applications to slow down or stop. It’s vital to watch for evidence of rogue clients, which can hold a critical file lock and block other application processes from executing. Another area of contention is the noisy neighbour problem. In the storage context, a “noisy neighbour” is a rogue virtual machine (VM) that periodically monopolises storage I/O resources to the detriment of other VMs in the environment. This phenomenon can become more pervasive as VM density per host increases. Unless it’s properly monitored, predictable performance is impossible.
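As a rough illustration of noisy neighbour detection, the sketch below flags any VM whose share of total I/O in a sampling window exceeds a threshold. The function name, the 50% threshold and the per-VM figures are hypothetical; a production tool would work from measured per-VM I/O and a threshold tuned to the environment.

```python
def noisy_neighbours(iops_by_vm, share_threshold=0.5):
    """Return the VMs whose share of total IOPS in the window exceeds share_threshold."""
    total = sum(iops_by_vm.values())
    if total == 0:
        return []
    return sorted(
        vm for vm, iops in iops_by_vm.items() if iops / total > share_threshold
    )


# Hypothetical IOPS per VM for one sampling window (invented numbers).
window = {"vm-a": 300, "vm-b": 250, "vm-c": 4200, "vm-d": 250}

print(noisy_neighbours(window))
```

Because a noisy neighbour is periodic by definition, a check like this only helps if it runs over short windows and its results are kept for comparison over time.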

Server/VM latency issues

Poor storage performance is generally the result of high I/O latency. It’s important to monitor latency from the VM to the file system for each host-to-file flow. Keep in mind that, because of the protocol, application I/O is bursty, so users who report on five-minute averages will almost always miss important events. Problems are often caused by poor load balancing, although careful monitoring can easily prevent this.
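The effect of averaging over a five-minute window can be shown with a small sketch. It assumes one latency sample per second and an invented five-second burst; the function name and all figures are hypothetical.

```python
from statistics import mean


def summarise(latencies_ms):
    """Return the mean, nearest-rank 99th percentile and maximum of the samples."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Nearest-rank percentile using integer arithmetic: rank = ceil(0.99 * n).
    rank = max(1, (99 * n + 99) // 100)
    return {"mean": mean(ordered), "p99": ordered[rank - 1], "max": ordered[-1]}


# Hypothetical five-minute window: 295 seconds at 2 ms, then a 5-second burst at 200 ms.
samples = [2.0] * 295 + [200.0] * 5

stats = summarise(samples)
print(f"mean={stats['mean']:.1f} ms  p99={stats['p99']:.0f} ms  max={stats['max']:.0f} ms")
```

In this invented window the mean stays in single-digit milliseconds while the percentile and maximum expose the burst, which is why fine-grained samples or high percentiles are more useful than long averages.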

Poor write performance

Writing large, sequential files over an NFS-mounted file system can cause a severe decrease in the file transfer rate to the NFS server. The NFS version 3 COMMIT remote procedure call (RPC) does allow for reliable asynchronous writes, at the potential, though uncommon, cost of data loss. This means constant monitoring is critical. If the user monitors correctly and doesn’t see the expected write performance, application programmers can adjust several settings, such as write datasync action, write datasync reply, write filesync action, write filesync reply, write transfer max, write transfer multiple, write transfer preferred, write unstable action and write unstable reply.
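Before changing any settings, it helps to have a baseline figure for sequential write throughput. The sketch below is one minimal way to take such a measurement from a client, assuming the directory passed in sits on the file system under test; the function name, file size and block size are illustrative choices, and the example run uses the local temporary directory rather than a real NFS mount.

```python
import os
import tempfile
import time


def sequential_write_mb_per_s(directory, total_mb=8, block_kb=64):
    """Write total_mb of zeros sequentially into directory and return MB/s.

    The file is flushed and fsync'd before the clock stops, so data still
    sitting in client-side caches is not counted as written.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # Clean up the temporary test file.
    return total_mb / elapsed if elapsed > 0 else float("inf")


# Example run against the local temp directory; point it at an NFS mount
# (a hypothetical path such as /mnt/nas) to measure the NAS instead.
print(f"{sequential_write_mb_per_s(tempfile.gettempdir()):.1f} MB/s")
```

A single run says little on its own; repeating the measurement at intervals gives a trend against which any change in settings can be judged.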

Cluster node bottlenecks

As individual NFS storage systems were increasingly used for important workloads, problems appeared with the traditional scale-up NAS approach, which led organisations to deploy dozens of NAS systems. To overcome these problems, suppliers created scale-out NAS systems. While these solved the capacity scaling problem, they created an even greater metadata performance problem and added the new problem of cluster node bottlenecks. As a scale-out NAS system adds more nodes, inter-node cluster communication grows much faster than the node count, because every additional node brings more communication with the existing ones. Any networking issue between these nodes can easily lead to poor system response time.
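One way to see why inter-node traffic outpaces node count is to count the possible node-to-node paths. The sketch below assumes a full mesh, in which every node may need to talk to every other node; that topology is an assumption for illustration, and real clusters may communicate differently. Under it, the number of node pairs is n(n-1)/2, which grows with the square of the node count.

```python
def node_pairs(n):
    """Return the number of distinct node-to-node pairs in a full mesh of n nodes."""
    return n * (n - 1) // 2


for n in (4, 8, 16, 32):
    print(f"{n:>2} nodes -> {node_pairs(n):>3} node pairs")
```

Doubling the cluster from 16 to 32 nodes roughly quadruples the number of pairs, so a network fault or slow link has many more paths on which to surface.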

