Enterprise NAS performance challenges and how to resolve them

NAS can be used for mission-critical workloads, but NFS-based storage comes with performance challenges. Jim Bahn of Virtual Instruments outlines the main ones.

“Business-critical” and “network-attached storage” (NAS) are terms rarely seen together.

Storage area network (SAN)-based storage built on Fibre Channel infrastructure has dominated enterprise datacentres for nearly two decades. This is chiefly due to its performance and reliability for the most mission-critical workloads.

Meanwhile, VMware, NetApp, Dell EMC, Microsoft and others are all proponents of using network file system (NFS)-based storage systems to support mission-critical workloads such as databases, virtual desktop infrastructure (VDI) and other performance-sensitive applications.

NAS users, however, face issues specific to file access via NFS that hinder its take-up in some critical use cases. These include:

Metadata bottlenecks

Metadata is a particular concern for NFS. Like any file system, NFS uses metadata to track information about each file it stores. This can be as basic as creation and modification dates, or as detailed as access permissions and the serial number of the device that created the data. In a production NFS environment, metadata can account for as much as 90% of the input/output (I/O) command stream. In a scale-out NAS environment there is a further concern: access to any file has to be handled via metadata commands, which can create communication bottlenecks between nodes in the cluster. Detecting metadata issues early is critical to scaling NAS to its full capabilities.
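To make the metadata share concrete, the minimal sketch below counts how much of an operation stream is metadata. It assumes the stream is available as a list of NFS version 3 procedure names and, as a simplification, treats only READ and WRITE as data operations. The function name and the sample stream are hypothetical, not taken from any monitoring product.

```python
from collections import Counter

# NFSv3 procedures that move file data. For this sketch, every other
# procedure (GETATTR, LOOKUP, ACCESS and so on) is counted as metadata.
DATA_OPS = {"READ", "WRITE"}


def metadata_share(ops):
    """Return the fraction of operations in `ops` that are metadata operations."""
    counts = Counter(op.upper() for op in ops)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    data = sum(counts[op] for op in DATA_OPS)
    return (total - data) / total


# Made-up sample of an operation stream, for illustration only.
sample = ["GETATTR", "LOOKUP", "ACCESS", "READ", "GETATTR",
          "LOOKUP", "WRITE", "GETATTR", "ACCESS", "READDIRPLUS"]

print(f"metadata share: {metadata_share(sample):.0%}")
```

Tracking this ratio over time, rather than at a single point, is what would show whether metadata traffic is growing ahead of data traffic.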

Rogue clients and noisy neighbour issues

A common performance complaint involves file locks left hanging in network lock managers, which can cause applications to slow down or stop. It’s vital to watch for evidence of rogue clients, which can hold a critical file lock and block other application processes from executing. Another area of contention is the noisy neighbour problem. In the storage context, a “noisy neighbour” is a virtual machine (VM) that periodically monopolises storage I/O resources to the detriment of other VMs in the environment. This phenomenon can become more pervasive as VM density per host increases. Unless it is properly monitored, predictable performance is impossible.
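One simple way to spot a noisy neighbour is to look at each client's share of the total I/O in a measurement interval. The sketch below assumes per-client operation counts are already being collected. The VM names, the counts and the 50% threshold are all illustrative assumptions.

```python
def noisy_clients(ops_by_client, share_threshold=0.5):
    """Return clients whose share of total operations exceeds the threshold."""
    total = sum(ops_by_client.values())
    if total == 0:
        return []
    return sorted(
        client
        for client, ops in ops_by_client.items()
        if ops / total > share_threshold
    )


# Made-up operation counts for one measurement interval.
interval = {"vm-01": 120, "vm-02": 95, "vm-03": 4100, "vm-04": 110}

print(noisy_clients(interval))
```

A fixed threshold is the crudest possible rule. In practice the comparison would more likely be made against each client's own normal behaviour.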

Server/VM latency issues

Poor storage performance is generally the result of high I/O latency. It’s important to monitor latency from the VM to the file system for each host-to-file flow. Keep in mind that application I/O over the protocol is bursty, so users who report on five-minute averages are almost always going to miss important events. Problems are often caused by poor load balancing, an issue that careful monitoring can help prevent.
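The effect of averaging can be shown with a small worked example. The sketch below uses made-up data: 300 one-second latency samples, with a 10-second burst at 80 ms in an otherwise steady 2 ms workload. The 20 ms limit is an assumed service-level figure, not one from any supplier.

```python
from statistics import mean


def summarise(latencies_ms, limit_ms):
    """Summarise one window of per-second latency samples."""
    return {
        "average_ms": mean(latencies_ms),
        "peak_ms": max(latencies_ms),
        "seconds_over_limit": sum(1 for v in latencies_ms if v > limit_ms),
    }


# Five minutes of one-second samples: steady 2 ms with a 10-second burst.
samples = [2.0] * 300
samples[120:130] = [80.0] * 10

print(summarise(samples, limit_ms=20.0))
```

Here the five-minute average comes out at 4.6 ms, which looks healthy, while the per-second view shows ten seconds at 80 ms.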

Poor write performance

Writing large, sequential files over an NFS-mounted file system can cause a severe decrease in the file transfer rate to the NFS server. The NFS version 3 COMMIT remote procedure call (RPC) does allow for reliable asynchronous writes, with the potential, though uncommon, cost of data loss. This means constant monitoring is critical. If monitoring shows that write performance falls short of expectations, application programmers can adjust several settings, such as write datasync action, write datasync reply, write filesync action, write filesync reply, write transfer max, write transfer multiple, write transfer preferred, write unstable action and write unstable reply.
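The idea behind asynchronous writes with COMMIT can be sketched with a toy in-memory model. In NFS version 3, replies to unstable writes and to COMMIT carry a write verifier that changes when the server restarts, so a client that sees a different verifier knows it must resend its uncommitted data. The class and function below are hypothetical and do not talk to a real NFS server.

```python
class ToyServer:
    """In-memory stand-in for an NFS server (illustration only)."""

    def __init__(self):
        self.verifier = 1  # changes every time the server restarts
        self.cache = {}    # unstable writes, lost on restart
        self.disk = {}     # committed data

    def write_unstable(self, offset, data):
        self.cache[offset] = data
        return self.verifier

    def commit(self):
        self.disk.update(self.cache)
        self.cache.clear()
        return self.verifier

    def reboot(self):
        self.cache.clear()
        self.verifier += 1


def write_file(server, chunks, reboot_before_commit=False):
    """Write chunks unstably, then commit; resend if the verifier changed."""
    if not chunks:
        return 0
    attempts = 0
    while True:
        attempts += 1
        seen = {server.write_unstable(off, data) for off, data in chunks.items()}
        if reboot_before_commit:
            server.reboot()  # simulate a crash between the writes and the commit
            reboot_before_commit = False
        if seen == {server.commit()}:
            return attempts


server = ToyServer()
print(write_file(server, {0: b"abcd", 4: b"efgh"}, reboot_before_commit=True))
```

The simulated restart forces a second pass, which is the cost that makes unstable writes safe in this model: the client has to keep the data until the commit succeeds.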

Cluster node bottlenecks

As individual NFS storage systems were increasingly used for important workloads, problems appeared with the traditional scale-up NAS approach, and this led organisations to deploy dozens of NAS systems. To overcome these problems, suppliers created scale-out NAS systems. While these solved the capacity scaling problem, they created an even greater metadata performance problem, and added the new problem of cluster node bottlenecks. As a scale-out NAS system adds more nodes, inter-node cluster communication increases far faster than the node count. Any networking issue between these nodes can easily lead to poor system response time.
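As a rough illustration of how inter-node traffic can outpace node count, the sketch below assumes a full mesh in which every node communicates with every other node. That is an assumption made for the arithmetic: real scale-out NAS products may use different topologies. Under it, the number of node pairs is n(n-1)/2, which grows quadratically.

```python
def node_pairs(nodes):
    """Number of distinct node-to-node pairs in a full mesh of `nodes` nodes."""
    if nodes < 0:
        raise ValueError("node count cannot be negative")
    return nodes * (nodes - 1) // 2


for nodes in (4, 8, 16, 32):
    print(f"{nodes:>2} nodes -> {node_pairs(nodes):>3} node pairs")
```

Doubling the cluster from 16 to 32 nodes takes the pair count from 120 to 496, roughly four times as many paths that can become a bottleneck.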

So, how do customers address these issues? Currently, users tend to start out with NAS supplier tools, but they frequently find they need knowledge of more than just the NAS device to solve the problem. That’s because virtualisation, server, networking and physical layer issues can all contribute to latency.

Users also often resign themselves to using complicated protocol analysers, such as Wireshark, and sending packet captures (PCAPs) to supplier support teams. This wastes resources and offers no predictable time to resolution.

One alternative is to just buy more NAS hardware in an attempt to compensate for the problem. Buying more capacity is certainly popular with suppliers, but it’s expensive and does not always address the root cause of the problem.

A better approach is to monitor and analyse real-time I/O data from the VM on the server, in transit via the network, and within the file system and volume of the NAS device. This approach provides complete visibility, and gives the user the chance to see performance issues first-hand.

The visibility required to develop and deliver performance-based SLAs cannot be provided by individual device-specific data or polling-based averages. It must be based on a comprehensive view of the end-to-end system and granular measurement of actual infrastructure response times, measuring every transaction or request in real time and on the wire. This enables the user to establish accurate baselines and measure changes over time, based on the actual workload and associated response times.
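A baseline comparison of this kind can be sketched in a few lines. The example below assumes per-request response times are available for a reference period and for the current period, and compares their 95th percentiles using the nearest-rank method. The sample values, the percentile and the 1.5x tolerance are illustrative assumptions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def exceeds_baseline(baseline_ms, current_ms, pct=95, tolerance=1.5):
    """Return (breached, baseline value, current value) for the given percentile."""
    base = percentile(baseline_ms, pct)
    now = percentile(current_ms, pct)
    return now > tolerance * base, base, now


# Made-up response times in milliseconds.
baseline = [1.8, 2.0, 2.1, 2.2, 2.0, 1.9, 2.3, 2.1, 2.0, 2.4]
current = [2.0, 2.2, 6.5, 2.1, 7.0, 2.3, 2.0, 6.8, 2.2, 2.1]

print(exceeds_baseline(baseline, current))
```

Comparing high percentiles rather than averages keeps the occasional slow request visible, which matters when an SLA is defined on response time.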

Resolution of business-critical NAS issues is all about real-time, line-speed monitoring and analysis. If users can see all the elements in the scaled-out system, along with their key performance and utilisation metrics, it’s easier to become proactive in managing the resources. An additional benefit is the ability to avoid over-provisioning capacity and to scale or tier storage according to workload performance requirements.

Jim Bahn is senior director of product marketing at Virtual Instruments.
