When customers invest in storage technologies such as all-flash arrays, from the likes of Violin Memory or Pure Storage, they expect to see an immediate improvement in operational speed to justify their outlay.
Most new storage technologies are designed to deliver an immediate improvement in performance, for example by means of faster drives or a bigger cache – but this does not always happen.
When speaking to customers or troubleshooting SANs, we find that most unexpected slowdowns occur after new technologies are deployed. They are often caused by misconfigurations at different points in the infrastructure, from the virtual or physical host through to the storage.
For this reason, the list of customers that have invested in faster storage only to find application performance hasn’t improved is very long.
When we look at what causes poor performance, there are several pain points that come up on a regular basis. These include:
Storage array configuration
Frequently, for example, a customer-facing database or other application turns out to be far more popular than expected. One customer recently expected 60,000 users on a new application and ended up with 3 million, which completely overwhelmed the storage, network and hosts.
Often the initial design is sound and as well architected as it could be at the time, but once the application is under load the array can’t always handle it. Things also change, and expecting a storage array to perform against all workloads for three to five years is fairly optimistic.
So it becomes critical to measure the components that make up the I/O stack, and to do so at the right level of granularity. I/O performance has to be examined at millisecond intervals, yet we often see measurements taken at much longer intervals.
A problem may not be highlighted if you’re not looking at every I/O in real time. It’s a common mistake to equate historical data from polling with real-time data. Also, most array suppliers only keep 24 hours of data, so it may not be possible to identify the problem and spot trends before issues arise.
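To illustrate why polled averages are not a substitute for per-I/O measurement, here is a minimal Python sketch. The trace, the window size and the `summarize` helper are all hypothetical, chosen only to show the effect; they do not represent any particular array's or monitoring tool's behaviour.

```python
from statistics import mean

def summarize(latencies_ms, window):
    """Average per-I/O latencies over fixed-size windows, as a poller would."""
    return [
        mean(latencies_ms[i:i + window])
        for i in range(0, len(latencies_ms), window)
    ]

# Hypothetical trace: 1,000 I/Os at 1 ms each, with a burst of 10 I/Os at 200 ms.
trace = [1.0] * 1000
trace[500:510] = [200.0] * 10

# Per-I/O view: the worst I/O the application actually waited on.
per_io_worst = max(trace)

# Polled view: one averaged sample covering the whole trace.
polled = summarize(trace, window=1000)

print(f"worst single I/O: {per_io_worst} ms")
print(f"polled average:   {polled[0]:.2f} ms")
```

In this made-up trace the averaged sample comes out just under 3 ms, while ten I/Os each took 200 ms. A dashboard built on the averaged figure would look healthy even though the application stalled.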
When we move to the second part of the stack – to Fibre Channel switches – there are often issues with switch performance that have little to do with the supplier. Brocade and Cisco make great SAN switches but, just like the array, they are only one device in the stack.
Some believe they can get all the performance information they need right out of a SAN switch. But, unfortunately, that’s not the case. If I can see how busy the freeway is (throughput) I still don’t necessarily see how long it’s going to take me to get home (latency). And what does my family care about? When I get home.
I would argue users running applications on storage infrastructure are looking for the same thing. Latency is what we need to know about. Throughput is worth tracking, but it matters less to the user. And it’s clear from the customer feedback we get that measuring throughput at the switch level doesn’t give a good indication of what the I/O experience is like.
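The freeway analogy can be put in numbers with Little's Law, which relates the average number of I/Os in flight to throughput and latency. The sketch below is illustrative only: the function name and the workload figures are assumptions, not measurements from any real switch.

```python
def mean_latency_ms(outstanding_ios, iops):
    """Little's Law: outstanding = throughput x latency, solved for latency.

    outstanding_ios: average number of I/Os in flight
    iops: completed I/Os per second (what a throughput counter reflects)
    """
    return outstanding_ios * 1000.0 / iops

# Two hypothetical workloads that look identical on a throughput graph.
healthy = mean_latency_ms(outstanding_ios=10, iops=10_000)
congested = mean_latency_ms(outstanding_ios=200, iops=10_000)

print(f"healthy:   {healthy} ms per I/O")
print(f"congested: {congested} ms per I/O")
```

Both workloads move 10,000 I/Os per second, yet under these assumptions the second one keeps each I/O waiting 20 times longer. A throughput counter alone cannot tell the two apart.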
Physical layer issues
Bad connections can often result in re-issuing commands that lead to a flood of communications. That slows down databases and eradicates the benefits of flash storage. It doesn’t matter how much flash storage you buy, if your physical layer is not intact and healthy, you will not take full advantage of the investment made.
Another aspect that can cause real slowdowns is queue depth, and a key reason for this is that it is often set by the server team and not the storage team. Unfortunately, the larger your environment gets, the more difficult it is to manage this issue. One server manager may change queue depth to increase their own performance and this can impact other users (or servers) sharing that path.
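As a rough sketch of how one server's change can affect everyone on a shared path, the Python below adds up the worst-case number of outstanding I/Os that a set of hosts could send to one array port. The host names, queue depths, LUN counts and the port limit are all hypothetical; real limits vary by array and should be taken from the supplier's documentation.

```python
def worst_case_outstanding(hosts):
    """Upper bound on I/Os in flight if every host fills every LUN queue."""
    return sum(h["queue_depth"] * h["luns"] for h in hosts)

# Hypothetical hosts sharing one array port.
hosts = [
    {"name": "db01", "queue_depth": 32, "luns": 8},
    {"name": "app01", "queue_depth": 32, "luns": 8},
    {"name": "app02", "queue_depth": 32, "luns": 8},
]

# Assumed port limit, for illustration only.
PORT_QUEUE_LIMIT = 1024

before = worst_case_outstanding(hosts)

# One server team raises its own queue depth to chase performance.
hosts[0]["queue_depth"] = 128
after = worst_case_outstanding(hosts)

print(f"before: {before} of {PORT_QUEUE_LIMIT}")
print(f"after:  {after} of {PORT_QUEUE_LIMIT}")
```

With these assumed figures the port goes from comfortably within its limit to oversubscribed after a single change on one host, which is why queue depth needs to be agreed between the server and storage teams.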
Read and write sizes need to match all the way down the stack, from the application or database block size through to the storage. If the stack is not tuned for the right block size, performance issues can result. How much this matters will be highly dependent on the type of application that is running. Clearly you can mitigate this issue through faster disk, but even so, there is a lot to be said for getting those settings right.
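A small sketch of the arithmetic behind block size mismatches follows. It counts how many storage blocks a single application I/O touches, given its offset and size. The function name and the sizes used are assumptions chosen for illustration, not settings from any specific database or array.

```python
def blocks_touched(offset, size, block_size):
    """Number of storage blocks covered by one I/O of `size` bytes at `offset`."""
    first = offset // block_size
    last = (offset + size - 1) // block_size
    return last - first + 1

KIB = 1024

# Hypothetical 8 KiB database page on 4 KiB storage blocks, aligned.
aligned = blocks_touched(offset=0, size=8 * KIB, block_size=4 * KIB)

# Same page shifted by 512 bytes: it now straddles an extra block.
misaligned = blocks_touched(offset=512, size=8 * KIB, block_size=4 * KIB)

# Same page on 64 KiB storage blocks: one block, but far more data than asked for.
oversized = blocks_touched(offset=0, size=8 * KIB, block_size=64 * KIB)
bytes_moved = oversized * 64 * KIB

print(aligned, misaligned, oversized, bytes_moved // (8 * KIB))
```

In this sketch the misaligned page touches three blocks instead of two, and the oversized block moves eight times the data the database requested. Either way the storage does more work per application I/O than it needs to.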
Even in a virtualised environment, physical servers still have finite CPU. Customers often ask a lot of their physical devices without adding CPU and memory. More often than not the VMware administrator will allocate too much CPU or not enough. This can impact applications no matter what flash drives are in place. For flash to make a difference, there also need to be enough physical servers and CPU.
Most performance problems are not down to storage
When applications are not operating to expectations, any number of misconfigurations could be at fault. We estimate through our experience of end-to-end monitoring that 75% to 85% of all issues are not the result of storage array problems, but of something else in the stack. And the more layers in the stack, and the more densely you virtualise, the worse the issue gets.
The way to pinpoint these issues is through real-time monitoring of the whole IT infrastructure and proactive performance management, so that any mismatches can be identified before they become real glitches. This especially applies to larger datacentres operating hundreds or even thousands of servers, where identifying a problem can be like trying to find a needle in a haystack.
Customers who rely on their IT infrastructure to support mission-critical activities often find, to their cost, that it’s wise to be in control and have a detailed view of the whole IT infrastructure and its I/O performance before introducing new technologies.
Before introducing flash, it’s important to audit the IT infrastructure from end to end – virtual machines, servers, storage fabric, storage arrays and LUNs – to pinpoint performance issues. Such an end-to-end view can allow the customer to invest wisely in the right place rather than throw money at the problem without a good result.
Alex D’Anna is solutions consultant at Virtual Instruments