SAN troubleshooting: What are the key steps and best practices when troubleshooting a SAN?
When troubleshooting SAN performance issues, the best way to find problems is to take a methodical, end-to-end approach.
SAN performance issues can be the result of many different causes, so collecting and analysing data from all components -- the host(s), SAN switches and storage array(s) -- is a good base to work from.
Monitoring tools should be used routinely, not only during actual SAN troubleshooting, with counters sampled at as fine a granularity as is feasible. These include host-based tools (e.g., Perfmon for Windows-based systems), array-based tools, and SAN switch tools or scripts.
It’s a great help to have pre-problem baseline statistics for comparison, and establishing a process for collecting the relevant metrics as part of your day-to-day operations is well worth thinking about.
Gathering data over an extended period of time allows for an understanding of workload trends and can give an indication of event-influenced problems (e.g., backup jobs kicking off, batch scripts executing, manually triggered data copying, etc.). The greater the granularity of performance data captured, the more accurate a picture of the performance peaks you’ll get. Looking at this from another viewpoint, the longer the sampling interval, the more “peak” data is averaged out and the more the true picture of what is going on is diluted.
As to what metrics to collect, the more the better, but the absolutely essential data points to collect for SAN troubleshooting are:
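The averaging effect can be illustrated with a minimal sketch. The latency figures, the 5-second sample interval and the `downsample` helper are all hypothetical, chosen only to show how a short burst disappears as the interval grows.

```python
from statistics import mean

def downsample(samples, factor):
    """Average consecutive groups of `factor` samples, as a coarser interval would."""
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]

# Hypothetical response times in ms, one sample every 5 seconds for one minute.
# A short burst (for example, a backup job kicking off) pushes latency to about 60 ms.
latency_5s = [4, 5, 4, 6, 60, 58, 5, 4, 5, 6, 4, 5]

latency_30s = downsample(latency_5s, 6)    # one value per 30 seconds
latency_60s = downsample(latency_5s, 12)   # one value per minute

print("peak at 5 s interval: ", max(latency_5s))
print("peak at 30 s interval:", round(max(latency_30s), 1))
print("peak at 60 s interval:", round(max(latency_60s), 1))
```

With these made-up numbers the 60 ms burst shrinks to roughly 22.8 ms at a 30-second interval and to roughly 13.8 ms at a one-minute interval, which would slip under a 15 ms response-time threshold entirely.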
Host and array
- Response times, in milliseconds: Anything above 15 ms should be considered for further investigation.
- Average queue length: This should be less than the number of spindles that make up the volume. For example, RAID 1 = 2 disks; RAID 5 (4+1) = 5 disks. Anything higher should be investigated.
- Utilisation percentage of LUNs: This will indicate the hardest-working spindles and help with locating problems.
- IOPS (read, write and combined read/write): This will indicate the I/Os per second being serviced by the storage array.
SAN switches
- CRC (cyclic redundancy check) errors: A high number of CRC errors can indicate a problem with GBIC/SFP connectors or with the physical cabling for a given port.
- Port utilisation (MBps): This will indicate the workload of a given port. Examining port utilisation can help you understand throughput and identify whether that port is a bottleneck.
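The thresholds in the list above can be turned into a simple check. The following is a minimal sketch, not a vendor tool: the function names, volume and port names and sample readings are hypothetical, and treating any growth in the CRC counter as worth a look is an assumption, since the text only says "a high number".

```python
RESPONSE_TIME_LIMIT_MS = 15   # threshold quoted in the list above
CRC_ERROR_LIMIT = 0           # assumption: any growth in the CRC counter is flagged

def check_volume(name, response_ms, avg_queue_length, spindles):
    """Return a list of findings for one host or array volume."""
    findings = []
    if response_ms > RESPONSE_TIME_LIMIT_MS:
        findings.append(
            f"{name}: response time {response_ms} ms is above {RESPONSE_TIME_LIMIT_MS} ms"
        )
    # A queue length higher than the spindle count is flagged,
    # e.g. RAID 5 (4+1) has 5 spindles.
    if avg_queue_length > spindles:
        findings.append(
            f"{name}: average queue length {avg_queue_length} is above {spindles} spindles"
        )
    return findings

def check_port(name, crc_before, crc_after):
    """Flag a port whose CRC error counter grew between two readings."""
    new_errors = crc_after - crc_before
    if new_errors > CRC_ERROR_LIMIT:
        return [f"{name}: {new_errors} new CRC errors, check GBIC/SFP and cabling"]
    return []

# Hypothetical readings for two volumes and one switch port.
findings = (
    check_volume("lun_raid5", response_ms=22, avg_queue_length=7, spindles=5)
    + check_volume("lun_raid1", response_ms=6, avg_queue_length=1, spindles=2)
    + check_port("port_3", crc_before=120, crc_after=184)
)
for line in findings:
    print(line)
```

Comparing two readings of the CRC counter, rather than looking at its absolute value, separates errors that are still occurring from ones left over from an old, already-fixed fault.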
By examining these metrics and, if possible, comparing them with baseline statistics, you can trace most SAN performance problems to a particular SAN component and take steps to rectify the issue.
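A baseline comparison of this kind might look like the sketch below. The metric names, the sample values and the 25% tolerance are assumptions for illustration; a suitable tolerance depends on how much the workload normally varies.

```python
def compare_to_baseline(baseline, current, tolerance=0.25):
    """Return metrics whose relative change from baseline exceeds `tolerance`.

    The result maps each flagged metric to its fractional change,
    so 1.5 means the current value is 150% above the baseline.
    """
    deviations = {}
    for metric, base_value in baseline.items():
        if metric not in current or base_value == 0:
            continue  # nothing to compare against
        change = (current[metric] - base_value) / base_value
        if abs(change) > tolerance:
            deviations[metric] = change
    return deviations

# Hypothetical figures: a pre-problem baseline and readings taken during the incident.
baseline = {"response_ms": 8, "iops": 4000, "port_mbps": 200}
current = {"response_ms": 20, "iops": 4100, "port_mbps": 390}

deviations = compare_to_baseline(baseline, current)
for metric, change in deviations.items():
    print(f"{metric}: {change:+.0%} against baseline")
```

In this made-up case IOPS has barely moved while response time and port throughput have both risen sharply, which would point the investigation at the port or fabric rather than at a change in the number of requests.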
This was first published in March 2011