Answer

SAN troubleshooting: What are the key steps and best practices when troubleshooting a SAN?

Learn which tools you need to perform SAN troubleshooting, which metrics point to a sign of trouble, and why granular and regular data collection is important.

Allaster Finke

Published: 07 Mar 2011

When troubleshooting SAN performance issues, the best way to find problems is to take a methodical, end-to-end approach.

SAN performance issues can have many different causes, so collecting and analysing data from all components -- the host(s), SAN switches and storage array(s) -- is a good base to work from.

Monitoring tools should be used as a matter of course as well as during actual SAN troubleshooting, with counters sampled at as fine an interval as is feasible. Monitoring tools include host-based tools (e.g., Perfmon for Windows-based systems), array-based tools and SAN switch tools or scripts.

It’s a great help to have pre-problem baseline statistics for comparison, and establishing a process for collecting the relevant metrics as part of your day-to-day operations is well worth thinking about.

Gathering data over an extended period of time allows for an understanding of workload trends and can give an indication of event-influenced problems (e.g., backup jobs kicking off, batch scripts executing, manually triggered data copying, etc.). The greater the granularity of performance data captured, the more accurate a picture of the performance peaks you’ll get. Looking at this from another viewpoint, the longer the collection interval, the more “peak” data is averaged out and the more the true picture of what is going on is diluted.
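The averaging effect described above can be shown with a few lines of Python. This is only a sketch: the latency figures are invented for illustration, and the `downsample` helper is a hypothetical stand-in for what a monitoring tool does when it reports one average per longer interval.

```python
from statistics import mean


def downsample(samples, factor):
    """Average consecutive groups of `factor` samples, mimicking a longer collection interval."""
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]


# Hypothetical response times in ms, one sample every 5 seconds for one minute.
latency_ms = [4, 5, 4, 6, 5, 48, 52, 5, 4, 5, 6, 4]

fine_peak = max(latency_ms)                          # peak at 5-second granularity
twenty_second_peak = max(downsample(latency_ms, 4))  # peak at 20-second granularity
one_minute_peak = max(downsample(latency_ms, 12))    # peak at 60-second granularity

print(fine_peak, twenty_second_peak, round(one_minute_peak, 2))
```

With these made-up numbers, the 5-second samples show a spike of 52 ms, while the single 60-second average comes out at roughly 12.3 ms, which would slip under a 15 ms response-time threshold and hide the spike entirely.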

As for which metrics to collect, the more the better, but the absolutely essential data points to collect for SAN troubleshooting are:

Host and array

Response times, in milliseconds: Anything above 15 ms should be considered for further investigation.
Average queue length: This should be less than the number of spindles that make up the volume; for example, 2 disks for RAID 1 or 5 disks for RAID 5 4+1. Anything higher should be investigated.
Utilisation percentage of LUNs: This will indicate the hardest-working spindles and help with locating problems.
IOPS (read; write; read/write): This will indicate the I/Os per second being serviced by the storage array.
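The two rules of thumb in the list above (response time above 15 ms, queue length not below the spindle count) could be applied to collected samples along these lines. This is a minimal sketch, not a feature of any particular monitoring tool: the `VolumeSample` structure, the volume names and the sample values are all hypothetical.

```python
from dataclasses import dataclass

RESPONSE_TIME_LIMIT_MS = 15.0  # threshold quoted in the text


@dataclass
class VolumeSample:
    name: str                # hypothetical volume label
    response_ms: float       # average response time over the interval
    avg_queue_length: float  # average queue length over the interval
    spindles: int            # disks in the volume, e.g. 5 for RAID 5 4+1


def findings(sample):
    """Return the reasons, if any, why this sample deserves a closer look."""
    issues = []
    if sample.response_ms > RESPONSE_TIME_LIMIT_MS:
        issues.append("response time above 15 ms")
    # The text says the queue length should be less than the spindle count.
    if sample.avg_queue_length >= sample.spindles:
        issues.append("queue length not below spindle count")
    return issues


samples = [
    VolumeSample("vol_a", response_ms=6.2, avg_queue_length=1.1, spindles=2),
    VolumeSample("vol_b", response_ms=22.5, avg_queue_length=7.4, spindles=5),
]
report = {s.name: findings(s) for s in samples}
print(report)
```

A flagged volume is a prompt for investigation rather than a diagnosis; the utilisation and IOPS figures for the same LUNs help show whether the workload itself has changed.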

SAN switch

CRC (cyclic redundancy checking) errors: A high number of CRC errors can indicate a problem with GBIC/SFP connectors or problems with the physical cabling for a given port.
Port utilisation (MBps): This will indicate the workload of a given port. Examining port utilisation helps you understand throughput and identify whether the port is a bottleneck.
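For the switch metrics above, a script might compare two polls of the per-port counters. The sketch below assumes the CRC counters are cumulative, that the port names and figures are invented, and that the link capacity and the 80% "busy" threshold are illustrative choices rather than values from any vendor.

```python
def crc_deltas(previous, current):
    """New CRC errors per port between two polls of cumulative counters."""
    deltas = {}
    for port, count in current.items():
        before = previous.get(port, 0)
        # Assumption: a count lower than the previous poll means the counter was reset.
        deltas[port] = count - before if count >= before else count
    return deltas


def busy_ports(throughput_mbps, capacity_mbps, threshold=0.8):
    """Ports whose throughput is at or above `threshold` of the assumed link capacity."""
    return [port for port, mbps in throughput_mbps.items()
            if mbps / capacity_mbps >= threshold]


# Hypothetical counters from two consecutive polls.
previous_crc = {"port1": 10, "port2": 0}
current_crc = {"port1": 10, "port2": 37}
new_errors = crc_deltas(previous_crc, current_crc)

# Hypothetical throughput in MBps against an assumed 800 MBps link.
saturated = busy_ports({"port1": 120.0, "port2": 700.0}, capacity_mbps=800.0)

print(new_errors, saturated)
```

Working from deltas rather than raw totals matters because a large CRC count that has not moved since the last poll may reflect an old, already-fixed fault, whereas a count that keeps climbing points at a live cabling or GBIC/SFP problem.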

By examining these metrics and, if possible, comparing them with baseline statistics, you can trace most SAN performance problems to a particular SAN component and take steps to rectify the issue.
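Comparing current figures with a baseline can be as simple as the following sketch. The metric names, the values and the 1.5x deviation factor are all assumptions made for illustration; a real baseline would come from your own day-to-day collection.

```python
def deviations(baseline, current, factor=1.5):
    """Metrics whose current value is at least `factor` times the baseline value."""
    flagged = {}
    for metric, base in baseline.items():
        now = current.get(metric)
        if now is None or base <= 0:
            continue  # nothing to compare against
        ratio = now / base
        if ratio >= factor:
            flagged[metric] = round(ratio, 2)
    return flagged


# Hypothetical baseline and current readings for one host, its array and a switch port.
baseline = {"host_response_ms": 6.0, "array_response_ms": 5.0, "port_mbps": 200.0}
current = {"host_response_ms": 21.0, "array_response_ms": 5.5, "port_mbps": 210.0}

flagged = deviations(baseline, current)
print(flagged)
```

In this invented case only the host-side response time has moved well away from its baseline while the array and port figures are steady, which would suggest starting the investigation at the host or its path into the fabric rather than at the array.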
