Troubleshooting WAN performance issues

How can you figure out why your WAN is not working? Read on to learn some useful approaches!

Isolating the root cause of a performance issue is critical, and having the right tools or managed services in place is imperative. Those tools can help determine whether the problem lies in the application; in the carrier network infrastructure; on a switch, router, firewall, or other network device; or whether plain old human error is to blame.

One way organizations are lightening the load of managing and troubleshooting their networks is through partnerships with third-party Managed Service Providers (MSPs). MSPs include carriers, major outsourcers, value-added resellers, systems integrators and, in some cases, even vendors. In the past year, there has been tremendous growth in the number of companies using MSPs. In 2005, 27% of organizations said they were using MSPs to help them manage their branch offices. By late 2006, that figure had increased to 46%.

In some cases, organizations completely outsource the management of branch offices; in other cases, they partner with the MSP. For example, they may have the MSP handle all implementations and Level 1 support issues while they address training and Level 2 and Level 3 support issues internally.

Still, many companies have the internal expertise to manage and troubleshoot their network using only internal resources. Companies buy and implement numerous management and monitoring tools for different applications, devices, and WAN infrastructure. The key, though, is to settle on a single Manager of Managers (MoM) to help combine and leverage the plethora of tools in any network.

The actual amount companies spend on management tools varies widely. Just looking at MoMs, companies spend an average of $US1,581 per employee to acquire them and another $US2,240 per employee to maintain them, according to the Nemertes Service Delivery and Management benchmark.

Nemertes has documented that having one MoM in place drastically reduces the mean time to repair a problem, when compared with multiple MoMs or none at all. Measured in staff time, it costs an average of $US64.90 to repair a problem when a single MoM is used, compared with $US274.25 with no MoM and $US523.08 with multiple MoMs.
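To put those figures side by side, here is a minimal Python sketch that works out how much more each scenario costs per repair than a single MoM. The three dollar amounts are the ones quoted above; the function names and the repairs_per_year input are hypothetical and exist only for illustration.

```python
# Average staff-time cost (in $US) to repair one problem, as quoted above.
COST_PER_REPAIR = {
    "single MoM": 64.90,
    "no MoM": 274.25,
    "multiple MoMs": 523.08,
}


def compare_to_single_mom(costs):
    """Return {scenario: (cost ratio, extra $US per repair)} versus a single MoM."""
    baseline = costs["single MoM"]
    return {
        name: (round(cost / baseline, 2), round(cost - baseline, 2))
        for name, cost in costs.items()
        if name != "single MoM"
    }


def extra_cost_per_year(costs, scenario, repairs_per_year):
    """Extra yearly staff cost for a scenario; repairs_per_year is an assumed input."""
    return round((costs[scenario] - costs["single MoM"]) * repairs_per_year, 2)


if __name__ == "__main__":
    for name, (ratio, extra) in compare_to_single_mom(COST_PER_REPAIR).items():
        print(f"{name}: {ratio}x the single-MoM cost, ${extra} more per repair")
```

On the quoted figures, a repair with no MoM costs roughly four times as much as with a single MoM, and a repair with multiple MoMs roughly eight times as much. The yearly total depends entirely on how many repairs an organization actually handles, which the benchmark figures above do not state.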

These tools, along with system, application, and device-specific products and carrier network-management portals, help network managers isolate and resolve problems. That makes it easier to determine whether a problem stems from the application, the carrier network, a router, or other areas.

When trying to isolate the cause of a trouble ticket, these are some must-have items on the checklist:

  • Is the server that is hosting the application healthy and performing well?
  • Is the problem isolated to a user or group at a particular location, or is it affecting all users there? If an affected user sits at an unaffected user's desktop, does the problem follow that user? (This is very rare.)
  • Are users in multiple locations -- or just one location -- affected?
  • Are other applications affected or just one?
  • If there is a local service to test (file sharing in a workgroup, etc.), is that fast or slow?
  • Does monitoring software show any WAN or LAN usage spikes that correspond to trouble times?
  • And always, the meta question: Has anything changed recently in the infrastructure between the application and the user, including the user's PC (was it re-imaged, for example)?
  • Is there a NIC malfunction on a PC in question?
  • Is spyware/malware choking a machine's performance? This tends to hit one machine but not others, so it can help isolate the cause of a performance issue.
  • Is malware or a zombie machine flooding a local network segment, WAN link, or Internet connection? A well-configured network will contain the damage as much as possible.
  • Is there a malfunction in an edge switch or an intermediate distribution switch? This can cause problems for a group, floor, or building.
  • Is the bandwidth shaper misconfigured, or is it putting too low a priority on critical traffic?
  • Is there a legitimate WAN problem? Is there congestion from legitimate use of an application? Is it time (or past time) to upgrade circuits?
  • Are security measures placing traffic in quarantine when they should not?
  • Is someone sending unusually large chunks of data when you don't have QoS set properly to deal with it?
  • Do you need bandwidth-optimization to address latency of real-time traffic?
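
The scoping questions at the top of the checklist (one user or many, one location or several, one application or all) can be roughed out in code. The following Python sketch is only an illustration: the Report record, the scope_hints function, and the wording of the hints are hypothetical, and the mapping from scope to likely cause is a loose reading of the checklist rather than a diagnostic rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """One trouble report (hypothetical record layout)."""
    user: str
    site: str
    app: str


def scope_hints(reports):
    """Suggest where to look first, based on how widely the reports are spread."""
    if not reports:
        return []
    users = {r.user for r in reports}
    sites = {r.site for r in reports}
    apps = {r.app for r in reports}

    hints = []
    if len(users) == 1:
        hints.append("single user: check the PC, its NIC, and spyware/malware")
    elif len(sites) == 1:
        hints.append("single site: check local switches and that site's WAN link")
    else:
        hints.append("multiple sites: check for WAN congestion or a shared cause")

    if len(apps) == 1:
        hints.append("single application: check the server hosting it")
    else:
        hints.append("multiple applications: suspect shared infrastructure")
    return hints


if __name__ == "__main__":
    tickets = [
        Report("ann", "branch-a", "crm"),
        Report("bob", "branch-a", "crm"),
    ]
    for hint in scope_hints(tickets):
        print(hint)
```

A sketch like this only narrows the search. The remaining items on the checklist, such as recent changes, usage spikes, shaper configuration, and quarantined traffic, still need to be checked by hand or with monitoring tools.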

About the author: Robin Gareiss is Executive Vice President and Senior Founding Partner for Nemertes Research, where she oversees research projects and direction, conducts strategic seminars, develops cost models, and advises leading enterprises, vendors, and carriers. For the past 17 years, Robin Gareiss has worked closely with hundreds of senior IT executives, analyzing their use of technology and capturing best practices. Robin is a widely recognized expert in voice over IP, convergence, collaboration, carrier services, IP networking, and branch-office technologies.
