New tools are available to assist in the diagnosis of complex network-based applications running over the LAN. These tools are much easier to use than old-style sniffers and guesswork, and they're much easier to explain to applications developers. Read this technical tip to learn about them and to decrease the time that your network staff has to invest in application diagnosis.
The network is the new backplane of a multi-server system; a single transaction often requires the close cooperation of multiple servers, databases and appliances connected to one another by network paths. Each of these devices is frequently updated with new software, and the network itself is always changing, with new paths, new switching devices and new configurations. It's therefore almost impossible to reproduce a modern multi-server, network-based production system in a test lab. Unfortunately, it's usually the subtleties of the production environment that cause problems in production applications.
Systems managers must therefore expect that programmers will appear in their operations centre. Those programmers want to trace an individual transaction through a maze of equipment and network paths, but often the tools available in the production environment provide only summary performance metrics and, possibly, some cryptic protocol traces from a few LAN segments. They must either guess at what is happening to individual transactions by looking at the summary data or spend hours trying to find and match up pieces of the transaction flow captured by protocol tracing utilities -- an often fruitless exercise, and one that requires protocol knowledge that programmers often don't have.
Dye tracing, which can follow an individual transaction through multiple servers, gives programmers a familiar diagnostic environment. When comprehensive dye-tracing facilities have been installed, programmers can watch each program call, its timings, and its parameters, as if the entire process were contained within a single server. They can watch as a synthetic (test) transaction arrives at a server, progresses through the applications software and the database calls, and then generates a response; they can watch individual customer transactions that are having problems to see where inside the applications and the servers the difficulty or bottleneck is occurring.
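The call-watching idea above can be sketched in a few lines. The following is a minimal illustration, not a real dye-tracing product: a wrapper that records each call's name, parameters and elapsed time into an in-memory log (a real tracer would stream these records to a collector). The function names `query_db` and `handle_transaction` are hypothetical stand-ins for application code.

```python
import functools
import time

TRACE_LOG = []  # a real tracer would stream records to a central collector

def traced(func):
    """Record each call's name, parameters and elapsed time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            TRACE_LOG.append({
                "call": func.__name__,
                "args": args,
                "elapsed_ms": (time.perf_counter() - start) * 1000.0,
            })
    return wrapper

@traced
def query_db(sql):
    # stand-in for a real database call
    return "1 row"

@traced
def handle_transaction(customer_id):
    # stand-in for the application logic handling one transaction
    return query_db(f"SELECT * FROM orders WHERE cust = {customer_id}")

handle_transaction(42)
for entry in TRACE_LOG:
    print(entry["call"], entry["elapsed_ms"])
```

Because inner calls finish first, the log naturally shows the transaction's progression through the software, call by call, just as the programmer would see it in a development debugger.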
Dye tracers cost money; they require software to be inserted into production servers, which adds complexity, and they impose a small performance penalty. For those reasons, some production organisations resist their use. But think of the time and effort saved when there's a problem! Programmers intuitively understand a dye tracer; it's similar to the tools they use in development. They can find a problem quickly, with much less involvement of the network operations staff and without having to capture protocol traces in the middle of a crisis -- an exercise that can itself cause problems. The development group may even be willing to share the cost of the dye-tracing system, and they may want to use it during development to tune their applications -- which will also make them familiar with it when a crisis occurs.
A dye-tracing system works by inserting a software shim between the application programs and the underlying operating system. In some dye tracers, that shim watches all of the program calls, records their response times, and may also copy some of the call parameters, such as SQL queries. To trace a process from one server to another, some dye-tracing shims insert a tracking number into inter-processor messages. The shims in the other processors use that tracking number to correlate the transaction's path across servers, then remove it from the messages before they reach the application program. The entire dye-tracing system is completely transparent to the application.
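The inject-and-strip mechanic can be illustrated with a small sketch. This is an assumption-laden toy, not any vendor's wire format: it models a message as a JSON dictionary, the field name `_dye_trace_id` is invented for illustration, and the send/receive functions stand in for the shims on either side of the network.

```python
import json
import uuid

TRACE_KEY = "_dye_trace_id"  # hypothetical field name for the tracking number

def shim_send(payload, trace_id=None):
    """Sending-side shim: inject a tracking number into the outgoing
    message. The application that built `payload` never sees it."""
    wire = dict(payload)
    wire[TRACE_KEY] = trace_id if trace_id is not None else uuid.uuid4().hex
    return json.dumps(wire).encode()

def shim_receive(raw):
    """Receiving-side shim: pull out the tracking number for correlation,
    then hand the application the original message, unchanged."""
    wire = json.loads(raw)
    trace_id = wire.pop(TRACE_KEY)
    return wire, trace_id

# One hop: the payload round-trips intact; only the shims see the number.
raw = shim_send({"op": "debit", "amount": 100}, trace_id="abc123")
payload, trace_id = shim_receive(raw)
```

Because each downstream shim reuses the same tracking number when it forwards the transaction onward, every record logged on every server can later be matched to the one originating transaction.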
Dye tracers impose a slight load on the server systems, usually only a few percent. To reduce the load further, the dye-tracing system can normally be restricted to a few servers in a load-distributed environment, and to a limited percentage of transactions.
About the author: Eric Siegel is a senior analyst at the Burton Group. He is a nationally known authority on Web performance measurement and optimisation. He has 32 years of experience in design and evaluation of large computer networks and is the author of major portions of Burton Group's original Reference Architecture. At Burton, Eric specialises in Web and network performance optimisation, SLAs, network measurement and management, and QoS.