IT failure: The highs and lows of high availability

The causes of IT outages are many and varied and, more importantly, depend on the viewpoint of the observer

IT failures are often thought to be either intrinsic to hardware or software or triggered by some natural disaster such as a flood. This is sometimes the case but the causes of outage are in fact many and varied and, more importantly, depend on the viewpoint of the observer.

The two viewpoints that matter are those of the user of the IT service and of the IT staff who run the system. A computer system can be working perfectly, yet the end user will perceive it as unavailable because the service they use is not available.

Outages – physical and logical

The simple reason is that two broad types of "outage" can affect a system or a service. One is a failure of hardware or software that makes an application or service unavailable. This is a physical outage. The other is a logical outage, where the physical system is working but the user cannot access the service properly.

There are several types of logical outage, quite apart from a degraded service, which users will often class as unavailable anyway. One type is where processing is severely constrained by an operations parameter, for example a setting such as "maxusers=100", where any users beyond that limit are not served.
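The effect of such a cap can be shown with a minimal sketch. The `Service` class and its `connect` method are hypothetical names invented for illustration; only the limit of 100 comes from the example above.

```python
class Service:
    """Toy service with an operator-set cap on concurrent users."""

    def __init__(self, max_users: int) -> None:
        self.max_users = max_users
        self.active: set[str] = set()

    def connect(self, user: str) -> bool:
        """Admit the user if there is room; otherwise refuse."""
        if user in self.active:
            return True
        if len(self.active) >= self.max_users:
            # Hardware and software are healthy, but this user is locked out.
            return False
        self.active.add(user)
        return True


service = Service(max_users=100)

# 101 users try to connect to a perfectly healthy system.
results = [service.connect(f"user{i}") for i in range(101)]
served = sum(results)
refused = len(results) - served
print(served, refused)  # 100 1
```

From the operator's console nothing has failed, yet the 101st user would reasonably report the service as down.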

Top of the class

A second type occurs where processing is done on a transaction-class basis, as it was in IBM's Information Management System (IMS). In this system, transactions were assigned a class (A, B, C, etc.) and processing regions were assigned the classes they were allowed to process.

Imagine a setup where class C transactions were in the majority and only one region was configured to process them. A queue of class C transactions would build up rapidly and eventually response times would be so bad that users would deem the class C service "down".
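The queue build-up can be illustrated with a deliberately simplified simulation. This is a toy model, not a description of how IMS actually schedules work: the arrival rates, the rule that each region handles one transaction per tick, and the `simulate` function are all assumptions made for the sketch.

```python
def simulate(arrivals_per_tick: dict[str, int],
             regions: list[set[str]],
             ticks: int) -> dict[str, int]:
    """Return the number of transactions still queued per class.

    Assumption: each region processes at most one transaction per tick,
    taken from the first of its permitted classes that has work waiting.
    """
    queued = dict.fromkeys(arrivals_per_tick, 0)
    for _ in range(ticks):
        # New work arrives for every class.
        for cls, count in arrivals_per_tick.items():
            queued[cls] += count
        # Each region serves one transaction from a class it is allowed to run.
        for allowed in regions:
            for cls in sorted(allowed):
                if queued[cls] > 0:
                    queued[cls] -= 1
                    break
    return queued


arrivals = {"A": 1, "B": 1, "C": 2}  # class C is the majority of the workload

# Four regions, but only one is allowed to process class C.
skewed = simulate(arrivals, [{"A"}, {"B"}, {"A", "B"}, {"C"}], ticks=100)

# Same four regions, with the spare one reassigned to class C.
balanced = simulate(arrivals, [{"A"}, {"B"}, {"C"}, {"C"}], ticks=100)

print(skewed)    # {'A': 0, 'B': 0, 'C': 100}
print(balanced)  # {'A': 0, 'B': 0, 'C': 0}
```

In the skewed run one region sits idle while the class C backlog grows by one transaction every tick. Total capacity is identical in both runs; only the class assignments differ.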

Leap year bugs: London Stock Exchange

Another example I came across is the London Stock Exchange outage on 1 March 2000. The cause of this was that jobs had run in the wrong order so that Job 2 on system 2 ran before Job 1 on system 1, instead of the reverse.

The resulting databases were in a mess. My suspicion is that one system recognised that 2000 was a leap year and acknowledged 29 February, and the other did not. Even when the jobs were ready to run in the right order, there was a "ramp-up" time, during which the outage continued, to get the databases back in order.
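To show how two systems could come to disagree about the date, here is a purely illustrative sketch of one well-known leap year mistake: applying the "century years are not leap years" rule while forgetting the exception for years divisible by 400. It is not a reconstruction of the Stock Exchange's software, whose actual fault is only suspected above; every function name here is hypothetical.

```python
import calendar


def is_leap_correct(year: int) -> bool:
    """Full Gregorian rule, including the divisible-by-400 exception."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def is_leap_buggy(year: int) -> bool:
    """Hypothetical bug: treats every century year as a non-leap year."""
    return year % 4 == 0 and year % 100 != 0


def day_after(year: int, month: int, day: int, is_leap) -> tuple[int, int, int]:
    """Advance one day using the supplied leap year rule."""
    lengths = [31, 29 if is_leap(year) else 28, 31, 30, 31, 30,
               31, 31, 30, 31, 30, 31]
    if day < lengths[month - 1]:
        return (year, month, day + 1)
    if month < 12:
        return (year, month + 1, 1)
    return (year + 1, 1, 1)


# The correct rule agrees with the standard library for the year 2000.
assert is_leap_correct(2000) == calendar.isleap(2000)

system_1 = day_after(2000, 2, 28, is_leap_correct)
system_2 = day_after(2000, 2, 28, is_leap_buggy)

print(system_1)  # (2000, 2, 29)
print(system_2)  # (2000, 3, 1)
```

Once the two calendars diverge, any job sequencing keyed on the date can fire in a different order on each system, even though neither machine has failed in the physical sense.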

Finally, an unusable website (and there are plenty about) is often classed as "down" by clients who, according to surveys, have a patience of less than seven seconds before going elsewhere.

In summary then, the emphasis on what is up (available) and down (not available) should focus on the service of which the system is a part and not the individual components.
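One way to make this concrete is to measure availability per request from the user's side rather than per component. The sketch below is a minimal illustration under stated assumptions: the `Probe` fields and the sample data are invented, and the seven-second threshold simply reuses the survey figure quoted above.

```python
from dataclasses import dataclass

PATIENCE_SECONDS = 7.0  # assumption: the survey figure quoted in the article


@dataclass
class Probe:
    """One synthetic user request and what happened to it (hypothetical)."""
    components_up: bool      # hardware and software all running
    request_accepted: bool   # not refused by a limit such as a user cap
    response_seconds: float  # time taken to answer, if accepted


def service_available(probe: Probe, patience: float = PATIENCE_SECONDS) -> bool:
    """A request counts as served only if the user actually got a timely answer."""
    return (probe.components_up
            and probe.request_accepted
            and probe.response_seconds <= patience)


probes = [
    Probe(True, True, 0.4),    # normal service
    Probe(True, False, 0.0),   # logical outage: refused by a limit
    Probe(True, True, 12.0),   # logical outage: too slow to be usable
    Probe(False, False, 0.0),  # physical outage
]

component_view = sum(p.components_up for p in probes) / len(probes)
service_view = sum(service_available(p) for p in probes) / len(probes)

print(component_view)  # 0.75
print(service_view)    # 0.25
```

The gap between the two figures is the set of outages that have no physical cause.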

There are outages for which there is no physical cause, and these should be considered carefully in the design, implementation and operation of High Availability services and systems.


Dr. Terry Critchley is an author and retired IT consultant living near Manchester in the UK. He is currently working on a book on Service Performance and Management.
