Learn from the worst IT mistakes

IT has become an integral part of all our lives. We depend on it to such an extent that when it goes wrong the least it will cost is an awful lot of money. At worst, IT failures can, and do, cost lives. In many cases these outcomes should have been avoidable. Ross Bentley looks at some of the more significant IT disasters and draws out the lessons for all IT directors.

No margin for error in aircraft IT
The 1994 Chinook helicopter crash on the Mull of Kintyre, which killed 29 people and which Computer Weekly is campaigning to have re-examined, is far from the only air disaster in which suspect IT systems have been implicated.

In 1988, Air France's new Airbus A320 crashed into trees at an airshow near Mulhouse in France. Three passengers - a woman and two children - were killed. Although the pilot was dismissed and stripped of his licence, he claimed he was misled as to the aircraft's true height by a bug in the software.

In 1994, when another Airbus, this time belonging to China Airlines, crashed while landing at Nagoya, Japan, killing 264 people, the inability of the pilots to read some of the system read-outs and interfaces was cited as a cause.

When Denver airport finally opened in 1995, it did so a year behind schedule. The postponement was due to problems with the automated baggage handling system - a delay that reportedly cost the city about $1m per day.

The UK's new air traffic control system finally went live at Swanwick this year, 15 years after it was conceived and six years after the first promised date. The system took longer to plan and build than it will be in operation.

Arguably the most bizarre aeroplane-related cock-up took place in 1983 when a project to equip an RAF Nimrod with computerised air-defence early-warning systems was abandoned when it was found that the computers were too heavy for the purpose-built bulbous nose section of the aircraft. About £800m of taxpayers' money was wasted.

Lesson: Ostriches are flightless birds - adopting a head-in-the-sand attitude to IT problems on aircraft renders them flightless too, and may cost lives.

We beat the Millennium Bug but now face nearly 60,000 viruses
The past few years have been pivotal in the evolution of malicious software. The different strains - be they viruses, worms, or Trojans - have blurred and amalgamated while the widespread use of e-mail has enabled these new ultra-virulent pieces of malware to spread with ease.

The innocent-sounding Melissa, a Microsoft Word macro virus, appeared in 1999, while 2000 brought us the I Love You message and its many variants. The Love Bug, as it came to be known, appeared on more than 500,000 computers worldwide and caused more than $10bn worth of damage in the US alone as systems ground to a halt.

Last year, more than 93% of large companies and government agencies detected virus attacks and we saw the arrival of more sophisticated malware in the form of the Code Red and Nimda worms. These network-aware baddies spread themselves worldwide to the point where Code Red brought down the US Internet backbone.

Anti-virus organisations are engaged in a constant war with virus writers to identify and negate the latest entrants to the network. Andre Post, a senior researcher at Symantec, says he sees between 10 and 20 new viruses each day. Add that to the 59,000 viruses already out there and you get some idea of the threat faced by today's computer systems.

Lesson: Nowhere is "safe". Try to stay one step ahead of the bad guys but never assume that you are achieving that lead.

Ambulance service breakdown highlights risk of rushing projects
In November 1992 the chief executive of the London Ambulance Service resigned after a succession of glitches in London's ambulance computer-aided dispatch system led to delays of up to three hours in ambulances reaching emergency patients.

The repair cost was estimated at £9m, and it is likely that people died because ambulances failed to arrive promptly. Virginia Bottomley, health minister at the time, was forced to announce an external inquiry into the events of Monday 26 and Tuesday 27 October, when the problems occurred.

Investigations unearthed a catalogue of errors that have been documented elsewhere and stand as a lesson to IT project managers. The system had been implemented against an impossible deadline and neither software nor hardware had been properly tuned or fully tested. Staff, both in the central control teams and ambulance crews, were not all fully trained and management had underestimated the difficulties involved in changing the deeply ingrained culture of the service.

Lesson: Without user buy-in even the best managed projects will fail. This one was not well managed and no outside help was sought.

Overdependence on automated trading contributes to Black Monday crash on Wall Street
The largest stock market fall in Wall Street history occurred on "Black Monday" - 19 October 1987 - when the Dow Jones Industrial Average plunged 508.32 points, wiping 22.6% off its total value.

That fall far surpassed the one-day loss of 12.9% that began the stock market crash of 1929 and foreshadowed the Great Depression. The Dow's 1987 fall also triggered panic selling and similar falls in stock markets around the world.

In searching for the cause of the crash, many analysts found fault with "program" trading by large institutional investing companies. This is where computers were programmed to automatically order large stock trades when certain market trends prevailed. In response, the New York Stock Exchange (NYSE) restricted some forms of program trading.

The NYSE and the Chicago Mercantile Exchange also instituted a "circuit breaker" mechanism in which trading would be halted on both exchanges for one hour if the Dow Jones average fell more than 250 points in a day, and for two hours if it fell more than 400 points.
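The rule as described reduces to a simple threshold check. The following is a minimal sketch, assuming only the two thresholds quoted above; the function name halt_hours is hypothetical and does not describe how either exchange actually implemented the mechanism.

```python
def halt_hours(points_fallen: float) -> int:
    """Return how many hours trading would pause for a one-day fall in the Dow.

    Thresholds are the ones quoted in the article: more than 250 points
    means a one-hour halt, more than 400 points means a two-hour halt.
    """
    if points_fallen > 400:
        return 2
    if points_fallen > 250:
        return 1
    return 0
```

Under this sketch, halt_hours(508.32) returns 2: a fall the size of Black Monday's would have triggered the longer halt.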

Six years later Taurus, the planned automated transaction settlement system for the London Stock Exchange, was cancelled after five years of failed development. Losses are estimated at £75m for the project and £450m to customers.

Lesson: Remember that important processes will always need to be moderated by human intelligence. Revisit automated systems regularly and remember to ask the "what if" questions.

US elections - failed IT adds to confusion
The 2000 US election, the tightest presidential contest in living memory, was prolonged in part because of the failure of technology. As the votes came in it became clear that the battle was going down to the wire. In the end, the world's attention was focused on Florida, the last state to declare a victor.

Many of the delays were caused by the inability of punch-card voting machines to read badly-punched cards.

But in Volusia county, Florida, delays in tabulating the election results were due to a mechanical failure in the automated signature verification system. As a result, election officials were forced to manually check the voter signatures on approximately 3,000 absentee ballots which came in just before the deadline.

"In today's world we rely heavily on technology and when there's a mechanical problem it really slows things down," said Elenor Lowe, voting supervisor.

Demands for manual counting across the state were met with appeals to the Supreme Court as the contest to find the most powerful man in Western politics descended into chaos. In the end, we all know that George W Bush emerged as victor. Who he will stand against next time is yet to be decided but by that time we will have to contend with e-voting.

Lesson: Contingency planning - where was the contingency planning? Plan for systems to go wrong, whether the eyes of the world are on you, or just the eyes of the board.

Patriot problems allow killer Scud in
On 25 February 1991, during the Gulf War, an American Patriot missile battery in Dhahran, Saudi Arabia, failed to track and intercept an incoming Iraqi Scud missile. The Scud struck an American army barracks, killing 28 soldiers and injuring about 100 other people.

But why had the anti-missile system failed? It turned out that the cause was an inaccurate calculation of the time due to computer arithmetic errors. With a Scud travelling at about 1,676 metres per second the slightest miscalculation will have major consequences.
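The arithmetic behind such a timing error can be illustrated with a short calculation. The sketch below rests on assumptions that go beyond this article and follow the widely cited account of the incident: the system clock counted in tenths of a second, the value 0.1 was stored in a fixed-point register that kept 23 binary digits after the point and chopped the rest, and the battery had been running for about 100 hours. The function names are hypothetical; only the Scud speed is taken from the text above.

```python
from fractions import Fraction

FRACTION_BITS = 23         # assumption: binary digits kept after the point
TICKS_PER_SECOND = 10      # assumption: clock counts tenths of a second
SCUD_SPEED_M_PER_S = 1676  # speed quoted in the article


def truncated_tenth(bits: int = FRACTION_BITS) -> Fraction:
    """Return 0.1 chopped (not rounded) to the given number of binary digits."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def clock_drift_seconds(hours_up: int, bits: int = FRACTION_BITS) -> float:
    """Accumulated clock error after hours_up hours of continuous running."""
    ticks = hours_up * 3600 * TICKS_PER_SECOND
    error_per_tick = Fraction(1, 10) - truncated_tenth(bits)
    return float(ticks * error_per_tick)


def tracking_error_metres(hours_up: int) -> float:
    """Distance a Scud travels during the accumulated clock error."""
    return clock_drift_seconds(hours_up) * SCUD_SPEED_M_PER_S


if __name__ == "__main__":
    print(f"drift after 100 hours: {clock_drift_seconds(100):.4f} s")
    print(f"distance covered by the Scud: {tracking_error_metres(100):.0f} m")
```

Under these assumptions the clock drifts by roughly a third of a second after 100 hours, in which time a Scud covers more than half a kilometre. The error on each tick is tiny; it is the length of continuous operation that turns it into a miss.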

Nine years later, in 2000, the United States had to replace hundreds of Patriot anti-missile systems stationed in the Gulf and South Korea after faults were discovered in those left on high alert.

At the time Lieutenant General Paul Kern of the US army said the glitch might have been caused by leaving the missiles on "hot status" alert for more than six months at a time.

Tests had shown that missiles kept constantly on high alert have developed problems in receiving a radio frequency downlink, which guides the missiles in flight.

General Kern said the Patriot's manufacturer, Raytheon, had guaranteed that the missiles would work properly if on high alert for a maximum of six months.

Some Patriots have been kept on high alert for years. Repair costs were estimated at between $80,000 and $100,000 per missile.

Lesson: Read the guarantee and do not expect any system to exceed guaranteed standards. The more critical the system the more important this is.

BSE test project failure leads to preventable slaughter of 100,000 cattle
In 1994 Computer Weekly reported that government scientists trying to eradicate BSE and salmonella had secretly scrapped two years' work on a £1.2m computerised testing system.

The Sample Management System was commissioned by the Central Veterinary Laboratory (CVL), a government agency which detects, tracks and combats animal diseases on behalf of the Ministry of Agriculture, Fisheries and Food (since absorbed into the Department for Environment, Food and Rural Affairs). The system was partly intended to deal with epidemics by speeding up the reporting and tracking of tests on animal organs, blood and urine. It was ordered in 1990 and abandoned late in 1992. The software supplier ACT was accused by the CVL of a breach of contract and the dispute was ended by a legal settlement which swore both sides to secrecy.

The BSE epidemic peaked in 1992, when almost 45,000 cases of BSE were reported in the UK. Since then over 100,000 cattle have been slaughtered because they were considered to be at risk from the disease. Just how many of these slaughters could have been avoided if a computerised detection system had been in place we will never know.

Lesson: Secret settlements may hide flaws in your planning or design of a system. They also prevent others learning from the mistakes you have made.

Babbage can't stop tinkering so the first computer is never finished
Even Charles Babbage, 19th-century politician, industrialist and to many the father of computing, ultimately failed in his quest to build an analytical engine.

Financed through government grants, he intended to use punched cards to control an automatic calculator. The machine was designed to employ several features subsequently used in modern computers, including sequential control, branching and looping.

Babbage worked on his analytical engine from 1830 until his death in 1871 but it was never completed. In his book, Engines of the Mind, Joel Shurkin wrote, "One of Babbage's most serious flaws was his inability to stop tinkering. No sooner would he send a drawing to the machine shop than he would find a better way to perform the task. By and large this flaw kept Babbage from ever finishing anything."

Less charitable was the Reverend Richard Sheepshanks, then secretary at the Royal Astronomical Society. He wrote, "We got nothing for our £17,000 but Mr Babbage's grumblings. We should at least have had a clever toy for our money."

Lesson: Learn when to stop and let a piece of work go. It may not - perhaps ever - be finished but you will be free to move on to the next big thing.

Government IT money wasted
A recent report in Computer Weekly showed that NHS staff fear that the extra £1bn funding ring-fenced for IT projects within the health service will be squandered unless management practices change.

Like all public service improvements, dragging the NHS into the modern age will depend heavily on IT, and we can only hope that the reformation of the NHS does not end up as the latest in a long list of government IT horror stories.

Lesson: There is no substitute for professional expertise when drawing up contracts. Political optimism is certainly no alternative.
