A combination of unexpected events led to a business continuity nightmare for the London Stock Exchange. Here its former IT director shares the lessons learnt.
Terrorist attacks and natural disasters grab the headlines, but most business continuity planning addresses the mundane systems failure that disrupts business as usual.
Although mundane, such failure can severely damage an organisation. When an IT glitch forced the London Stock Exchange to suspend trading on 5 April 2000, it could hardly have happened at a worse time, recalls former IT director, Jonathan Wittmann.
"It was the last day of the tax year, and one of the most important days in our calendar - a lot of private investors sell their shares to get their year's tax allowance, and then buy them back in the new tax year. It's a highly visible day for us," he explained.
The normal daily routine at the stock exchange is for the Sets electronic trading system to go offline when the market closes at 4.30pm, after which the end-of-day processing and overnight batch runs take place. At 4am the systems start downloading the reference data ready for when the market opens again at 8.30am.
"But from early that day, customers were calling in saying they were seeing prices from the day before going out as current prices," said Wittmann. "None of the systems had fallen over, yet rogue data was appearing. You can't run the market if the data is wrong, so we had to keep the market closed until we could find the source of the corruption. It was a major decision."
And with more than 100,000 trading screens around the world displaying the stock exchange's data, it was an extremely visible decision.
Inevitably, the closure hit the press immediately and ultimately led to questions from the Treasury Select Committee and an independent third-party review of the stock exchange's systems. But on the day itself, the priority was to discover the cause of the rogue data and correct it.
"If a server falls over, the monitoring tools spot it immediately and everyone is trained to handle it," said Wittmann. "But incorrect data can be much more difficult to track."
What caused the corruption was a series of events, each of which was perfectly valid in itself. No machine or system failed, nothing illegal happened, yet there was a real problem.
The events had generated an unusually large amount of data, causing the batch to run for an exceptionally long time. It hadn't finished by the start of the day at 4am, so some old data from the previous day was carried forward.
It took until 3.30pm for the market to reopen with the correct data, and it remained open until the middle of the evening to clear the backlog.
"Since the introduction of Sets in 1997 we had only had one other incident, and that had only lasted a few minutes," Wittmann said. "It shows that however much you spend on things like fault tolerance, a combination of events out there can bring your systems down - and you cannot predict what they might be."
What can - and must - be done is to prepare and learn.
One obvious lesson for Wittmann was to make technical improvements and review operating processes and systems so that, for example, the batch run had to complete before the start of processing for the next day.
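As a rough illustration of that kind of safeguard, the Python sketch below refuses to begin start-of-day processing unless the overnight batch has recorded a completion for the previous trading day. It is a minimal sketch, not a description of the exchange's actual systems: `BatchStatus`, `BatchNotCompleteError` and `check_start_of_day` are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class BatchStatus:
    """Hypothetical status record written by the overnight batch run."""
    trading_date: date   # the trading day whose data the batch processed
    completed: bool      # True only once every batch step has finished


class BatchNotCompleteError(RuntimeError):
    """Raised when start-of-day is attempted before the batch has finished."""


def check_start_of_day(status: Optional[BatchStatus],
                       previous_trading_date: date) -> None:
    """Raise unless the batch for the previous trading day has completed.

    Returns None when it is safe to begin start-of-day processing.
    """
    if status is None:
        raise BatchNotCompleteError("no batch status record found")
    if status.trading_date != previous_trading_date:
        raise BatchNotCompleteError(
            f"batch status is for {status.trading_date}, "
            f"expected {previous_trading_date}"
        )
    if not status.completed:
        raise BatchNotCompleteError(
            f"batch for {status.trading_date} is still running"
        )


if __name__ == "__main__":
    # Illustrative dates only: a batch that is still running blocks the start.
    running = BatchStatus(trading_date=date(2000, 4, 4), completed=False)
    try:
        check_start_of_day(running, previous_trading_date=date(2000, 4, 4))
    except BatchNotCompleteError as err:
        print(f"Start of day blocked: {err}")
```

The design choice here is to fail loudly: an overrunning batch delays the start of day rather than letting the previous day's data be carried forward silently.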
Testing the continuity plan is another essential, but many continuity plan tests are all-or-nothing exercises that usually involve the loss of a site.
"These kinds of tests are fine as far as they go, but they have a drawback - their scale. And the effort they involve means that you can only do a few of them, and they tend to test the big things. So we also devised a system of micro-tests, which are scripted in advance and designed to test a very specific incident, such as a server failing at 7.30am, or an external event such as a Tube strike.
"You need to involve everyone who would be affected by such a failure, which could be from the chief executive down. But although you warn them there will be a test on a particular day, do not tell them the time or nature of the test," Wittmann said.
Micro-tests can be as simple or as complex as desired, but the biggest benefits come from monitoring them independently and objectively and feeding the results back into the continuity planning.
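One way to picture a scripted micro-test is as a small record that separates what participants are told (the date) from what they are not (the time and the scenario), and that collects an independent observer's notes for feeding back into the plan. The Python sketch below is purely illustrative; the `MicroTest` class and its fields are assumptions made for the example, not the exchange's own tooling.

```python
import random
from dataclasses import dataclass, field
from datetime import date, datetime, time, timedelta
from typing import List, Optional


@dataclass
class MicroTest:
    """Hypothetical script for one narrowly scoped continuity test."""
    name: str
    scenario: str                 # known only to the test organisers
    test_date: date               # the one detail participants are told
    participants: List[str]
    observations: List[str] = field(default_factory=list)

    def announcement(self) -> str:
        """Text sent to participants: the date, but not the time or nature."""
        return (f"A continuity micro-test will take place on "
                f"{self.test_date.isoformat()}.")

    def pick_start(self, earliest: time, latest: time,
                   rng: Optional[random.Random] = None) -> datetime:
        """Choose an undisclosed start time within the given window."""
        start = datetime.combine(self.test_date, earliest)
        end = datetime.combine(self.test_date, latest)
        if end < start:
            raise ValueError("latest must not be before earliest")
        rng = rng or random.Random()
        window_minutes = int((end - start).total_seconds() // 60)
        return start + timedelta(minutes=rng.randint(0, window_minutes))

    def record(self, observer: str, note: str) -> None:
        """Log a note from an independent observer for later review."""
        self.observations.append(f"{observer}: {note}")


if __name__ == "__main__":
    test = MicroTest(
        name="early-morning-server-failure",
        scenario="A server fails shortly before the market opens",
        test_date=date(2024, 1, 15),   # illustrative date
        participants=["chief executive", "IT director", "operations"],
    )
    print(test.announcement())
    print("Organisers only:", test.pick_start(time(7, 0), time(9, 0)))
```

Keeping the observer's notes on the same record as the script makes it straightforward to review each test afterwards and update the continuity plan.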
Wittmann also advised IT directors to consider setting up a war room, with phone and internet access independent of internal systems, copies of system documentation, staff and supplier contact numbers, process flows, and so on.
"It can save a great deal of time and energy and enable a much more rapid response," he said. "But test out the facilities regularly. It's no use if you have to brush cobwebs aside to get in."
Some crisis management issues can seem trivial but are not.
"I was about to board my train at Sevenoaks at 7.30am when my phone rang and as part of a micro-test I was told we had a problem," said Wittmann. "I had to decide whether to board the train or not."
The correct decision was not to, not least because a phone conversation on the train about problems at the stock exchange would have been audible to other travellers. Wittmann also needed to start a conference call, and would not have been able to hear on a train. "So I went somewhere quiet, did the call, and caught the next train in."
Settling such minor issues in advance means you can focus on the big issues in a real crisis.
The biggest benefit of a continuity plan, said Wittmann, is that it buys you time. The best continuity plans are generic - telling you who needs to be involved, what facilities they will need and how they will access them - rather than telling you exactly what to do in every circumstance.
You also need to re-evaluate your plan on a routine basis, asking yourself whether what you put in place two years ago is still valid.
"But the hardest lesson of all is how to maintain vigilance when everything seems to be running smoothly - how to spot a problem before it becomes a crisis. There's no good answer, but it is the key to continuity planning."
Pointers for continuity planning
- Watch out for insidious data corruption even though all systems are functioning normally
- Consider a war room for crisis management that has independent communications and all IT documentation and contact numbers
- Prepare for adverse publicity, both within and outside the organisation, which adds to the pressure of recovering operations
- Test business continuity planning regularly on both a large and small scale
- Eternal vigilance isn't easy but it is the price of business continuity
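The first pointer, watching for corrupt data while every system reports healthy, can be approximated with a simple freshness check. The Python sketch below assumes that each published price carries a timestamp, and flags any record dated before the current trading day. `PriceRecord` and `find_stale_records` are hypothetical names, and a real feed would need rules for data that is legitimately carried over.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Iterable, List


@dataclass
class PriceRecord:
    """Hypothetical published price with the time it was generated."""
    symbol: str
    price: float
    as_of: datetime


def find_stale_records(records: Iterable[PriceRecord],
                       trading_day: date) -> List[PriceRecord]:
    """Return records whose timestamp falls before the current trading day."""
    return [r for r in records if r.as_of.date() < trading_day]


if __name__ == "__main__":
    # Illustrative symbols, prices and dates only.
    today = date(2024, 1, 15)
    feed = [
        PriceRecord("AAA", 101.5, datetime(2024, 1, 15, 8, 31)),
        PriceRecord("BBB", 47.2, datetime(2024, 1, 12, 16, 29)),
    ]
    for record in find_stale_records(feed, today):
        print(f"Stale price for {record.symbol}: dated {record.as_of}")
```

A check of this kind complements server monitoring: it looks at the content being published rather than at whether the machines are running.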