A few postings ago, I mentioned the growing number of high-profile digital catastrophes reported in the media. And I wasn’t referring to natural disasters such as fire and flood or deliberate attacks such as hacking. What I was really concerned about was the type of increasingly spectacular glitch that stems from simple human failings, such as inadequate software testing, fat-finger mistakes, bad change management or poor data quality. These are the things we generally class as “cock-ups” rather than “conspiracies”. They are the result of accidental rather than sinister actions.
One would hope, after all these years of designing and operating IT services, that we should be able to deliver services that are highly reliable. Unfortunately it’s not always the case. In recent months we’ve seen failures of supposedly bullet-proof Cloud services and extended outages of major banking services. But that’s just the tip of the iceberg. Behind every major incident are dozens of near misses, hundreds of minor incidents and thousands of bad practices.
Why is this continuing to happen? Several trends are behind it. Hardware might be a little more reliable (though not always) but systems and infrastructure are becoming increasingly complex and harder to integrate. Project deadlines are becoming shorter because of the continuous pressure from business management to move faster and faster. There’s also relentless pressure to cut costs, resulting in greater demands on resources and constantly changing supply chains. Add to this the usual elephants in the room that nobody wants to tackle, such as data quality (for which there are no standards) and intrinsically insecure legacy assets, and it’s a wonder our systems manage to stay up as much as they do.
Yet this is a world moving to Cloud Computing, where we might reasonably expect better than ‘five nines’ service availability to keep our businesses running. A major issue is that business continuity planning is difficult and expensive for users of Cloud services. They will have few, if any, alternative sources of identical services. And switching is far from easy. Try asking a Cloud service provider how to plan for a major outage and you’ll be lucky to get a sensible answer that even acknowledges the problem.
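To put ‘five nines’ in perspective, here is a rough back-of-envelope sketch of how little downtime each level of availability allows in a year. The function name and the 365-day year are my own assumptions for illustration, not part of any provider’s service level definition.

```python
# Illustrative arithmetic only: a 365-day year is assumed (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines.

    For example, nines=5 means 99.999% availability.
    (Hypothetical helper, named here for illustration.)
    """
    unavailability = 10 ** -nines  # e.g. 0.00001 for five nines
    return MINUTES_PER_YEAR * unavailability

if __name__ == "__main__":
    for n in range(2, 6):
        availability = 100 * (1 - 10 ** -n)
        minutes = allowed_downtime_minutes(n)
        print(f"{availability:.3f}% available: about {minutes:.1f} minutes down per year")
```

On these assumptions, five nines leaves a little over five minutes of downtime a year, so a single outage lasting a few hours uses up several decades’ worth of allowance.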
So what can be done? Here are a few ideas. Firstly, accept that no service is invincible: they are all vulnerable to deliberate and accidental incidents. Increasing centralisation of service delivery and a growing reliance on monoculture (use of identical components and practices) are also raising the stakes by increasing the global impact of a failure. The bigger and more widespread services are, the harder they will fall. And credits for missed service levels are no substitute for lost business and damaged reputation.
Secondly, treat outages and security events like safety incidents. Monitor the minor incidents and conduct a root cause analysis for near misses and common sources of failure. There’s no such thing as an isolated incident. Examine your own operations and dig into your service provider’s history. Many well-known service providers fall well short of customer expectations.
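As a minimal sketch of what monitoring the small stuff might look like, the snippet below tallies incidents by root cause so that recurring sources of failure stand out. The incident log is entirely made up for illustration, and the severity and cause labels are my own assumptions rather than any standard taxonomy.

```python
from collections import Counter

# Hypothetical incident log, invented purely for illustration:
# each entry is a (severity, root_cause) pair.
incidents = [
    ("minor", "change management"),
    ("near miss", "data quality"),
    ("minor", "change management"),
    ("major", "inadequate testing"),
    ("near miss", "change management"),
    ("minor", "data quality"),
]

def causes_by_frequency(log):
    """Count incidents per root cause, most frequent first."""
    return Counter(cause for _severity, cause in log).most_common()

if __name__ == "__main__":
    for cause, count in causes_by_frequency(incidents):
        print(f"{cause}: {count}")
```

The point of counting near misses alongside real outages is that the same root cause tends to show up in both, usually long before the major incident does.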
Thirdly, draw up a ‘catastrophe plan’. And I don’t just mean a disaster plan, which generally involves recovering from a fire or flood. I mean a full-blown catastrophe plan based on a “worst of the worst” complete or extended loss of service or data. It will demand imaginative thinking and preparation, for example ideas to speed up the recreation of databases from scratch, alternative sources of essential management information, and proactive plans to reassure customers that everything is being done to protect their interests.
Fourthly, make your own personal contingency plans. Make sure you can work offline. Carry a decent amount of cash. Top up your petrol tank. And keep a torch, maps and compass in your briefcase. Because, like it or not, we are entering an information age in which business and life will become increasingly volatile, and major crises will become more commonplace.