Strategy Clinic: How can you test disaster recovery plans?

Our disaster recovery plan works in theory. I would like to be sure it works in practice. We have tested parts of it but the tests bear little resemblance to how a real crisis would affect us. Can the panel advise where the greatest vulnerabilities lie in continuity planning and how to test the plan thoroughly?

Tell the management and then switch it all off

There is only one way to test a disaster recovery plan: you turn all the systems off and then turn them on again. Choose the time of lowest demand that also leaves the maximum time to unscramble things if the systems do not hum quietly back to life. Back everything up first, and have a contingency plan ready in case the recovery does not go as planned.

Inform top management in advance and get them to buy into the risk of such a trial against what would happen if it occurred "naturally", unscheduled and unannounced.

You will age five months in five hours while you are doing this, and age five years if those little lights do not flicker back on.

Robin Laidlaw, President, CW500 Club

Simulation can cause more disruption than real life

A real disaster will not be the same as any of the tests you have done. How much does that matter? How realistic does testing need to be?

Disaster recovery is part of the business's response to risk. It is an investment - often a large one - the purpose of which is to mitigate the risks. A risk analysis should be done first to establish the likelihood of an event taking place, the cost to the business of such an event and the amount and type of mitigation that is called for. All of these factors are variable, and many permutations are possible.

The cost of a disaster can then be compared with the mitigation cost. The relevance of this to your question is that the cost of even more testing may be greater than the additional protection it gives. Taking this to its logical conclusion, one could damage the business more by extensive, intrusive, simulated disaster testing, which causes more disruption than would be suffered in a real disaster.
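As a rough illustration of that comparison, the arithmetic can be sketched in a few lines of Python. The names and figures here (Risk, annual_probability, loss_reduction and so on) are assumptions made for the example, not part of any standard method, and a real risk analysis would weigh far more than one number per event.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One disaster scenario. All fields are illustrative assumptions."""
    name: str
    annual_probability: float  # estimated chance of the event in any one year
    cost_if_it_happens: float  # estimated cost to the business of one event


def expected_annual_loss(risk: Risk) -> float:
    """Likelihood multiplied by cost: the average yearly exposure."""
    return risk.annual_probability * risk.cost_if_it_happens


def mitigation_is_worthwhile(risk: Risk, mitigation_cost_per_year: float,
                             loss_reduction: float) -> bool:
    """Compare what a mitigation saves with what it costs.

    loss_reduction is the assumed fraction (0 to 1) of the expected loss
    that the mitigation, including its testing, would remove.
    """
    saving = expected_annual_loss(risk) * loss_reduction
    return saving > mitigation_cost_per_year
```

On these assumptions, a risk with a one-in-four yearly chance and a cost of 400,000 has an expected annual loss of 100,000. Extra testing that removes half of that is worth buying at 30,000 a year but not at 80,000, which is the point at which more testing costs more than the protection it gives.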

Such a calculation may lead you to conclude that your disaster recovery plans, rather than being inadequate, are actually excessive, not in absolute terms but relative to what your business needs; what mitigation is actually achievable; and how much it costs.

Most businesses do not approach disaster recovery in this way. The usual result is an inadequate level of disaster recovery preparation, rather different from the picture I have drawn here.

Roger Marshall, BCS Elite Group

Evaluate risk assessment and business impact

Key vulnerabilities in business continuity planning are the relevance of the plans to the organisation, the positioning of continuity and, as the question highlights, appropriate testing.

Many organisations believe they have covered business continuity because they have a published plan, but often these are out of date or do not have a nominated owner or method of review. Regular formal risk assessment and business-impact analysis should drive the critical elements of the plan and highlight the main areas that need testing. Where there are organisational changes, the impact should be revised and the plans updated - this should include changes in personnel.

Business continuity is often thought of as an IT issue but it should be considered from a business perspective. IT plays an important role in the provision of systems and services but plans must consider issues such as access to buildings, power supplies and external threats.

The process should involve individuals from all key business areas and highlight where interrelationships exist. A key distinction should be made between disaster recovery and longer-term business continuity, particularly with regard to responsibilities.

Testing needs to consider wider issues such as time of day and third-party reliance. Many tests are performed outside office hours to avoid disruption and may not truly reflect the working environment. Equally, if the plans are focused solely on the business and not other organisations in the supply chain, this could have a major impact on long-term recovery. Clearly a full test can be hugely disruptive, so consider testing components in isolation, but make sure it all fits together.

Richard Woods, NCC Group

Test everything, especially with end-users involved

You don't say which parts you have tested or how, so it is difficult to be specific. However, I would give the following general advice: 

  • Work on the basis that untested plans don't work, so not testing a plan because it seems difficult or costly should not be an option
  • Start with a rigorous desktop review and consider all the things that might go wrong, involving a variety of people from your business
  • The closer a test is to simulating a real-life incident the better, although this can be both expensive and unpopular. The more rehearsals there are, the greater the likelihood that people will respond correctly when the real thing happens
  • Be sure to have independent reviewers/observers with business continuity or disaster recovery experience to help with all phases of testing and design of improvements. Users may try to get away with small cheats or ask for hints in tests. It is important that no leeway is given - you really need to know what could go wrong.

Apart from a lack of rigorous testing, the other challenges, and therefore potentially the greatest vulnerabilities, are:

  • Lack of visible business sponsorship and ownership
  • Failure to agree cross-business on criticality of systems
  • Inadequate validation of required and feasible recovery timescales between IT and business units
  • Undue reliance on a plan that will not do what people assume it will because it has not been validated or tested.
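The point about recovery timescales lends itself to a simple check: list the recovery time each business unit says it needs, list what IT believes it can achieve, and flag every mismatch. The sketch below assumes both lists are kept as hours per system. The function name and the system names in the usage note are hypothetical.

```python
from typing import Dict, Optional, Tuple


def find_recovery_gaps(
    required_hours: Dict[str, float],
    feasible_hours: Dict[str, float],
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Return the systems where IT cannot meet the business requirement.

    required_hours: recovery time the business says it needs, per system.
    feasible_hours: recovery time IT believes it can achieve, per system.
    A system with no feasible figure at all is also reported as a gap,
    with None in place of the missing figure.
    """
    gaps = {}
    for system, required in required_hours.items():
        feasible = feasible_hours.get(system)
        if feasible is None or feasible > required:
            gaps[system] = (required, feasible)
    return gaps
```

For example, if the business needs order entry back in 4 hours and IT can manage 10, or nobody has estimated a figure for email at all, both systems appear in the result and become candidates for the next round of testing.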

John Butters, Partner, Ernst & Young

Take the scenario approach and carry out a dry run

IT disaster recovery and business continuity planning are very important topics at this time. Do you have responsibility for both the IT and the business aspects or just the IT elements? The tests will be more realistic if they combine elements of IT and business recovery.

Of course, one way to fully test the plan is to engineer a recovery situation of which people have no advance knowledge. However, this will affect the operational business process and be a high risk if your plans are not as effective as you hope.

Alternatively, a scenario-based approach to testing your plan would give you more confidence. Identify some scenarios that are most likely and some that would give you the most pain. You can now dry-run the probable effects and responses of these scenarios on selected areas of your business. Make sure you use someone independent of the original team to provide the challenges on whether your organisation is ready or not.
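One way to shortlist scenarios for the dry run is to score each on likelihood and on pain, then take the top few from each ranking. The sketch below assumes simple scores on a 1 to 5 scale. The scoring scale, field names and scenarios are all invented for illustration.

```python
from typing import Dict, List


def pick_scenarios(scenarios: List[Dict], n: int = 2) -> List[str]:
    """Return the names of the n most likely and n most painful scenarios.

    Each scenario is a dict with 'name', 'likelihood' and 'impact' keys.
    A scenario that appears in both rankings is listed only once.
    """
    most_likely = sorted(scenarios, key=lambda s: s["likelihood"], reverse=True)[:n]
    most_painful = sorted(scenarios, key=lambda s: s["impact"], reverse=True)[:n]

    chosen = []
    for scenario in most_likely + most_painful:
        if scenario["name"] not in chosen:
            chosen.append(scenario["name"])
    return chosen


# Hypothetical scores, 1 (low) to 5 (high)
example_scenarios = [
    {"name": "power cut", "likelihood": 4, "impact": 2},
    {"name": "building fire", "likelihood": 1, "impact": 5},
    {"name": "supplier failure", "likelihood": 3, "impact": 3},
    {"name": "ransomware", "likelihood": 2, "impact": 4},
]
```

Taking the top of both rankings rather than a single combined score keeps the rare but severe events on the list, which a likelihood-only ranking would drop.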

The results should identify potential weaknesses, which may include the plan, the roles and responsibilities, documentation and technology.

This is a good opportunity to engage with your business users and your IT suppliers. It will give a broader perspective and ultimately increase your confidence in the recovery plan.

Sharm Manwani, Henley Management College
