Once you have drawn up a detailed disaster recovery plan, the next stages in the project are twofold: to prepare and deliver disaster recovery awareness and training programmes so all employees are prepared to respond as required by the plan in an emergency, and to carry out disaster recovery testing to ensure the plan works properly and that DR teams know their roles and responsibilities.
In this article we’ll reference a specific international standard:
- ISO/IEC 27031:2010, Information technology–Security techniques–Guidelines for information and communication technology readiness for business continuity
This is the global standard for IT disaster recovery as it applies to end users. Another ISO standard, ISO/IEC 24762, addresses information and communications technology (ICT) disaster recovery from a service provider perspective. Both standards can help you develop and implement ICT disaster recovery programmes.
According to ISO 27031 Section 7.5, “A coordinated program should be implemented to ensure that processes are in place to regularly promote ICT DR awareness in general, as well as assess and enhance competency of all relevant personnel key to the successful implementation of ICT DR activities.”
Perhaps the most important strategy in raising disaster recovery awareness is to secure senior management support and funding for DR programmes. Visible, frequent endorsements from senior management will help raise awareness of the programme and increase participation in it.
The next key strategy is to engage your human resources (HR) organisation in the process. They have the expertise to help you organise and conduct awareness activities, such as department briefings and messages on employee bulletin boards. You can also encourage HR to incorporate briefings on DR as well as business continuity into new employee induction programmes.
Another important strategy is to leverage web technology. If your organisation has an intranet, launch a DR page that describes what your programme does, answers FAQs, and provides links to forms and services, schedules, and other relevant materials.
Be sure that any awareness activities are approved by management and HR, as well as your own IT management. Your messages should be informative and educational and should reinforce the company’s commitment to IT DR activities.
Here are additional activities for successful disaster recovery awareness and training programmes:
- Conduct an awareness and training needs analysis.
- Assess existing staff competencies regarding roles in DR plans.
- Establish an ongoing awareness and training programme.
- Establish record-keeping of staff training and awareness activities.
- Establish competency levels for IT staff and how they should be maintained.
- Conduct staff performance assessments post-disaster and re-evaluate training.
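The record-keeping activity in the list above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed format: the `TrainingRecord` fields, the `staff_needing_refresher` helper, the sample names and the 365-day refresher interval are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class TrainingRecord:
    """One completed training activity (hypothetical record layout)."""
    staff_member: str
    dr_role: str        # role the person holds in the DR plan
    course: str
    completed_on: date


def staff_needing_refresher(records, today, max_age_days=365):
    """Return staff whose most recent training is older than max_age_days."""
    latest = {}
    for rec in records:
        prev = latest.get(rec.staff_member)
        if prev is None or rec.completed_on > prev:
            latest[rec.staff_member] = rec.completed_on
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, done in latest.items() if done < cutoff)


# Illustrative data only.
records = [
    TrainingRecord("A. Patel", "Recovery team lead", "Failover procedures", date(2023, 1, 10)),
    TrainingRecord("A. Patel", "Recovery team lead", "Failover procedures", date(2024, 3, 5)),
    TrainingRecord("J. Smith", "Network recovery", "Emergency response", date(2022, 11, 20)),
]
overdue = staff_needing_refresher(records, today=date(2024, 6, 1))
print(overdue)
```

Keeping one record per completed activity, rather than a single "last trained" date per person, preserves the history that an audit or a post-disaster performance assessment may need.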
As part of these activities, you should develop and conduct training on:
- Technical recovery activities
- Emergency response activities, for example, situation assessment and evacuation
- Specialised recovery, such as recovering to hot sites, cold sites or third-party managed DR services
- Return-to-normal activities
- Restoration of business systems and processes
Since you will be working with a variety of vendors and specialised service providers, examine their training programmes to see if they can be leveraged into your internally developed training activities.
The most important strategy in disaster recovery testing is simply to test, test and test again. Your organisation depends on the availability of IT systems and networks, so it’s critical not only that those systems remain operational but also that they can survive an unplanned outage. Disaster recovery testing helps confirm that your efforts to provide recovery and resilience will indeed protect critical IT assets.
A key point to remember in testing is stated in ISO 27031: “In most instances, the whole set of IRBC [ICT readiness for business continuity] elements and processes, including ICT recovery, cannot be proven in one test and exercise.” Testing, therefore, needs to be a structured programme that continually addresses the entire spectrum of operational and administrative activities that an ICT organisation faces.
Based on the size and complexity of your IT infrastructure, disaster recovery testing activities should address recovery of hardware, software, data and databases, network services, data centre facilities, people (for example, relocation of staff to an alternate site), and the business. For each of these factors, critical information will be identified in the business impact analysis, or BIA.
ISO 27031 makes some key points with regard to disaster recovery testing:
“There are risks associated with tests and exercises, and such activities should not expose the organisation to an unacceptable level of risk. The test and exercise programme should define how the risk of individual exercise is addressed. Top-management sign-off on the programme should be obtained and a clear explanation of the associated risks documented.”
“The test and exercise programme objectives should be fully aligned to the wider business continuity management scope and objectives and complementary to the organisation's broader exercise programme. Each test and exercise should have both business objectives (even where there is no business involvement) and defined technical objectives to test or validate a specific element of the ICT DR strategy.”
Since there are many aspects of an IT environment to be tested, there are different kinds of tests to be initiated. This figure shows the three basic IT DR tests.
Figure: Types of IT disaster recovery tests (desktop walk-through, simulated recovery, operational exercise)
Basic disaster recovery testing begins with a desktop walk-through activity, in which DR team members review DR plans step by step to see if they make sense and to fully understand their roles and responsibilities in a disaster.
The next kind of test, a simulated recovery, impacts specific systems and infrastructure elements. Failover and failback tests of critical servers are among the most frequently conducted. These tests verify the recoverability not only of primary and backup servers but also of the network infrastructure that supports failover and failback and of the specialised applications that effect them.
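A failover/failback test of this kind can be driven by a small harness. The sketch below is only an outline under stated assumptions: `run_failover_drill` and the three callables it takes (`fail_over`, `health_check`, `fail_back`) are hypothetical names, and the stubs at the bottom stand in for whatever failover tooling your environment actually uses.

```python
import time


def run_failover_drill(fail_over, health_check, fail_back,
                       timeout_s=30.0, poll_s=0.5):
    """Fail over, wait until the backup reports healthy, then fail back.

    The three callables are supplied by the caller. On timeout the drill
    stops without attempting failback, so an operator can investigate.
    """
    started = time.monotonic()
    fail_over()
    while not health_check("backup"):
        if time.monotonic() - started > timeout_s:
            return {"failover_ok": False, "failback_ok": False,
                    "failover_seconds": None}
        time.sleep(poll_s)
    failover_seconds = time.monotonic() - started

    fail_back()
    return {"failover_ok": True,
            "failback_ok": health_check("primary"),
            "failover_seconds": failover_seconds}


# Stubs that simulate a two-node service, for illustration only.
state = {"active": "primary"}

def fail_over():
    state["active"] = "backup"

def fail_back():
    state["active"] = "primary"

def health_check(node):
    return state["active"] == node

result = run_failover_drill(fail_over, health_check, fail_back,
                            timeout_s=5.0, poll_s=0.1)
print(result["failover_ok"], result["failback_ok"])
```

Recording the elapsed failover time gives the test a measurable result that can later be compared with the recovery objectives agreed with the business.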
Operational exercises extend the simulated recovery test to a wider scale, typically testing end-to-end recovery of multiple systems, both internal and external, the associated network infrastructures that support connectivity of those assets, and the facilities that house primary and backup systems. These tests are highly complex and carry a higher level of risk than other tests, because multiple systems are affected. Loss of one or more critical systems during this kind of test could result in a serious disruption to the organisation.
Tests have several key goals, as stated in ISO 27031:
- Build confidence throughout the organisation that resilience and recovery strategies will satisfy the business requirements.
- Demonstrate that critical ICT services can be maintained and recovered within agreed service levels or recovery objectives regardless of the incident.
- Demonstrate that critical ICT services can be restored to pre-test state in the event of an incident at the recovery location.
- Provide staff members with an opportunity to familiarise themselves with the recovery process.
- Train staff and ensure they have adequate knowledge of ICT DR plans and procedures.
- Verify that ICT DR plans are synchronised with the ICT infrastructures and business environment.
- Identify opportunities for improving ICT DR strategies or recovery processes.
- Provide audit evidence and demonstrate the organisation's ICT service competence.
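The second goal in the list above, recovery "within agreed service levels or recovery objectives", lends itself to a simple automated check. The sketch below assumes the objectives are expressed as a recovery time objective (RTO) and a recovery point objective (RPO) per service; that interpretation, the class names and the sample figures are illustrative assumptions rather than anything taken from ISO 27031.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjective:
    service: str
    rto: timedelta   # assumed: maximum tolerable time to restore the service
    rpo: timedelta   # assumed: maximum tolerable window of data loss


@dataclass
class ExerciseResult:
    service: str
    time_to_recover: timedelta
    data_loss_window: timedelta


def evaluate(objectives, results):
    """Label each service 'met', 'missed' or 'not tested'."""
    by_service = {r.service: r for r in results}
    report = {}
    for obj in objectives:
        res = by_service.get(obj.service)
        if res is None:
            report[obj.service] = "not tested"
        elif res.time_to_recover <= obj.rto and res.data_loss_window <= obj.rpo:
            report[obj.service] = "met"
        else:
            report[obj.service] = "missed"
    return report


# Illustrative figures only.
objectives = [
    RecoveryObjective("payroll", rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    RecoveryObjective("email", rto=timedelta(hours=8), rpo=timedelta(hours=4)),
    RecoveryObjective("intranet", rto=timedelta(hours=24), rpo=timedelta(hours=12)),
]
results = [
    ExerciseResult("payroll", timedelta(hours=3), timedelta(minutes=30)),
    ExerciseResult("email", timedelta(hours=10), timedelta(hours=1)),
]
report = evaluate(objectives, results)
print(report)
```

Reporting "not tested" separately from "missed" reflects the point that no single exercise proves every element, and it makes the gaps for the next exercise visible.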
IT disaster recovery testing plans provide a step-by-step process for:
- Setting the stage of the exercise by defining the test scope
- Defining test objectives
- Defining success criteria
- Defining the ICT assets to be tested
- Defining the roles and responsibilities of test participants
- Defining exercise steps in a logical sequence, plus unannounced injects that challenge the delegates in how they respond to unanticipated changes
- Conducting a post-test review of what worked, what did not and lessons learned
- Revising the DR plans based on test results
- If possible, retesting the plan to ensure the changes work as intended
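One way to keep a test plan consistent with the steps above is to capture it as a structured record that can be checked before the exercise starts. The following is a minimal sketch; the `DRTestPlan` and `ExerciseStep` names, their fields and the sample content are assumptions for illustration, not a template defined by any standard.

```python
from dataclasses import dataclass, field


@dataclass
class ExerciseStep:
    order: int
    description: str
    is_inject: bool = False   # True for an unannounced inject


@dataclass
class DRTestPlan:
    scope: str
    objectives: list
    success_criteria: list
    assets: list
    roles: dict                                # participant -> responsibility
    steps: list = field(default_factory=list)

    def missing_sections(self):
        """Return the names of plan sections that are still empty."""
        sections = {
            "scope": self.scope,
            "objectives": self.objectives,
            "success_criteria": self.success_criteria,
            "assets": self.assets,
            "roles": self.roles,
            "steps": self.steps,
        }
        return [name for name, value in sections.items() if not value]

    def run_order(self):
        """Return the exercise steps in their logical sequence."""
        return sorted(self.steps, key=lambda step: step.order)


# Illustrative plan content only.
plan = DRTestPlan(
    scope="Failover of the order-processing database",
    objectives=["Validate the documented failover procedure"],
    success_criteria=[],   # deliberately left empty to show the check
    assets=["db-primary", "db-standby"],
    roles={"Test lead": "Coordinates the exercise"},
    steps=[
        ExerciseStep(2, "Confirm the standby accepts connections"),
        ExerciseStep(1, "Initiate failover"),
        ExerciseStep(3, "Inject: network link to standby degrades", is_inject=True),
    ],
)
print(plan.missing_sections())
print([step.description for step in plan.run_order()])
```

The post-test review, plan revision and retest steps are left out of the record on purpose, since they produce outputs of the exercise rather than inputs to it.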
The following list provides a suggested table of contents for an IT DR test. Before you reach the actual test, there’s a lot of work that needs to be completed, such as researching the systems to be tested, researching existing recovery procedures, identifying test scripts (if any), creating and approving test scripts, coordinating with other IT departments and business units in the company, and coordinating with external vendors and service providers.
Once your DR plans have been tested and your awareness and training plans have been initiated, the next steps are to initiate a maintenance programme and initiate an audit and review programme. The first ensures all the previous DR activities we have been discussing are scheduled for annual or semiannual review, testing and updating. The second ensures that all DR programme activities are aligned with established policies and operational controls. Another part of the audit process is to establish a process of continuous improvement. This ensures that DR programmes remain aligned to the business as well as international standards and good DR practice.
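The annual or semiannual review cycle described above can be tracked with a short script. This is a sketch under assumptions: the `add_months` and `overdue_reviews` helpers, the item names and the dates are all hypothetical, and a real maintenance programme would draw them from its own records.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date forward by whole months, clamping to the month's last day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def overdue_reviews(last_reviewed, interval_months, today):
    """Return the items whose next scheduled review date has already passed."""
    return sorted(
        name
        for name, last in last_reviewed.items()
        if add_months(last, interval_months[name]) < today
    )


# Illustrative data: 12 = annual review, 6 = semiannual review.
last_reviewed = {
    "DR plan": date(2023, 1, 15),
    "Training programme": date(2024, 2, 1),
}
interval_months = {"DR plan": 12, "Training programme": 6}

overdue = overdue_reviews(last_reviewed, interval_months, today=date(2024, 6, 1))
print(overdue)
```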