Disaster recovery testing: A vital part of the DR plan

Disaster recovery provision is worthless unless you test your plans. In this two-part series, Computer Weekly looks at disaster recovery testing in virtualised datacentres.

Chris Evans

Published: 30 Nov 2016

IT has become critical to the operation of almost every company that offers goods and services to businesses and consumers.

We all depend on email to communicate, productivity software (such as Microsoft Word and Excel) for our documents and data, plus a range of applications that manage internal operations and customer-facing platforms such as websites and mobile apps.

Disaster recovery – which describes the continuation of operations when a major IT problem hits – is a key business IT process that has to be implemented in every organisation.

First of all, let’s put in perspective the impact of not doing effective disaster recovery.

Estimates of the cost of application and IT outages vary widely, with some figures putting it at around $9,000 per minute.

Obviously, the actual cost varies by organisation, but it’s not difficult to see the impact of lost business for web-based transactions, or lack of productivity for internal staff. Prolonged outages and loss of data can directly result in companies going out of business, as was seen after the Buncefield fire in December 2005.

The unavailability of IT systems can also lead to reputational damage that results in customers going elsewhere and a long-term loss of revenue to the business.

The main aim of having a disaster recovery strategy is to recover quickly from a major IT disaster and ensure that the operation of the business continues with the minimum amount of disruption.

A disaster recovery plan describes the steps executed to recover IT systems when a disaster occurs. Both the strategy and plan will be discussed in this article.

Defining a disaster

It’s important to understand what defines a disaster. Data protection and recovery might be used every day – for example, in recovering an individual file or an entire application or server – but this isn’t necessarily a disaster.

However, the failure of an entire storage array or VMware cluster may constitute a problem big enough to invoke the disaster recovery plan.

This makes it imperative to be clear about when disaster recovery is or isn’t being invoked.

Declaring a disaster may mean failing over systems that are currently running without problems but have to be recovered to another location because of application or latency dependencies.

Creating a disaster recovery strategy

Building a disaster recovery strategy starts at the business level, by determining the criticality of applications to the business.

While it may be desirable to recover all applications as quickly as possible, the recovery process has to be prioritised against those applications and services that are the most important to running the company.

The cost of implementing disaster recovery is directly affected by the level of recovery required so, to contain costs, applications have to be prioritised against a set of metrics that determine recovery requirements.

Recovery time objective (RTO) describes the amount of time a business application can tolerate being unavailable, usually measured in hours, minutes or seconds.

We can imagine that applications delivering core banking for financial organisations have an RTO of zero, whereas some back-end reporting functions may have an RTO of up to four hours.

Recovery point objective (RPO) describes the previous point in time to which an application’s data should be recovered, and therefore how much recent data the business can tolerate losing.

To use our banking example again, an RPO of zero will be expected for most applications – we don’t want to accept any lost transactions.

Alternatively, our reporting application that takes data from other systems may be able to tolerate an RPO of 24 hours, meaning the data recovered is 24 hours out of date, but the lost information can be obtained elsewhere.

Actual RTO/RPO values depend entirely on the business specifying its requirements and negotiating with IT on what can be delivered at an acceptable cost. RTO and RPO form part of an overall service level agreement (SLA) that outlines service level objectives (SLOs) with regard to how applications are provided to the business.

For each application, the SLO will provide metrics for uptime of the application, plus RTO/RPO targets for recovery in the event of a disaster. Applications can then be prioritised and technology implemented to allow each of these service level objectives to be achieved.
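As a rough illustration of prioritising by these metrics, the sketch below records an RTO and RPO per application and sorts the recovery order so that the tightest RTO comes first. The `RecoveryObjective` class, the `recovery_order` function and the email figures are hypothetical; the banking and reporting values follow the examples above. Real prioritisation would also weigh cost and application dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjective:
    """Hypothetical record of the recovery targets agreed for one application."""
    application: str
    rto_minutes: int  # maximum tolerable downtime
    rpo_minutes: int  # maximum tolerable data loss, expressed as time


def recovery_order(objectives):
    """Sort so the tightest RTO is recovered first; ties are broken by RPO, then name."""
    return sorted(
        objectives,
        key=lambda o: (o.rto_minutes, o.rpo_minutes, o.application),
    )


objectives = [
    RecoveryObjective("reporting", rto_minutes=240, rpo_minutes=1440),
    RecoveryObjective("core-banking", rto_minutes=0, rpo_minutes=0),
    RecoveryObjective("email", rto_minutes=60, rpo_minutes=15),  # illustrative values
]

for o in recovery_order(objectives):
    print(f"{o.application}: RTO {o.rto_minutes} min, RPO {o.rpo_minutes} min")
```

Sorting on RTO alone is a simplification; it is used here only to show how the agreed objectives can drive the order of recovery.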

Building a plan

Now that we have a strategy for how applications should be recovered, IT can build a plan covering the actual process of implementing data protection and disaster recovery.

When building a disaster recovery plan, there are some key points to consider:

People – Bear in mind that the people who perform the recovery may not be the same ones who developed the plan. There’s no predicting when a disaster may occur, so key personnel may have moved on, be on holiday or be affected by the disaster. Any plan should be executable by anyone with the right level of authority.
Security – Be aware that recovery personnel may require physical and online access to systems to execute the disaster recovery plan. The plan documentation should explain how these credentials can be obtained.
Accessibility – The actual recovery plan needs to be available for use when required. This means storing a copy away from systems that are written into the plan itself, perhaps at another location or in an online repository.
Concurrency – Ensure that the execution of the plan doesn’t depend (too much) on shared resources. For example, a tape can only serve one restore at a time; if many restores require the same tape, they will queue behind each other and lengthen the recovery, putting the RTO at risk.
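The concurrency point can be made concrete with a little arithmetic. The sketch below is a simplified model, not a description of any backup product: it assumes restores on the same tape run strictly one after another, that different tapes can be read in parallel, and that durations are known in advance. The function names, tape labels and figures are all hypothetical.

```python
from collections import defaultdict


def restore_finish_times(restores):
    """Return {name: minutes from the start until that restore completes}.

    restores: list of (name, tape_id, duration_minutes) in queue order.
    Assumption: one restore at a time per tape; separate tapes run in parallel.
    """
    tape_busy_until = defaultdict(int)
    finish = {}
    for name, tape_id, duration in restores:
        tape_busy_until[tape_id] += duration  # wait for earlier restores on this tape
        finish[name] = tape_busy_until[tape_id]
    return finish


def rto_breaches(restores, rto_minutes):
    """List the restores that finish later than their RTO allows."""
    finish = restore_finish_times(restores)
    return sorted(name for name, done in finish.items() if done > rto_minutes[name])


restores = [
    ("crm", "T1", 90),
    ("billing", "T1", 60),   # queued behind crm on the same tape
    ("intranet", "T2", 45),
]
rto_minutes = {"crm": 120, "billing": 120, "intranet": 120}

print(restore_finish_times(restores))
print(rto_breaches(restores, rto_minutes))
```

In this example the billing restore needs only 60 minutes of tape time, but it misses a 120-minute RTO because it waits 90 minutes for the tape it shares with the CRM restore.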

Disaster recovery plan testing

All disaster recovery plans must be validated through testing.

Tests provide proof that the plan is workable and can be achieved in the timescales set out by RPO/RTO requirements. They also provide a feedback loop that allows disaster recovery plans to be amended to address unexpected or unplanned issues.
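One way to turn a test into proof is to compare what was actually achieved against the agreed objectives. The following is a minimal sketch under stated assumptions: the `DrTestRun` record, its field names and the `evaluate` function are hypothetical, and the timestamps are invented for illustration. It treats achieved RTO as the time from outage to restored service, and achieved RPO as the age of the newest recovered data at the moment of the outage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrTestRun:
    """Hypothetical record of one disaster recovery test."""
    application: str
    outage_start: datetime           # moment the simulated disaster began
    service_restored: datetime       # moment the application was usable again
    newest_recovered_data: datetime  # timestamp of the latest data present after recovery


def evaluate(run, rto, rpo):
    """Compare the achieved recovery time and data loss against the objectives."""
    achieved_rto = run.service_restored - run.outage_start
    achieved_rpo = run.outage_start - run.newest_recovered_data
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto,
        "rpo_met": achieved_rpo <= rpo,
    }


run = DrTestRun(
    application="reporting",
    outage_start=datetime(2016, 11, 30, 9, 0),
    service_restored=datetime(2016, 11, 30, 12, 30),
    newest_recovered_data=datetime(2016, 11, 29, 10, 0),
)
report = evaluate(run, rto=timedelta(hours=4), rpo=timedelta(hours=24))
print(report)
```

Recording the achieved figures, rather than a simple pass or fail, shows how much headroom remains and gives the feedback loop something concrete to work with.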

It’s also important to bear in mind that IT systems change continually. There is constant data growth, new applications are implemented and existing applications may be updated at frequent intervals.

Running disaster recovery tests should consider the following aspects:

Time – How long has it been since an application was tested? The more time passes, the greater the risk of failure due to application change and growth.
Change – When major changes in infrastructure occur, tests may need to be performed or even rewritten. For example, if storage hardware is changed, then tests will need to be re-done. If a hypervisor upgrade occurs, tests need to be done to ensure the backup/restore process hasn’t been affected.
Impact – Running disaster recovery tests can be disruptive, both in terms of time (if the application has to be down during the test) and in terms of risk (if live data is being restored elsewhere). Disaster recovery testing needs to include a decision on whether to discard or bring back changes made in an application during the disaster recovery process. It makes sense to incorporate application data updates, otherwise the exercise isn’t a valid test of the disaster recovery process.
People – Consider running some tests with people other than those who directly support the application. These kinds of tests don’t always have to be with live systems and can be done purely as a paper exercise to test the validity of the recovery steps.
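The time and change considerations above lend themselves to a simple scheduling check. The sketch below flags an application for retesting when its last test is older than a chosen interval or when infrastructure has changed since that test. The `needs_retest` function, the 180-day interval and the dates are assumptions made for illustration; the right interval is a business decision.

```python
from datetime import date


def needs_retest(last_tested, today, max_age_days, last_infra_change=None):
    """Return the reasons (possibly none) why an application should be retested."""
    reasons = []
    if (today - last_tested).days > max_age_days:
        reasons.append("test is older than the allowed interval")
    if last_infra_change is not None and last_infra_change > last_tested:
        reasons.append("infrastructure changed since the last test")
    return reasons


today = date(2016, 11, 30)

# Tested in January, storage hardware replaced in September: both checks fire.
stale = needs_retest(date(2016, 1, 15), today, max_age_days=180,
                     last_infra_change=date(2016, 9, 1))

# Tested in October with no changes since: nothing to do.
fresh = needs_retest(date(2016, 10, 1), today, max_age_days=180)

print(stale)
print(fresh)
```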

The ability to test a disaster recovery plan is a critical requirement.

Looking back 10 to 15 years, when applications were physically deployed on individual servers, the testing process was invasive and usually involved an outage.

Today, with hypervisor-based data protection and a range of disaster recovery tools, we can perform testing with minimal impact.

In the next article of this series, we will discuss what tools are available and how they can be used to implement and validate a disaster recovery strategy, specifically addressing the requirements we have outlined here.