Think modular for effective recovery plans

There is no one-size-fits-all business continuity strategy, so think of disaster recovery scenarios as modules that can be invoked depending on the situation, says Josh Krischer


There is no one-size-fits-all when it comes to developing business continuity strategies. Using someone else's requirements, which might turn out to be based on limitations or regulations that your company does not have, could spell disaster of another type.

Think of business continuity and recovery scenarios as modules that fit into a broader business continuity plan. When an incident occurs, it is mapped to the appropriate business continuity scenario, which then dictates the appropriate recovery plan modules to be invoked.

Modules can be reused for various business continuity scenarios. For example, certain types of disaster will involve making contact with external authorities, while others will not. Some types of disaster will require the involvement of a company's PR department, others will not.
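The mapping from incident to scenario to recovery modules can be pictured as a simple lookup. The following is a minimal sketch, not a prescribed design: the module names, scenario names and the `modules_for` function are all hypothetical and chosen only to illustrate how one module can be reused across several scenarios.

```python
# All module and scenario names below are hypothetical, for illustration only.
RECOVERY_MODULES = {
    "notify_authorities": "Contact external authorities",
    "engage_pr": "Involve the PR department",
    "failover_it": "Recover IT services at the recovery site",
    "relocate_staff": "Move staff to alternative premises",
}

# Each scenario lists the modules it invokes; modules are shared, not duplicated.
SCENARIOS = {
    "site_fire": ["notify_authorities", "engage_pr", "failover_it", "relocate_staff"],
    "network_outage": ["failover_it"],
    "data_breach": ["notify_authorities", "engage_pr"],
}


def modules_for(scenario):
    """Return the descriptions of the recovery modules a scenario invokes."""
    if scenario not in SCENARIOS:
        raise ValueError(f"no business continuity scenario mapped for {scenario!r}")
    return [RECOVERY_MODULES[name] for name in SCENARIOS[scenario]]


if __name__ == "__main__":
    for step in modules_for("data_breach"):
        print(step)
```

Because each module is defined once, a change to, say, the PR procedure only has to be made in one place, whichever scenarios invoke it.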

Many companies think the end game for business continuity is to recover the technology infrastructure, such as network, telecoms, applications and desktops. They therefore do a fine job of disaster recovery planning, but if and when the time comes to execute the plan and use the recovery site for production processing, it may still not be possible to conduct business. Small, seemingly unimportant things need to be taken into consideration by both the business and IT.

Business impact analysis is a critical step as it identifies what and how much the company has at risk, as well as which business processes are most critical, thereby prioritising risk management and recovery investment.

The business continuity team, which has to include the business process owners, must translate the business requirements into an overall business continuity plan. Three of the most important deliverables from a business impact analysis are:

  • Recovery time objective (RTO): the length of time between when a disaster occurs and when the business process must be back in production mode
  • Recovery point objective (RPO): the point in the business process to which data must be recovered after a disaster occurs. For example, the start of the business day, the last back-up or the last transaction that was processed
  • Cost of downtime: the business should calculate the potential losses incurred, both as the result of a disaster and in recreating lost data.

These considerations determine the technologies and methods used to support the disaster recovery plan.
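As a rough illustration of how these three deliverables can be used to judge a candidate recovery approach, consider the sketch below. It rests on assumptions that are not in the text above: RPO is expressed as a maximum tolerable age of data in hours, cost of downtime is modelled as a flat hourly loss plus a one-off cost of recreating lost data, and every name and figure is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProcessRequirements:
    """Hypothetical business impact analysis output for one business process."""
    name: str
    rto_hours: float      # maximum time allowed before the process is back in production
    rpo_hours: float      # assumption: RPO expressed as maximum tolerable age of data
    loss_per_hour: float  # assumption: downtime loss modelled as a flat hourly figure


def meets_objectives(req, recovery_hours, copy_interval_hours):
    """True if a candidate approach satisfies both the RTO and the RPO."""
    return recovery_hours <= req.rto_hours and copy_interval_hours <= req.rpo_hours


def cost_of_downtime(req, outage_hours, data_recreation_cost):
    """Losses from the outage itself plus the cost of recreating lost data."""
    return req.loss_per_hour * outage_hours + data_recreation_cost


if __name__ == "__main__":
    orders = ProcessRequirements("order_entry", rto_hours=4, rpo_hours=1, loss_per_hour=10_000)
    # Nightly back-up restored in 3 hours: meets the RTO but not the 1-hour RPO.
    print(meets_objectives(orders, recovery_hours=3, copy_interval_hours=24))
    print(cost_of_downtime(orders, outage_hours=3, data_recreation_cost=5_000))
```

Comparing the estimated cost of downtime with the cost of each candidate technology is one way to decide how much recovery investment a process justifies.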

Keeping data losses to a minimum is critical for some applications. But a more important issue is assuring data consistency and integrity at the recovery site.

If the data is not consistent at the recovery site, a time-consuming restore from back-up is usually required, which may take days.

Companies need to understand fully how their chosen replication technology works, what its limitations are, and how it will react in various disaster scenarios, such as loss of network, physical site disaster, component failure and application failure. Only then can they put in place a strategy to assure data recovery with integrity, and still meet their RPOs.

Also, hunting down conflicting data and reconciling the status of key information can mean a much longer recovery time. Many companies mistakenly believe replication technology suppliers who say there will always be data consistency in a disaster.

There is no ideal distance between primary and disaster recovery datacentres. Rather, the best location is the one that minimises the risks at an acceptable cost and meets any required industry regulations.

Increasing the distance between the primary and secondary sites will mean higher telecoms costs and the deployment of appropriate techniques. It may also reduce performance and increase the chances of disruption. Users should invest in infrastructure to ensure the availability of resources that are usually beyond their control, such as power and telecoms links.

In most cases, regardless of the distance between the sites, each datacentre should have a separate main power supply (different providers or at least different transformers) and separate telecoms paths.

It would be even better if each datacentre had redundant power generators and an uninterruptible power supply. If both sites are connected by fibre optic cables, redundancy should be provided by using two separate routes.

When replicating at the storage controller level, it is important to keep two copies at the recovery site: a main copy (the target of the replication) and a point-in-time copy, or snapshot. If the remote copy operation is suspended, because of a transmission problem, for example, data at the primary site will continue to be modified but the changes will not be transferred to the secondary site. When the connection is re-established, resynchronisation will send the modifications, but not in the order in which the writes were issued.

If a disaster strikes during the resynchronisation, the data at the secondary site will be inconsistent. To avoid this situation, the point-in-time copy should be split from the target (the secondary disc) before resynchronisation begins. Then, if a disaster strikes during resynchronisation, data on the secondary disc may not be consistent, but the point-in-time copy will contain the last consistent image. The local copy is re-established after resynchronisation is complete.
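The split, resynchronise and re-establish sequence can be shown with a toy model. This is only a sketch of the ordering of the steps, assuming an in-memory dictionary stands in for a disc; the `SecondarySite` class, the `resynchronise` function and the `fail_after` parameter are hypothetical and do not correspond to any supplier's interface.

```python
class SecondarySite:
    """Toy model of a recovery site: a replication target plus a point-in-time copy."""

    def __init__(self, data):
        self.target = dict(data)    # main copy, the target of the replication
        self.snapshot = dict(data)  # local point-in-time copy

    def split_snapshot(self):
        # Freeze the last consistent image before any out-of-order writes arrive.
        self.snapshot = dict(self.target)

    def reestablish_snapshot(self):
        # Bring the local copy back in line once the target is consistent again.
        self.snapshot = dict(self.target)


def resynchronise(site, pending_writes, fail_after=None):
    """Apply pending writes to the target.

    fail_after (hypothetical) simulates a disaster after that many writes.
    Returns True if resynchronisation completed, False if it was interrupted.
    """
    site.split_snapshot()
    for count, (block, value) in enumerate(pending_writes):
        if fail_after is not None and count == fail_after:
            return False  # target is now partially updated and may be inconsistent
        site.target[block] = value
    site.reestablish_snapshot()
    return True


if __name__ == "__main__":
    site = SecondarySite({"account": 100, "ledger": 100})
    # Writes arrive out of their original order; disaster strikes after the first.
    completed = resynchronise(site, [("ledger", 90), ("account", 90)], fail_after=1)
    print(completed, site.target, site.snapshot)
```

In the interrupted run the target holds a mix of old and new values, while the split copy still holds the last consistent image from which recovery can start.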

Another best practice is to perform the recovery process from the copy and not the original secondary disc because the data can be damaged during this process. If the recovery is done from the local point-in-time copy, it will not damage the source data and a new local copy can be made at any time.

Businesses should ensure that, for synchronous remote copy, the bandwidth exceeds peak data transfer requirements. For asynchronous remote copy, bandwidth sized for average activity is sufficient.
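The sizing rule can be expressed as a small calculation over measured write rates. The sketch below is illustrative only: the `required_bandwidth` function, the mode names and the sample figures are hypothetical, and a real sizing exercise would also have to allow for protocol overhead and growth.

```python
from statistics import mean


def required_bandwidth(write_rates_mb_s, mode):
    """Return the baseline link bandwidth (MB/s) implied by sampled write rates.

    For synchronous copy the link must exceed the peak rate returned here;
    for asynchronous copy the average rate is taken as sufficient.
    """
    if not write_rates_mb_s:
        raise ValueError("at least one write-rate sample is needed")
    if mode == "synchronous":
        return max(write_rates_mb_s)
    if mode == "asynchronous":
        return mean(write_rates_mb_s)
    raise ValueError(f"unknown remote copy mode: {mode!r}")


if __name__ == "__main__":
    samples = [10, 20, 60, 30]  # hypothetical write rates in MB/s
    print(required_bandwidth(samples, "synchronous"))
    print(required_bandwidth(samples, "asynchronous"))
```

The gap between the two figures for the same workload shows why the choice of copy mode has such a direct effect on telecoms costs.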

Josh Krischer is research vice-president at analyst firm Gartner


Developing a business continuity plan

Business continuity management and disaster recovery planning are hard work because they mean addressing every aspect of business operations in the planning, development and testing phases

Start strongly - know what is required, and what is not, by conducting a business impact analysis

Apply an integrated business and IT approach for recovery plan development, management and testing

Reduce the maintenance of a continuity plan by using modular scenarios for disaster and recovery

Assure the integrity of data at the secondary site through proper planning and testing, and by keeping point-in-time copies.

Source: Gartner
