
The truth about virtualisation and disaster recovery

The fundamentals of disaster recovery are well-established. But there is uncertainty, and even false claims from suppliers, about how the rise of virtualisation affects DR

With the advent of server virtualisation, some have claimed disaster recovery (DR) and business continuity (BC) are easier to implement, or even that they are a thing of the past.

It's easy to see how this situation could have arisen. Talk to any virtualisation administrator and the idea of using traditional backups and tapes seems like a distant memory.

Since the early days of virtualisation, we've seen the introduction of high-availability clustering and features such as vMotion that seem to resolve the majority of data protection requirements.

But, look a little closer and we discover that these claims are somewhat exaggerated.

Defining disaster recovery

It's always good to establish a baseline set of definitions, and that's imperative in disaster recovery planning. BC and DR are slightly different in their definitions:

  • Business continuity describes the process of ensuring the business continues to operate.
  • Disaster recovery describes the process of bringing back services after a disaster has struck. In most instances, we'd want to avoid that scenario as much as possible, or at least minimise it.

There are two other definitions widely used in disaster recovery/business continuity: recovery time objective (RTO) and recovery point objective (RPO). Both establish a service level for data recovery (a worked sketch follows the definitions):

  • Recovery time objective sets the expected or target time within which a service must be returned to normal operation. An RTO of zero implies services must be restored immediately, with no outage. An RTO of one hour requires the application to be restored to normal use within an hour.
  • Recovery point objective expresses the point in time to which the service's data should be restored. An RPO of zero implies the service must carry on from where it left off with no loss of data (imperative for a banking application). An RPO of 24 hours implies yesterday's data is good enough (acceptable for a test/development environment).
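To make these targets concrete, here is a minimal Python sketch that checks an incident timeline against an RTO/RPO pair. All timestamps and targets are illustrative assumptions, not figures from any real system.

    from datetime import datetime, timedelta

    # Hypothetical incident timeline -- all values are illustrative
    last_good_copy = datetime(2015, 9, 1, 3, 0)    # most recent consistent backup
    failure_time = datetime(2015, 9, 1, 9, 30)     # service went down
    restored_time = datetime(2015, 9, 1, 10, 15)   # service back in normal use

    achieved_rto = restored_time - failure_time    # length of the outage
    achieved_rpo = failure_time - last_good_copy   # window of lost data

    target_rto = timedelta(minutes=30)             # assumed targets
    target_rpo = timedelta(hours=6)

    print("RTO", achieved_rto, "met" if achieved_rto <= target_rto else "missed")
    print("RPO", achieved_rpo, "met" if achieved_rpo <= target_rpo else "missed")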

Building a disaster recovery plan

In an ideal world, all applications would be restored with RTO = zero and RPO = zero, but the practicalities and cost of this would be untenable.

Instead, the starting point in any BC/DR plan should be to establish the disaster scenarios, rate their importance and impact, and apply that to each application. For example, failure scenarios include:

  • Loss or damage to systems (fire, power failure, flood, earthquake).
  • Inability to access facilities (fire, flood, hazard -- chemical, radiation).
  • Criminal or malicious damage (disgruntled employees or cyber attacks).
  • System or application failures (software bugs, failed upgrades, data corruption).

We can rate the impact of failure for each application and use that to determine our RTO/RPO service-level objectives (SLOs). Again, here are some examples:

  • Email system: Impact of downtime -- high; RTO = 30 minutes, RPO = zero.
  • Core banking app: Impact of downtime -- critical; RTO = zero, RPO = zero.
  • Overnight reports app: Impact of downtime -- low; RTO = four hours, RPO = 24 hours.

In today's always-on world, most customer-facing apps are expected to run 24/7. This can skew some of the SLOs, but application design can minimise this by separating out front-end access from back-end features.

Clearly, whether an application is virtualised or not, and independent of the technology, the business requirements for recovery have to be established first. In fact, the business should provide guidance on its requirements rather than have IT impose standards, as has traditionally happened.

How virtualisation can help disaster recovery

Virtualisation abstracts the physical resources of the server into logical constructs that represent hard drives, network cards and disk controllers. Processors, memory and network ports are represented by parameters in a configuration file and hard drives are represented by files on local or shared storage.

Therefore, backing up a virtual machine (VM) is simply a case of taking a copy of the files and the configuration data. In addition, moving a virtual machine to alternative hardware can be achieved even if the physical hardware isn't identical. This makes it much easier to manage hardware failure issues. Virtualisation features solve the problems of BC/DR in the following ways.
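To illustrate just how file-like a VM is, here is a minimal Python sketch of a cold backup that simply copies a VM's directory. The paths are hypothetical, and the copy is only usable if the VM is powered off or quiesced, as discussed below.

    import shutil
    from pathlib import Path

    def cold_backup(vm_dir: Path, backup_root: Path) -> Path:
        """Copy a VM's files (configuration plus virtual disks) verbatim.

        Only consistent if the VM is shut down or its I/O is quiesced.
        """
        dest = backup_root / vm_dir.name
        shutil.copytree(vm_dir, dest, dirs_exist_ok=True)
        return dest

    # Hypothetical datastore layout: one directory per virtual machine
    cold_backup(Path("/vmfs/volumes/datastore1/mail01"), Path("/mnt/backup/vms"))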

Simple backup/restore

Recovery is often based on creating backups and restoring from those copies in a disaster recovery scenario. To meet this need, hypervisors provide features that allow backups to be taken by copying VM contents. To ensure data integrity, virtual machines run software agents or tools that quiesce or suspend I/O while a copy of the VM files is taken.

A simple backup can be used to provide recovery at file, application or VM level, depending on the sophistication of the backup software. Backing up an entire VM snapshot can affect production performance, so some systems allow the snapshot capabilities of the storage array to be combined with hypervisor snapshots, offloading the processing work while maintaining data integrity.
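The sequence matters more than any particular API. Below is a Python sketch of the snapshot-based backup flow; the Hypervisor class is a hypothetical stand-in for whatever your platform actually exposes (vSphere, Hyper-V and KVM all differ).

    # Snapshot-based VM backup flow; only the order of operations is the point
    class Hypervisor:
        def create_snapshot(self, vm, quiesce=True):
            # In-guest tools flush and suspend I/O first, so the image
            # is application-consistent, then the snapshot is taken
            print(f"snapshot of {vm} (quiesce={quiesce})")
            return f"{vm}-snap"

        def copy_files(self, snap, dest):
            print(f"copy {snap} -> {dest}")   # the heavy I/O happens here

        def delete_snapshot(self, snap):
            print(f"delete {snap}")           # merge deltas, free space

    def backup_vm(hv, vm, dest):
        snap = hv.create_snapshot(vm, quiesce=True)
        try:
            hv.copy_files(snap, dest)         # VM keeps running meanwhile
        finally:
            hv.delete_snapshot(snap)          # never leave snapshots behind

    backup_vm(Hypervisor(), "mail01", "/mnt/backup/mail01")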

VM migration

Although not strictly a data recovery process, the ability to move virtual machines dynamically between physical hardware provides the capability to reduce the impact of hardware failures. VM migration doesn't protect against server failure, but can be used to move VMs when partial failures are experienced, either in the server or other components (such as the network).

VM migration can also be used as a controlled process when services have to be moved off a piece of hardware (for maintenance) or to mitigate the risk of a datacentre failure (such as an impending storm or hurricane that might affect datacentre operations). In this sense, VM migration is more akin to business continuity, ensuring that servers continue to run, even in the event of a potential or actual incident.
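A controlled evacuation of this kind can be scripted. The Python sketch below drains a host ahead of maintenance; the inventory and the "migration" are simulated stand-ins for real vMotion-style calls.

    # Draining a host ahead of planned maintenance (simulated)
    inventory = {
        "host-a": ["mail01", "web01"],   # host due for maintenance
        "host-b": ["db01"],
        "host-c": [],
    }

    def drain(host):
        """Move every VM off `host`, always picking the emptiest target."""
        targets = [h for h in inventory if h != host]
        for vm in list(inventory[host]):
            dest = min(targets, key=lambda h: len(inventory[h]))
            inventory[host].remove(vm)
            inventory[dest].append(vm)
            print(f"migrated {vm}: {host} -> {dest}")

    drain("host-a")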

High availability/fault tolerance

These are features of the hypervisor that keep a virtual machine running in the event of a hardware failure. Two levels of service are provided. High availability monitors virtual machines and restarts them on alternative hardware if a server fails, which results in a short outage while the application restarts. Fault tolerance runs a ghost VM image on alternative hardware and promotes that image to the production service if the server fails, typically with no application outage.

The ability to use these features may require shared storage hardware (to store the VM configuration and data) and will, of course, be chargeable. Some suppliers support the ability to use array-based replication in conjunction with high availability/fault tolerance features. This allows the hardware configuration to span a short distance (hundreds of metres) and create a metro cluster. Metro clusters mitigate datacentre outages or serious hardware failures without the need to deploy complex application clusters.
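In outline, high availability is a monitor-and-restart loop. This simplified Python sketch simulates it; a real implementation relies on cluster heartbeats and the shared storage noted above, so that any surviving host can see the VM's files.

    # Simplified high-availability monitor: when a host stops responding,
    # restart its VMs elsewhere (state here is simulated)
    hosts = {"host-a": ["mail01"], "host-b": [], "host-c": []}
    alive = {"host-a": False, "host-b": True, "host-c": True}  # host-a missed heartbeats

    def failover(dead_host):
        for vm in list(hosts[dead_host]):
            target = next(h for h in hosts if alive[h])
            hosts[dead_host].remove(vm)
            hosts[target].append(vm)
            print(f"restarting {vm} on {target}")  # short outage while the VM boots

    for host in hosts:
        if not alive[host] and hosts[host]:
            failover(host)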

Continual backup

Some third-party applications use virtualisation to intercept write I/O for a virtual machine and create a remote backup image. This process occurs asynchronously and provides a failover copy of the VM that can be powered up and used in the event of a failure at the primary site. The RPO of an application using this kind of system will be dependent on the speed at which data can be moved off-site.
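That dependency can be estimated with simple arithmetic: any writes not yet shipped off-site are the data you stand to lose. A back-of-envelope Python sketch, with purely illustrative figures:

    # RPO estimate for asynchronous, write-intercepting replication.
    # Data not yet shipped off-site is what you lose; figures are illustrative.
    write_rate_mb_s = 40      # sustained write I/O at the primary site
    wan_mb_s = 25             # usable replication bandwidth to the DR site

    if wan_mb_s >= write_rate_mb_s:
        print("link keeps up: RPO ~ seconds (in-flight data only)")
    else:
        backlog_growth = write_rate_mb_s - wan_mb_s   # MB/s of lag accrued
        hours = 8                                     # assumed busy period
        backlog_mb = backlog_growth * hours * 3600
        # Worst-case data loss at the end of the busy period:
        print(f"backlog after {hours}h: {backlog_mb/1024:.1f} GB "
              f"(~{backlog_mb/write_rate_mb_s/60:.0f} minutes of writes lost)")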

Application resiliency

As discussed, virtualisation provides benefits through the abstraction of physical hardware components. One other consideration in implementing disaster recovery/business continuity is to build recovery capabilities directly into the application itself.

Application resiliency is achieved by running many instances of the application, each of which can fail and be restarted on alternative hardware. This kind of design isn't directly dependent on virtualisation, but can work well where multiple hypervisors and hardware configurations are implemented. In the future, we will see BC/DR resiliency implemented using containers, a form of application virtualisation that is at the beginning of widespread adoption.
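In outline, application resiliency is a supervision loop over a pool of instances. The Python sketch below simulates it; Instance is a hypothetical stand-in for a real process, VM or container.

    import random

    # Application-level resiliency: run several identical instances
    # and replace any that fail, independent of the hypervisor
    class Instance:
        def __init__(self, name):
            self.name = name

        def is_healthy(self):
            return random.random() > 0.2   # simulated 20% failure rate

    pool = [Instance(f"app-{i}") for i in range(3)]

    for _ in range(3):                     # a few supervision rounds
        for i, inst in enumerate(pool):
            if not inst.is_healthy():
                print(f"{inst.name} failed; starting a replacement")
                pool[i] = Instance(inst.name)  # restart on any available host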

With reference to disaster recovery/business continuity principles such as RTO and RPO, we can apply recovery options to the requirements of the application. Some applications will get full high availability/fault tolerance, whereas others may simply be backed up using hypervisor snapshots. In some instances, continuous backups or full high availability/fault tolerance with array-based replication can be justified. It's simply a case of applying the technology to the requirements.
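As a final sketch, here is one way to express that mapping in Python, reusing the example applications from earlier in the article. The tier thresholds are assumptions for illustration, not a standard.

    from datetime import timedelta

    def protection_for(rto, rpo):
        """Map an RTO/RPO pair to a protection tier (thresholds assumed)."""
        if rto == timedelta(0) and rpo == timedelta(0):
            return "fault tolerance with array-based replication"
        if rpo == timedelta(0):
            return "high availability plus continuous backup"
        if rpo <= timedelta(hours=1):
            return "frequent hypervisor snapshots, replicated off-site"
        return "daily hypervisor snapshot backup"

    apps = {   # the example applications from earlier in the article
        "core banking": (timedelta(0), timedelta(0)),
        "email": (timedelta(minutes=30), timedelta(0)),
        "overnight reports": (timedelta(hours=4), timedelta(hours=24)),
    }
    for name, (rto, rpo) in apps.items():
        print(f"{name}: {protection_for(rto, rpo)}")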

Read more about disaster recovery

  • In this series of articles on IT disaster recovery policy, we'll walk you through the process of developing disaster recovery provision, from initial risk assessment to maintenance and continuous improvement of your plans.
  • Jon Toigo argues against virtualisation advocates who say the software-defined datacentre, with its high availability and clustering, does away with the need for disaster recovery.

Join the conversation

Nice layout of core principles around DR and virtualization. I'm especially interested to see how much more adoption fault tolerance gets in vSphere environments now that you can scale those VMs and actually perform VM backups (couldn't snapshot them prior to vSphere 6, which is a core part of that process).

I also agree that you need to let the business drive what infrastructure solutions are used and not the other way around. Too often you see issues with SLAs from forcing a technology that isn't right for the job.

Virtualization has spawned something new with DR as well - recovery assurance through automation. It seems this is a big focus of vendors both on-premise and in the cloud. This was tougher to do in physical environments.
The issues have been concisely and clearly set out.

Although principles such as RPO/RTO are as relevant as ever, there may now be a tendency in planning to forget them and rely instead on newer SAN solutions and datacentre infrastructure, e.g. active/active.
Snapshot management clearly needs special focus within DR planning in the virtual environment.
It's hard for me to see that a metro cluster could be used for disaster recovery, or even for business continuity. Something as minor as a power failure could affect all the servers if they're that close together.
