It's easy to see how this situation could have arisen. Talk to any virtualisation administrator and the idea of using traditional backups and tapes seems like a distant memory.
Since the early days of virtualisation, we've seen the introduction of high-availability clustering and features such as vMotion that seem to resolve the majority of data protection requirements.
But look a little closer and we discover that this impression is somewhat exaggerated.
Defining disaster recovery
It's always good to establish a baseline set of definitions, and that's imperative in disaster recovery planning. Business continuity (BC) and disaster recovery (DR) are slightly different in their definitions:
- Business continuity describes the process of ensuring the business continues to operate.
- Disaster recovery describes the process of bringing back services after a disaster has struck. In most instances, we'd want to avoid that scenario as much as possible, or at least minimise it.
Two other terms are widely used in disaster recovery/business continuity: recovery time objective (RTO) and recovery point objective (RPO). Both establish a service level for data recovery:
- Recovery time objective determines the expected or target time required to return the business to normal operations. An RTO of zero implies services must be restored immediately, with no outage. An RTO of one hour requires the application to be restored to normal use within an hour.
- Recovery point objective expresses the point in history to which that service should be restored. An RPO of zero implies the service must carry on from where it left off with no loss of data (imperative for a banking application). An RPO of 24 hours implies yesterday's data is good enough (acceptable for a test/development environment).
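The two definitions above can be expressed as a simple check. The following is a minimal sketch, not a standard tool: the names `RecoveryObjective` and `meets_objective` are hypothetical, and it assumes that downtime is measured from the moment of failure to the moment service is restored, and that data loss is measured from the last usable copy to the moment of failure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjective:
    """Hypothetical container for one application's objectives."""
    rto: timedelta  # longest tolerated time to restore the service
    rpo: timedelta  # longest tolerated window of lost data


def meets_objective(objective: RecoveryObjective,
                    failure_at: datetime,
                    restored_at: datetime,
                    last_good_copy_at: datetime) -> bool:
    """Return True if a recovery stayed within both RTO and RPO."""
    downtime = restored_at - failure_at             # compared with RTO
    data_loss_window = failure_at - last_good_copy_at  # compared with RPO
    return downtime <= objective.rto and data_loss_window <= objective.rpo


# Example: failure at 12:00, service back at 12:45, last copy taken at 02:00.
failure = datetime(2024, 1, 2, 12, 0)
restored = datetime(2024, 1, 2, 12, 45)
last_copy = datetime(2024, 1, 2, 2, 0)

relaxed = RecoveryObjective(rto=timedelta(hours=1), rpo=timedelta(hours=24))
strict = RecoveryObjective(rto=timedelta(0), rpo=timedelta(0))

print(meets_objective(relaxed, failure, restored, last_copy))  # True
print(meets_objective(strict, failure, restored, last_copy))   # False
```

The same recovery passes a one-hour RTO with a 24-hour RPO, but fails an RTO and RPO of zero, because there were 45 minutes of downtime and 10 hours of changes made since the last copy.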
Building a disaster recovery plan
In an ideal world, all applications would be restored with RTO = zero and RPO = zero, but the practicalities and cost of this would be untenable.
Instead, the starting point in any BC/DR plan should be to establish the disaster scenarios, rate their importance and impact, and apply that to each application. For example, failure scenarios include:
- Loss or damage to systems (fire, power failure, flood, earthquake).
- Inability to access facilities (fire, flood, hazard -- chemical, radiation).
- Criminal or malicious damage (disgruntled employees or cyber attacks).
- System or application failures (software bugs, failed upgrades, data corruption).
We can rate the impact of failure for each application and use that to determine our RTO/RPO service level objectives (SLOs). Again, here are some examples:
- Email system: Impact of downtime -- high; RTO = 30 minutes, RPO = zero.
- Core banking app: Impact of downtime -- critical; RTO = zero, RPO = zero.
- Overnight reports app: Impact of downtime -- low; RTO = four hours, RPO = 24 hours.
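As an illustration, the example SLOs above could be recorded as data and used to decide which applications to restore first. This is a sketch only: the dictionary layout and the `recovery_order` function are hypothetical, and ordering by tightest RTO (then RPO) is an assumption about how a team might prioritise.

```python
from datetime import timedelta

# The three example SLOs from the list above, recorded as plain data.
SLOS = {
    "Email system": {
        "impact": "high",
        "rto": timedelta(minutes=30),
        "rpo": timedelta(0),
    },
    "Core banking app": {
        "impact": "critical",
        "rto": timedelta(0),
        "rpo": timedelta(0),
    },
    "Overnight reports app": {
        "impact": "low",
        "rto": timedelta(hours=4),
        "rpo": timedelta(hours=24),
    },
}


def recovery_order(slos: dict) -> list:
    """Return application names with the tightest RTO first, ties broken by RPO."""
    return sorted(slos, key=lambda name: (slos[name]["rto"], slos[name]["rpo"]))


print(recovery_order(SLOS))
# ['Core banking app', 'Email system', 'Overnight reports app']
```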
In today's always-on world, most customer-facing apps are expected to run 24/7. This can skew some of the SLOs, but application design can minimise this by separating out front-end access from back-end features.
Clearly, whether an application is virtualised or not, and independent of the technology, the business requirements for recovery have to be established first. In fact, the business should always provide guidance on its requirements rather than have IT impose standards, which has been the traditional mode of operation.
How virtualisation can help disaster recovery
Virtualisation abstracts the physical resources of the server into logical constructs that represent hard drives, network cards and disk controllers. Processors, memory and network ports are represented by parameters in a configuration file and hard drives are represented by files on local or shared storage.
Therefore, backing up a virtual machine (VM) is simply a case of taking a copy of the files and the configuration data. In addition, moving a virtual machine to alternative hardware can be achieved even if the physical hardware isn't identical. This makes it much easier to manage hardware failure issues. Virtualisation features solve the problems of BC/DR in the following ways.
Recovery is often based around the need to create a backup and restore from those copies in a disaster recovery scenario. To meet this need, hypervisors provide features that allow backups to be taken by copying VM contents. To ensure there is data integrity, virtual machines run software agents or tools that quiesce or suspend I/O while a copy of the VM files is taken.
A simple backup can be used to provide recovery at file, application or VM level, depending on the sophistication of the backup software. Backing up an entire VM snapshot can have a significant performance impact, so some systems allow the snapshot capabilities of the storage system to be combined with hypervisor snapshots to offload the processing work while maintaining data integrity.
Although not strictly a data recovery process, the ability to move virtual machines dynamically between physical hardware provides the capability to reduce the impact of hardware failures. VM migration doesn't protect against server failure, but can be used to move VMs when partial failures are experienced, either in the server or other components (such as the network).
VM migration can also be used as a controlled process when services have to be moved off a piece of hardware (for maintenance) or to mitigate the risk of a datacentre failure (such as an impending storm or hurricane that might affect datacentre operations). In this sense, VM migration is more akin to business continuity, ensuring that servers continue to run, even in the event of a potential or actual incident.
High availability/fault tolerance
These are features of the hypervisor that keep a virtual machine running in case of a hardware failure. Two levels of service are provided. High availability monitors virtual machines and restarts them on alternative hardware in the case of a server failure. This results in a small outage as the application restarts. Fault tolerance runs a ghost VM image on alternative hardware and promotes that image to the production service in the case of server failure, typically with no application outage.
The ability to use these features may require shared storage hardware (to store the VM configuration and data) and will, of course, be chargeable. Some suppliers support the ability to use array-based replication in conjunction with high availability/fault tolerance features. This allows the hardware configuration to span a short distance (hundreds of metres) and create a metro cluster. Metro clusters mitigate datacentre outages or serious hardware failures without the need to deploy complex application clusters.
Some third-party applications use virtualisation to intercept write I/O for a virtual machine and create a remote backup image. This process occurs asynchronously and provides a failover copy of the VM that can be powered up and used in the event of a failure at the primary site. The RPO of an application using this kind of system will be dependent on the speed at which data can be moved off-site.
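A rough way to reason about that dependency is to track how many writes are still waiting to be shipped off-site, and how long the link would take to drain them. The sketch below is a simplified model, not a description of any particular product: the function names are hypothetical, and it assumes a fixed link speed and a write rate sampled at regular intervals.

```python
def backlog_mb(write_mb_per_s, link_mb_per_s, interval_s=1.0):
    """Return the unsent data (MB) left after a series of write-rate samples.

    write_mb_per_s: write rates observed in successive intervals (MB/s).
    link_mb_per_s:  assumed constant off-site link throughput (MB/s).
    interval_s:     length of each sampling interval in seconds.
    """
    backlog = 0.0
    for rate in write_mb_per_s:
        # The backlog grows when writes outpace the link and drains otherwise,
        # but can never fall below zero.
        backlog = max(0.0, backlog + (rate - link_mb_per_s) * interval_s)
    return backlog


def replication_lag_seconds(pending_mb, link_mb_per_s):
    """Approximate time needed to ship the pending writes off-site."""
    return pending_mb / link_mb_per_s


# Two minutes of 20 MB/s writes over a 12.5 MB/s (100 Mbit/s) link.
pending = backlog_mb([20, 20], link_mb_per_s=12.5, interval_s=60)
print(pending)                                  # 900.0
print(replication_lag_seconds(pending, 12.5))   # 72.0
```

In this example the remote copy is about 72 seconds behind the primary at the end of the burst, which is the data that would be lost if the primary site failed at that moment. A quieter period lets the backlog drain and the effective RPO shrink again.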
As discussed, virtualisation provides benefits through the abstraction of physical hardware components. One other consideration in implementing disaster recovery/business continuity is to build recovery capabilities directly into the application itself.
Application resiliency is achieved by running many instances of the application, each of which can fail and be restarted on alternative hardware. This kind of design isn't directly dependent on virtualisation, but can work well where multiple hypervisors and hardware configurations are implemented. In the future, we will see BC/DR resiliency implemented using containers, a form of application virtualisation that is at the beginning of widespread adoption.
With reference to disaster recovery/business continuity principles such as RTO and RPO, we can apply recovery options to the requirements of the application. Some applications will get full high availability/fault tolerance, whereas others may simply be backed up using hypervisor snapshots. In some instances, continuous backups or full high availability/fault tolerance with array-based replication can be justified. It's simply a case of applying the technology to the requirements.
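To make the matching of technology to requirements concrete, here is a minimal sketch of a decision rule. The function `suggest_protection` and its thresholds, including the one-hour RPO cut-off, are illustrative assumptions rather than recommendations; a real plan would also weigh cost, distance between sites and supplier support.

```python
from datetime import timedelta

ZERO = timedelta(0)


def suggest_protection(rto: timedelta, rpo: timedelta) -> str:
    """Map an RTO/RPO pair to one of the options discussed in this article.

    The thresholds are assumptions chosen for illustration only.
    """
    if rto == ZERO and rpo == ZERO:
        # No outage and no data loss tolerated.
        return "fault tolerance"
    if rpo == ZERO:
        # A short restart is acceptable, but no data may be lost.
        return "high availability on shared storage"
    if rpo < timedelta(hours=1):
        # A small amount of data loss is acceptable.
        return "continuous asynchronous replication"
    # Yesterday's data is good enough.
    return "hypervisor snapshot backups"


print(suggest_protection(ZERO, ZERO))                                  # fault tolerance
print(suggest_protection(timedelta(minutes=30), ZERO))                 # high availability on shared storage
print(suggest_protection(timedelta(hours=4), timedelta(hours=24)))     # hypervisor snapshot backups
```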