In this concluding part of a two-part series, Computer Weekly looks at ways of testing disaster recovery (DR). In the first article, we discussed the need for disaster recovery and for developing a strategy to test the backup process.
We discussed four main items that need to be evaluated to ensure successful testing. These were:
- Time – Evaluating the time since a test was last performed and measuring the time to complete recovery, from an RTO (recovery time objective) perspective.
- Change – Testing after major changes occur in the environment, such as application upgrades or infrastructure (hypervisor) changes.
- Impact – What is the impact of running a test? Can a test be run without impacting the production environment?
- People – How do we consider the human factor from the perspective of taking human error out of the recovery process?
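The time and change criteria above lend themselves to a simple automated check. The following is a minimal Python sketch rather than a prescribed method: the `DrTestPolicy` class, the function names and the 90-day and four-hour figures used in the usage note are all illustrative assumptions, not values taken from any product or standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class DrTestPolicy:
    # Hypothetical policy values; take the real ones from your own DR plan.
    max_test_age: timedelta  # longest acceptable gap between DR tests
    rto: timedelta           # recovery time objective for the application


def dr_test_reasons(policy: DrTestPolicy, last_test: datetime,
                    last_major_change: datetime, now: datetime) -> List[str]:
    """Return the reasons a new DR test is due (an empty list if none)."""
    reasons = []
    # Time: too long has passed since the last test.
    if now - last_test > policy.max_test_age:
        reasons.append("time")
    # Change: a major change has happened since the last test.
    if last_major_change > last_test:
        reasons.append("change")
    return reasons


def meets_rto(policy: DrTestPolicy, recovery_started: datetime,
              recovery_finished: datetime) -> bool:
    """Compare the measured recovery time of a test against the RTO."""
    return recovery_finished - recovery_started <= policy.rto
```

With an assumed policy of `DrTestPolicy(max_test_age=timedelta(days=90), rto=timedelta(hours=4))`, a test last run five months ago would be flagged for the reason "time", and a recovery that took five hours would fail the RTO check.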
In a virtual environment, the options for recovery can be divided into four main sections.
Hardware-based replication provides a well-established process to implement a disaster recovery strategy. Virtual machine (VM) data is replicated between two arrays in synchronous or asynchronous mode, at the LUN or volume (file) level.
In a VMware environment, Site Recovery Manager (SRM) provides the capability to manage the failover process and do automated failover testing.
Typically, SRM provides non-disruptive capabilities by using volume snapshots taken from the remote (or target) storage array at the recovery site. These snapshots are used to instantiate copies of production virtual machines at the DR site.
SRM handles the creation of an isolated network to validate the application without impacting production. Alternatively, a dedicated test network can be created and used in the testing process. This provides more scope to create a realistic test environment.
Tools such as SRM can be used to frequently test applications, but there’s a significant manual aspect to the testing, and resources (storage and hypervisor) need to be available at the disaster recovery site to complete the test process.
It’s worth remembering that array-based replication works at the LUN level, so it covers every virtual machine (VM) stored on a replicated LUN. VMware only recently made replication of individual VMs possible, with vSphere 6.5 and what’s being called VVOLs 2.0 (VASA 3.0).
At the time of writing, there appear to be no suppliers that offer hardware support for VVOLs replication other than Tintri, which implements this as a proprietary feature.
The hypervisor can be used to implement disaster recovery through features such as changed-block tracking. This provides an application programming interface (API) that allows backup products to access and copy changed data at a per-VM level.
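The idea behind changed-block tracking can be illustrated with a toy model. The Python sketch below is not the hypervisor's API: `TrackedDisk`, `incremental_backup` and the four-byte block size are invented for illustration. It shows only the principle that writes mark blocks as changed and the next backup copies just those blocks.

```python
from typing import Set

BLOCK_SIZE = 4  # bytes per block; unrealistically small to keep the example readable


class TrackedDisk:
    """A toy virtual disk that records which blocks change between backups."""

    def __init__(self, size_blocks: int) -> None:
        self.data = bytearray(size_blocks * BLOCK_SIZE)
        self.changed: Set[int] = set()

    def write(self, offset: int, payload: bytes) -> None:
        if not payload:
            return
        if offset < 0 or offset + len(payload) > len(self.data):
            raise ValueError("write is outside the disk")
        self.data[offset:offset + len(payload)] = payload
        # Mark every block touched by this write as changed.
        first = offset // BLOCK_SIZE
        last = (offset + len(payload) - 1) // BLOCK_SIZE
        self.changed.update(range(first, last + 1))


def incremental_backup(disk: TrackedDisk, replica: bytearray) -> int:
    """Copy only the changed blocks to the replica; return how many were copied."""
    copied = 0
    for block in sorted(disk.changed):
        start = block * BLOCK_SIZE
        replica[start:start + BLOCK_SIZE] = disk.data[start:start + BLOCK_SIZE]
        copied += 1
    # Reset tracking so the next backup sees only new changes.
    disk.changed.clear()
    return copied
```

In this model, a six-byte write at offset 2 touches blocks 0 and 1, so the next incremental backup copies two blocks rather than the whole disk.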
Veeam’s SureBackup feature in Veeam Backup & Replication 9.0 provides the capability to recover a virtual machine into an isolated environment for testing purposes. The data is derived from standard image-based backups.
The test process instantiates VMs directly from the virtual machine repository without the need for additional storage and checks a range of metrics to ensure the VM is valid. Once the test is complete, the virtual machine is decommissioned and a success/fail report generated and sent to the backup administrator.
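The general pattern of running a set of checks against a recovered VM and producing a success/fail report can be sketched in a few lines of Python. This is not Veeam's implementation or API; `RecoveryReport`, `verify_vm` and the check names in the usage note are hypothetical, and the checks themselves are supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class RecoveryReport:
    vm_name: str
    results: Dict[str, bool] = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        # A report with no checks at all is treated as a failure.
        return bool(self.results) and all(self.results.values())

    def summary(self) -> str:
        status = "SUCCESS" if self.passed else "FAIL"
        failed = [name for name, ok in self.results.items() if not ok]
        detail = f" (failed: {', '.join(failed)})" if failed else ""
        return f"{self.vm_name}: {status}{detail}"


def verify_vm(vm_name: str, checks: Dict[str, Callable[[], bool]]) -> RecoveryReport:
    """Run each named check against the recovered VM and collect the results."""
    report = RecoveryReport(vm_name)
    for name, check in checks.items():
        try:
            report.results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed check.
            report.results[name] = False
    return report
```

Called with, say, a heartbeat check and a ping check that pass and an application check that fails, `summary()` returns `app01: FAIL (failed: application)`, which is the kind of line that could be sent to the backup administrator.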
In Microsoft Hyper-V, failover testing can be performed directly using Hyper-V Replica.
The Replica process maintains copies of production virtual machines in a secondary location. The “Test Failover” option within the “Replication” menu (or through PowerShell) for an individual virtual machine creates a test VM at the recovery site, with a suffix appended to its name to indicate its status as a test machine.
Once created, the administrator can power up the virtual machine and perform testing of the application, remembering to put the VM onto an isolated network (this is not done automatically).
Replica recovery testing only allows for one test instance per production virtual machine, so administrators have to ensure manual cleanup is performed. The availability of PowerShell for Hyper-V provides the option to fully script a disaster recovery testing process that can be run against individual virtual machines without having to execute a series of manual tasks.
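As a rough illustration of what such a script might contain, the Python sketch below builds the sequence of PowerShell commands for a single VM without executing them. It assumes the Hyper-V module cmdlets `Start-VMFailover -AsTest`, `Connect-VMNetworkAdapter`, `Start-VM`, `Stop-VM` and `Stop-VMFailover`, run on the Replica server, and a test VM suffix of " - Test"; the function names, the VM name and the switch name are hypothetical, and the parameters should be checked against the Hyper-V module documentation before use.

```python
from typing import List


def ps_quote(value: str) -> str:
    """Quote a value as a PowerShell single-quoted string literal."""
    return "'" + value.replace("'", "''") + "'"


def setup_commands(vm_name: str, isolated_switch: str,
                   test_suffix: str = " - Test") -> List[str]:
    """Commands to create the test VM, isolate it and power it on."""
    test_vm = vm_name + test_suffix  # assumed naming of the test VM
    return [
        f"Start-VMFailover -VMName {ps_quote(vm_name)} -AsTest",
        # Isolation is not automatic, so attach the test VM to a test-only switch.
        f"Connect-VMNetworkAdapter -VMName {ps_quote(test_vm)} "
        f"-SwitchName {ps_quote(isolated_switch)}",
        f"Start-VM -Name {ps_quote(test_vm)}",
    ]


def cleanup_commands(vm_name: str, test_suffix: str = " - Test") -> List[str]:
    """Commands to power off the test VM and end the test failover."""
    test_vm = vm_name + test_suffix
    return [
        f"Stop-VM -Name {ps_quote(test_vm)} -TurnOff -Force",
        f"Stop-VMFailover -VMName {ps_quote(vm_name)}",
    ]


if __name__ == "__main__":
    for line in setup_commands("SQL01", "DR-Isolated") + cleanup_commands("SQL01"):
        print(line)
```

Generating the commands separately from running them keeps the plan reviewable, and always emitting the cleanup commands addresses the limit of one test instance per production VM.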
Another system for taking backups through the hypervisor is to use a dedicated virtual machine to handle data traffic.
This is the approach used by Zerto, which places a proxy virtual machine on primary and secondary VMware clusters and effectively acts as a splitter for the data traffic as it is read and written by the host. Write input/output (I/O) is replicated to the remote site or the public cloud, from where the application can be recovered after a failure or tested for recovery.
The VM hypervisor system is also used by Datto, a data protection company. In this instance, data is protected on a local physical appliance and replicated to Datto’s backup cloud infrastructure, from where the application can be started for testing or real recovery.
Druva, another data protection company, also provides the ability to restore and recover into the public cloud. Druva’s technology is capable of injecting drivers into the virtual machine image, which enables the VM to be booted in an environment such as Amazon Web Services (AWS).
Replication into the public cloud is a powerful solution, both for disaster recovery and DR testing. Customers don’t need to retain hardware assets in a remote location and can simply pay for the time they operate in “DR Mode”.
From a testing perspective, the public cloud offers the capability to test on-demand, with costs associated only with the time for which resources are used.
Secondary storage systems
Over the past few years, a number of companies, including Rubrik, Cohesity and Actifio, have released products that address the need to manage copy data and backups.
These systems allow what is called “secondary data” to use backup images for other purposes, such as seeding test/dev environments and enterprise search and discovery.
These platforms can also be used for disaster recovery and DR testing by recovering virtual machines that run directly off the secondary storage platform. At present, this has to be done as a manual process because no supplier offers automated recovery testing in its products. However, most offer API access to their platforms, so recovery testing could be developed as a scripted process.
The benefit of using a secondary platform is that the systems are built specifically to allow data recovery with minimal or no impact to the process of taking backups. This means testing across many virtual machines can be performed without worrying that production data protection is being affected.
We can see from these system options that the implementation of disaster recovery testing has a specific set of requirements. These are:
- The ability to copy/replicate the virtual machine image to a secondary location. This is typically part of the existing disaster recovery process.
- The ability to isolate the VM from the production network and run it on a network used for testing only.
- The ability to validate the status of the restored application. This will always need to be more than purely booting the virtual machine.
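The three requirements above map naturally onto the steps of a scripted test. The Python sketch below is one possible shape for that script, under stated assumptions: `DrPlatform` is a hypothetical interface, not any supplier's API, and `FakePlatform` is an in-memory stand-in so the flow can be exercised without real infrastructure.

```python
from typing import Callable, Dict, Iterable, Protocol


class DrPlatform(Protocol):
    """Hypothetical interface; real products expose their own APIs."""

    def has_replica(self, vm: str) -> bool: ...
    def start_isolated_copy(self, vm: str, network: str) -> str: ...
    def destroy_copy(self, copy_id: str) -> None: ...


def run_dr_test(platform: DrPlatform, vm: str, test_network: str,
                app_check: Callable[[str], bool]) -> bool:
    # Requirement 1: a copy of the VM must exist at the secondary location.
    if not platform.has_replica(vm):
        return False
    # Requirement 2: start the copy on a network used for testing only.
    copy_id = platform.start_isolated_copy(vm, test_network)
    try:
        # Requirement 3: validate the application, not just that the VM booted.
        return app_check(copy_id)
    finally:
        # Always remove the test copy, even if validation raised an error.
        platform.destroy_copy(copy_id)


class FakePlatform:
    """In-memory stand-in used only to demonstrate the flow."""

    def __init__(self, replicas: Iterable[str]) -> None:
        self.replicas = set(replicas)
        self.running: Dict[str, str] = {}

    def has_replica(self, vm: str) -> bool:
        return vm in self.replicas

    def start_isolated_copy(self, vm: str, network: str) -> str:
        copy_id = f"{vm}-test"
        self.running[copy_id] = network
        return copy_id

    def destroy_copy(self, copy_id: str) -> None:
        self.running.pop(copy_id, None)
```

The `app_check` callable is where application-level validation belongs, for example running a known query against a restored database. The `finally` clause ensures the test copy is cleaned up whether or not validation succeeds.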
These features are table stakes in building a testing plan, but the ultimate goal is a continuous disaster recovery testing capability.
Companies such as Continuity Software (which offers AvailabilityGuard) provide the capability to automate the disaster recovery testing process and run testing on demand.
This becomes important as IT organisations move towards DevOps methods of application deployment, where the application may be changed multiple times per day.
The integration of DR testing and DevOps processes is perhaps immature at the moment, but represents one area where suppliers can start to add value to their existing products.