Disaster recovery for virtual servers

Server virtualisation helps make disaster recovery easier by representing everything in logical terms and doing away with the need for a physical replica of your environment.

By Cath Everett, contributor

The rise of virtualised x86 servers is changing the disaster recovery (DR) process, making it easier and more cost-effective while requiring more careful planning.

A traditional disaster recovery setup – with a second entirely physical environment – is expensive and often requires companies to prioritise the machines they most need to safeguard, leaving others exposed. Server virtualisation helps make disaster recovery easier by representing everything in logical terms and doing away with the need for a physical replica of your environment.

A medium-sized organisation with 100 servers, for example, could expect to pay as much as £150,000 per annum for DR services and still cover only a small percentage of its servers, estimated Andrew Cooke, a principal consultant at City of London-based services provider Intercept IT. With virtualisation, it can cover everything for the same cost.

"In traditional DR arrangements, companies often only protect about 20% of their servers – they'll pick and choose due to the cost involved. But in a virtualised environment, we say 'why not cover everything because it won't cost you any more?'" Cooke said.

So while using virtual servers improves the DR process, it also changes it and makes planning more crucial.
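
Cooke's figures can be turned into a rough per-server comparison. The sketch below is only illustrative: it takes the quoted ceiling of £150,000 a year, the 100-server estate and the "about 20%" coverage at face value, and it assumes the cost is spread evenly across protected servers, which the article does not state.

```python
# Figures quoted in the article; the even per-server split is an assumption.
ANNUAL_DR_COST_GBP = 150_000
TOTAL_SERVERS = 100
PROTECTED_SHARE = 0.20  # "about 20%" of servers in a traditional arrangement


def cost_per_protected_server(annual_cost: float, total_servers: int, share: float) -> float:
    """Return the annual DR spend per server that is actually covered."""
    protected = round(total_servers * share)
    if protected <= 0:
        raise ValueError("at least one server must be protected")
    return annual_cost / protected


traditional = cost_per_protected_server(ANNUAL_DR_COST_GBP, TOTAL_SERVERS, PROTECTED_SHARE)
virtualised = cost_per_protected_server(ANNUAL_DR_COST_GBP, TOTAL_SERVERS, 1.0)

print(f"Traditional: £{traditional:,.0f} per protected server")
print(f"Virtualised: £{virtualised:,.0f} per protected server")
```

On those assumptions, the same budget works out at a fifth of the cost per protected server once every machine is covered.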

DR in a physical world

Traditional methods of recovering from disaster have often relied on moving backed-up data to suitable hardware at a DR site. Physical servers at the second location are either ready and waiting – ideally with operating systems and applications fully configured and ready to have backup data restored to them – or are brought in by a DR service provider and set up as part of the recovery process. Both methods are usually time-consuming.

"You generally need a reasonably large number of highly skilled people to effect recovery and those resources may not always be at hand," said Rupert Green, virtualisation specialist at IT services provider Logicalis. "And you'll generally find that people struggle to recover everything, particularly within the timescales the business expects."

Even if data is replicated from a primary to a secondary site over a WAN, or the network is configured to undertake failover, you need similar or identical hardware at both facilities. Both sets of hardware also have to be configured and updated in lockstep with each other, a task that can become a real – and expensive – headache for large enterprises with hundreds of servers.

DR in a virtualised environment

A real advantage of undertaking DR in a virtual environment is that recovery procedures become less knowledge- and more process-based.

This is because the whole shebang – data, applications and operating system – is encapsulated in a hardware-independent virtual machine (VM), while the data resides on shared storage. As a result, you can copy VMs and data using traditional backup for bare-metal restore in a disaster scenario.

Or, better still, you can replicate everything at regular, pre-defined intervals over a wide-area network to a second server and storage environment at a remote data centre and invoke failover.

Upon failover, the replication process is halted to prevent data corruption and storage at the secondary site is designated as live. The hypervisor then scans the array for LUNs, adds the VMs it finds to its inventory and powers them on in a pre-defined order so that they run on local servers. Because the process is automated, it reduces the amount of day-to-day administration required to keep systems at remote sites configured correctly.
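
The failover sequence described above can be sketched as a toy model. Everything here is hypothetical: the `VM` and `SecondarySite` classes, the LUN names and the `boot_priority` field are invented for illustration and do not correspond to any hypervisor or storage array API.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    boot_priority: int  # lower numbers start first (hypothetical convention)
    powered_on: bool = False


@dataclass
class SecondarySite:
    """Toy model of a DR site; not a real hypervisor or array API."""
    luns: dict[str, list[VM]]  # LUN id -> VMs stored on it
    replication_active: bool = True
    storage_live: bool = False
    inventory: list[VM] = field(default_factory=list)

    def fail_over(self) -> list[str]:
        # 1. Halt replication so the target copy stops changing.
        self.replication_active = False
        # 2. Designate storage at the secondary site as live.
        self.storage_live = True
        # 3. Scan the LUNs and register every VM found on them.
        for lun_id in sorted(self.luns):
            self.inventory.extend(self.luns[lun_id])
        # 4. Power the VMs on in the pre-defined order.
        started = []
        for vm in sorted(self.inventory, key=lambda v: v.boot_priority):
            vm.powered_on = True
            started.append(vm.name)
        return started


site = SecondarySite(luns={
    "lun-01": [VM("mail", 3), VM("database", 1)],
    "lun-02": [VM("app-server", 2)],
})
boot_order = site.fail_over()
print(boot_order)
```

The point of the sketch is that the same four steps apply to every VM, which is why the procedure lends itself to automation.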

Logicalis' Green said, "The process for recovering each of your VMs is the same so you generally have a much higher chance of success and you don't need a lot of technical skills to effect failover."

As long as one staff member understands the process and clearly documents it, he added, "you can gather the troops to work on other stuff such as internet connectivity to ensure you can recover within your SLAs."

Legal firm Kennedys went down the virtualisation route in July 2008 with the help of Dell Consulting Services as part of an office relocation and infrastructure upgrade. It introduced FalconStor Software's Network Storage Server, Application Snapshot Director for VMware and HyperTrac Backup Accelerator to undertake data replication at the transaction level.

Kennedys had previously used a managed backup service, but found that costs were spiralling due to a growing use of email.

"DR in a virtualised world is much easier if it all happens by disk-to-disk replication," Kennedys IT director Ian Lauwerys said.

Testing disaster recovery in a virtualised environment

Disaster recovery testing also becomes easier in a virtualised world. Cloned VMs can be used to test failover and bring the production environment up at a secondary site. This type of testing tends to be carried out infrequently with physical servers because of the time required and the disruption to the business.

VMware's Site Recovery Manager (SRM) tool makes it even easier to test. SRM integrates with the storage management layer and automates cross-site failover. It also generates reports for auditing purposes.

"You need to test that failover works, and it's an ongoing thing so we revisit it every month to see that everything's working as it should be," said Jonathan Bruce, head of IT at intellectual property management firm Rouse. "And SRM helps because it allows you to do a 'fire-drill' failover."

Bruce said failover at Rouse now takes less than 60 minutes, which is within the organisation's recovery time objective of 12 hours. Before it began planning DR for its virtualised server estate in September 2008, Rouse simply backed up its core systems to tape and stored the media offsite.
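
A monthly fire drill of this kind boils down to comparing a measured failover time with the recovery time objective. The sketch below uses the 12-hour objective quoted for Rouse; the 55-minute drill result and both function names are assumptions chosen to sit inside the "less than 60 minutes" Bruce describes.

```python
from datetime import timedelta

RTO = timedelta(hours=12)  # recovery time objective quoted for Rouse


def drill_within_rto(measured: timedelta, rto: timedelta = RTO) -> bool:
    """Return True when a test failover finished inside the objective."""
    return measured <= rto


def headroom(measured: timedelta, rto: timedelta = RTO) -> timedelta:
    """Return how much time was left before the objective was breached."""
    return rto - measured


# Hypothetical drill result, not a figure from the article.
last_drill = timedelta(minutes=55)

print("Within RTO:", drill_within_rto(last_drill))
print("Headroom:", headroom(last_drill))
```

Recording the headroom as well as the pass or fail result makes it easier to spot failover times creeping up from one test to the next.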

DR planning

A good DR plan is critical for protecting a virtualised environment, because there are issues that don't exist with physical servers.

Rouse's Bruce said one key consideration is working out and taking into account dependencies between applications. "Dependencies are always there and it's an issue that you still have to deal with, with or without virtualisation," he said.

That means in a virtualised world, it is necessary to replicate or back up dependent systems simultaneously to ensure no data inconsistencies result at the restore stage. Replication technology for virtualised environments should also be integrated at the hypervisor and application level to ensure consistency during restores. Vendors such as EMC and NetApp currently provide such application-aware offerings, but most others are still working on it.

The order in which systems are brought up should be prioritised by placing the various VMs that make up a service into the same LUN to enable failover and/or recovery within designated recovery time and recovery point objectives (RTOs and RPOs).
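
One way to derive a start-up order from application dependencies is a topological sort, sketched here with `graphlib` from Python's standard library (3.9 or later). The service names and the dependency map are hypothetical, and this illustrates the ordering problem rather than how any replication or recovery product works.

```python
from graphlib import TopologicalSorter

# Hypothetical service: each VM maps to the VMs it depends on.
dependencies = {
    "web-frontend": {"app-server"},
    "app-server": {"database", "directory"},
    "database": {"directory"},
    "directory": set(),
}


def boot_order(deps: dict[str, set[str]]) -> list[str]:
    """Return a start-up order in which every VM follows its dependencies.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(deps).static_order())


order = boot_order(dependencies)
print(order)
```

A circular dependency raises an error instead of producing an order, which is a useful check to run when the DR plan is written, not during a failover.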

"You need to understand the requirements of the business and their expectations and try to match that to your technology," Bruce said. "So, you have to provide realistic time-scales for recovery or everyone will be unhappy. Our core systems are pretty much replicated whenever there's a change to the data so our recovery time is now about four hours."

DR can be provided using active-active data centre configurations, which undertake production load sharing between primary and secondary sites, as well as active-passive ones. In the latter instance, production systems can be recovered or failed over to development or standby machines at a secondary site or to an outsourced facility.

If you use an active-active configuration to enable failover in both directions rather than an active-passive one, which only enables failover in one direction, you must ensure that a continuous replication product is in place at both sites.

"The design has to take into account your storage architecture and how you want to fail your services over," Logicalis' Green said. "So you need to look at what you've got today, how you want to deliver the service, failover requirements and timings because you need to be able to prioritise."

