Demystifying VMware Site Recovery Manager and its role in DR

VMware Site Recovery Manager (SRM) simplifies disaster recovery, but managing it is tricky if users don’t know how it handles VMware snapshots and how it affects backup software.

VMware Site Recovery Manager (SRM) is a package that simplifies disaster recovery (DR) in an organisation’s virtual guest operating systems and improves the success of DR processes. It also manages the synchronisation of an organisation’s Virtual Center data (guest VM configurations) between the primary and backup site.

But there’s much more to VMware SRM than meets the eye. In this tip, virtualisation expert Mike Laverick demystifies some of its complexities and shows how SRM administrators can manage this important tool.

How Site Recovery Manager handles VMware snapshots

As many virtualisation administrators are aware, snapshots form the bedrock of virtual machine backup. Many users want to know how SRM handles snapshots and what the impact on backup software is.

It is important to understand how VMware snapshots are handled, and whether they are a concern, with or without VMware Site Recovery Manager. After all, for many virtualisation professionals, offsite backups remain a valuable recovery strategy so long as the time it takes to restore the data meets their recovery time objective (RTO).

Snapshots form the bedrock of virtualised backup because they are needed to “unlock” the files that make up the virtual machine (VM) in the file system. Without snapshots, backup admins cannot touch these files when the VM is powered on.

A snapshot consists of two parts -- the VM’s existing virtual disk, which is frozen at the moment the snapshot is taken, and a “delta” VMDK file. Whilst a snapshot is engaged, all disk writes are redirected to these “delta” files, which grow in size during the backup process. Snapshots have always been a bit of a management headache because if you create them manually, or they’re not handled correctly by your backup vendor, they can keep growing and take up disk space.

What’s important to know is that snapshots are not supported for VMs replicated with VMware vSphere Replication (now a part of Site Recovery Manager 5). If admins protect a VM with vSphere Replication, no snapshot data will be available after the recovery. In contrast, when VMs are protected with array-based replication, snapshots are replicated across arrays. So there’s a clear schism between the two replication processes. If a VM were protected using replication from a storage vendor, one would expect the recovered VM to have a snapshot.

According to VMware’s official admin guide for VMware SRM, “array-based replication supports recovering snapshots and linked clones, but they have limitations based on CPU types. (vSphere Replication does not support recovering snapshots and linked clones.)”

VMware advises that if virtualisation admins want support for using certain types of VMware Consolidated Backup (VCB) snapshots at the recovery site, then the ESX hosts at both sites must have compatible CPUs, “as defined in the VMware knowledge base articles VMotion CPU Compatibility Requirements for Intel Processors and VMotion CPU Compatibility Requirements for AMD Processors”.

Data backup best practices

My recommendation here would be to ensure that users copy the backup data to their DR location, so it will be available to the organisation in case of a DR event. It isn’t advisable to depend on replication alone in case you experience a significant corruption of data.

Additionally, if users have VMs that will be running in the DR location for longer than expected, they will want to quickly re-establish the backup schedule. Without the original backup data they would be forced to back up all the recovered machines from scratch, which will take up storage space and, more importantly, time. As for the snapshot data, it’s unlikely that a backup vendor will be automatically handling that unless users have also failed over the original backup system (backup jobs, metadata, backup files) to the DR location.

Another tip is to speak to your backup vendor about their best practices. Check if you can use the vendor's software if a DR event occurs. Some vendors recommend “importing” backup data into a hot standby at the DR location, and some recommend that you include recovering the backup management system as part of your recovery. Admins may need a tool such as Alan Renouf’s “vCheck” script to monitor these VM snapshots.

The vCheck script analyses your vSphere environment and checks it for common problems and errors. If the vCheck script discovers orphaned snapshots, you could consider committing them using PowerShell if the backup software does not remove them as part of its regular backup schedule.
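To illustrate the kind of check involved, here is a minimal Python sketch that flags snapshots older than a chosen age. It is not vCheck's actual logic and does not call any vSphere API. The SnapshotRecord type, its fields, the three-day threshold and the sample data are all assumptions made for illustration, standing in for whatever inventory export you have available.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SnapshotRecord:
    """Hypothetical inventory record; not a vSphere API object."""
    vm_name: str
    snapshot_name: str
    created: datetime
    delta_size_gb: float


def stale_snapshots(records, now, max_age=timedelta(days=3)):
    """Return snapshots older than max_age, largest delta first."""
    stale = [r for r in records if now - r.created > max_age]
    return sorted(stale, key=lambda r: r.delta_size_gb, reverse=True)


# Illustrative data only.
report_time = datetime(2012, 6, 1, 9, 0)
inventory = [
    SnapshotRecord("web01", "pre-patch", datetime(2012, 5, 1, 9, 0), 42.0),
    SnapshotRecord("db01", "backup-temp", datetime(2012, 5, 31, 23, 0), 1.5),
]

for snap in stale_snapshots(inventory, report_time):
    age_days = (report_time - snap.created).days
    print(f"{snap.vm_name}: '{snap.snapshot_name}' is {age_days} days old, "
          f"delta {snap.delta_size_gb} GB")
```

Sorting by delta size puts the snapshots consuming the most disk space at the top of the report, which is usually where you would start committing.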

How VMware SRM handles Storage VMotion (SVMotion) events

It’s important to know that neither Storage VMotion (SVMotion) nor Storage DRS (SDRS) is fully integrated with SRM. This doesn’t mean users can’t use SVMotion or SDRS, but users should be fully aware of the consequences when using them. It is all about the unforeseen consequences of improperly planned SVMotion usage. Unlike its older brother VMotion, SVMotion can take time. After all, it has to move gigabytes (GB) of data from one data store to another. Here are a few consequences that users must consider:

  1. Moving a VM from one data store to another could move the VM to a volume or logical unit number (LUN) where no replication is configured. This can leave the migrated VM completely unprotected.
  2. Moving a VM from one data store to another could mean that the frequency of replication could change massively -- from a data store replicated every 15 minutes, to one that’s only replicated once every 24 hours. This can easily compromise protection for mission-critical workloads.
  3. Moving a VM from a data store that isn’t replicated to one that is could generate a burst of replication traffic as the array does a full copy of all the changes to the LUN or volume. This will cause a noticeable hit to network and storage system performance and potentially trigger performance alarms from any third-party systems management tools in the environment.
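Because nothing flags these cases automatically, it can help to classify a planned move before starting it. The following Python sketch maps a source and target data store to one of the three consequences listed above. The Datastore type and its replication_interval_min field are hypothetical; in practice the replication schedule would come from your storage array's management tools, not from SRM or vSphere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Datastore:
    """Hypothetical record; interval is None when the data store is not replicated."""
    name: str
    replication_interval_min: Optional[int]


def classify_svmotion(source: Datastore, target: Datastore) -> Optional[int]:
    """Return 1, 2 or 3 for the consequences listed above, or None if none applies."""
    src = source.replication_interval_min
    dst = target.replication_interval_min
    if src is not None and dst is None:
        return 1  # VM leaves replicated storage and becomes unprotected
    if src is not None and dst is not None and src != dst:
        return 2  # replication frequency changes
    if src is None and dst is not None:
        return 3  # VM lands on replicated storage, triggering a full copy
    return None  # not covered by the three consequences above


gold = Datastore("gold", 15)         # replicated every 15 minutes
bronze = Datastore("bronze", 1440)   # replicated every 24 hours
scratch = Datastore("scratch", None)  # not replicated

print(classify_svmotion(gold, scratch))
print(classify_svmotion(gold, bronze))
print(classify_svmotion(scratch, gold))
```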

At the moment the vSphere Client gives no warnings or alerts about these events. So before embarking on SVMotion, users must think about which of these scenarios affects them.

Figure 1: The “remove protection” button can be used to unregister the protected VM prior to the SVMotion.

Scenario 1 requires the SRM administrator to remove protection from the VM before the move, and to refresh SRM to remove it from its list once the SVMotion has completed, as seen in Figure 1.

Scenario 1 is more likely to happen because an administrator hasn’t thought through the consequences of the SVMotion. After all, once a VM is protected by replication, it’s unlikely that you would want to remove it -- unless it was a mistake or the VM was being decommissioned and archived.

Scenario 2 requires you to unprotect the VM in Site Recovery Manager before carrying out the SVMotion, and then re-protect it once it has arrived at the new location. But bear in mind it may be some time before the VM is fully replicated to the DR location, depending on the replication frequency and the bandwidth available to complete the synchronisation.

Scenario 3 is relatively simple: The VM can be treated as a new VM requiring protection from SRM. Again, you may have to wait for some time before the VM’s files are replicated to the DR location.
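To get a rough feel for how long that wait might be, you can divide the amount of data to be copied by the usable bandwidth of the inter-site link. The Python sketch below does that arithmetic. The function name and the efficiency factor (the share of the nominal link speed actually available to replication, assumed here to be 70%) are illustrative assumptions; real arrays may compress or deduplicate data, so treat the result as an estimate only.

```python
def initial_sync_hours(vm_size_gb: float, link_mbps: float,
                       efficiency: float = 0.7) -> float:
    """Estimate hours needed to copy vm_size_gb over a link of link_mbps.

    Uses decimal units (1 GB = 10**9 bytes, 1 Mbit/s = 10**6 bits/s).
    efficiency is an assumed fraction of the link usable for replication.
    """
    if vm_size_gb < 0:
        raise ValueError("vm_size_gb must not be negative")
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    bits_to_send = vm_size_gb * 8 * 1000**3
    usable_bits_per_second = link_mbps * 1000**2 * efficiency
    return bits_to_send / usable_bits_per_second / 3600


# A 200 GB VM over a 100 Mbit/s link at the assumed 70% efficiency.
print(f"{initial_sync_hours(200, 100):.1f} hours")
```

Until that copy completes, the VM is not fully protected at the DR location, so it is worth running the numbers before moving a mission-critical workload.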

It’s up to the administrator to manually manage the SVMotion process, understanding both the consequences and the steps required to keep Site Recovery Manager up to date with the changes taking place. For this reason, although it is advisable to use SDRS for the “initial placement” of the VM, I wouldn’t recommend turning on its capacity to initiate an SVMotion to improve the VM disk performance.

Managing swap files and temporary data with VMware SRM

Figure 2: You can change the default location of the VM swap file. By default it is created wherever the .VMX file is located – which means it could end up being replicated unnecessarily.

The general recommendation is to relocate swap files to data stores that are not replicated to make sure they don’t unnecessarily chew up bandwidth between the protected site and the recovery site. It’s generally accepted that relocating the VM swap file is a relatively trivial task, given it is so easy to do from the vSphere Client as seen in Figure 2.

The swap file that resides inside a VM is much trickier, mainly because how it is managed varies significantly from site to site. Some sites relocate the swap file in Windows to a P: drive as standard, while other sites choose not to do this at all.

Users should start with a company-wide review of these configuration settings and attempt to agree on an organisation-wide approach. If you relocate the swap file of a VM to a non-replicated data store, then when you protect the VM for the first time you will receive an error stating “Device Not Found: Hard disk”. If you edit the settings of the VM in Site Recovery Manager, you should find that the virtual disk which is not being replicated is marked as "not replicated".

Figure 3: If all files that make up a VM are replicated, the VM is protected when the Protection Group is created. But if only parts of the VM are being replicated, you will see an error message.

The correct procedure here is to use the “Detach” button, as seen in Figure 3, to remove the device. If it is not detached, then when the VM boots at the Recovery Site it will try to find the virtual disk in its original location (in the above example, the data store is called “infrastructureNYC”). Bearing in mind there’s likely to be no communication path to allow this, and the Protected Site could be a smoking crater, there’s little point in retaining this mapping. Remember, if a VM attempts to power on and cannot find all its virtual disks, the power-on will fail completely.
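If you have many VMs to review, you can work out in advance which virtual disks will need detaching. The Python sketch below takes a VM's disk paths in the usual vSphere "[datastore] folder/file.vmdk" form and returns those that sit on data stores outside a set you know to be replicated. It does not talk to SRM or vCenter. The function names, the file names and the “replicatedNYC” data store are hypothetical; only “infrastructureNYC” comes from the example above.

```python
import re

# Matches the "[datastore] folder/file.vmdk" form of a vSphere disk path.
_DATASTORE_PATH = re.compile(r"^\[(?P<datastore>[^\]]+)\]\s*(?P<path>.+)$")


def datastore_of(vmdk_path: str) -> str:
    """Extract the data store name from a '[datastore] path' string."""
    match = _DATASTORE_PATH.match(vmdk_path.strip())
    if match is None:
        raise ValueError(f"not a datastore path: {vmdk_path!r}")
    return match.group("datastore")


def disks_to_detach(vmdk_paths, replicated_datastores):
    """Return the disks that are not on a replicated data store."""
    return [p for p in vmdk_paths
            if datastore_of(p) not in replicated_datastores]


# Illustrative data only.
vm_disks = [
    "[replicatedNYC] web01/web01.vmdk",
    "[infrastructureNYC] web01/web01_swap.vmdk",
]
for disk in disks_to_detach(vm_disks, {"replicatedNYC"}):
    print("Detach before recovery:", disk)
```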

It’s important to consider the consequences here. If a Windows VM powers on expecting its swap file to be on P: and the drive isn’t there, then the default behaviour is for the swap file to be created on the C: drive instead. So make sure there’s plenty of free disk space on C: to make this work.

There is more to managing VMware SRM than meets the eye – especially when you get into the nitty-gritty of production use with all the complexities of different customer configurations.

Mike Laverick is a former VMware instructor with 17 years of experience in technologies such as Novell, Windows, Citrix and VMware. Since 2003, he has been involved with the VMware community. Laverick is a VMware forum moderator and member of the London VMware User Group. He is also the man behind the virtualisation website and blog RTFM Education, where he publishes free guides and utilities for VMware customers. Laverick received the VMware vExpert award in 2009, 2010 and 2011.

Since joining TechTarget as a contributor, Laverick has also found the time to run a weekly podcast called the Chinwag and the Vendorwag. He helped found the Irish and Scottish VMware user groups and now speaks regularly at larger regional events organised by the global VMware user group in North America, EMEA and APAC. Laverick has published books on VMware Virtual Infrastructure 3, vSphere 4, Site Recovery Manager and View.
