Unplanned networking outages: How vCSHB can help

Our VMware expert explains the features of vCSHB and tests the unplanned loss of VMware channels and the principal public network.

Editor's Note: This is part four of a vCenter Heartbeat series from Mike Laverick. For the rest of this series, check out part one, part two and part three.

Test 5: Unplanned loss of both the VMware channels
If the VMware channel fails for whatever reason, failover is not triggered and the primary active vCenter server just carries on running. Applications such as the vSphere client and other services such as VMware View are unaffected. In vCSHB, the principal public network cannot be configured as a backup to the VMware channel.

However, an alert does appear to indicate that the primary can no longer replicate its changes to the secondary server, and a question mark is placed over the secondary to indicate to the administrator that there is a network problem.

Figure 1

Before the VMware channel is restored, it is important to use Configuration Manager to check that the passive server is correctly configured to understand which server is active, and only then restart the Neverfail R2 service. Once replication begins, a system check verifies the integrity of the file system. You can monitor this consistency check from the Data tab, using the Replication option.

Figure 2

Test 6: Unplanned loss of both the principal public network and the VMware channel
The next test I did was a failure of all networking to the primary vCenter. It's worth saying that this situation is very unlikely to happen. It would need the failure of four physical NICs, assuming that both the principal public network and the VMware channel were protected by teamed NICs. Alternatively, it would take the simultaneous failure of two core switches. Many folks would say the network outage itself would then be the real concern, rather than its impact on technologies like VMware HA or vCSHB.

If the recovery is not managed correctly by the administrator, restoring the network is likely to create a "split-brain" situation. In this case, failover happens to the secondary vCenter, but when network connectivity is restored to the primary, both the primary and secondary vCenter can believe they are the active vCenter server. In the screen grab below, I deliberately create a split-brain situation by not following the correct recovery procedures.

As you can see below, both the primary and secondary believe themselves to be active. You can see that the failover was successful to the secondary vCenter server. You can tell this because the secondary's status indicates it is "active following failover."

Figure 3

vCSHB detects this split-brain scenario by looking for a special marker or flag. In the other tests, the vCenter server that "failed" is normally marked as passive when either a planned or unplanned outage takes place. In the case of failure of both network channels, this doesn't happen; the server just "drops off" the network. When network connectivity is restored, each server believes that it is the "active" partner. Clearly, two active systems are not tolerated, and it's safer for vCSHB to force a split-brain condition in an effort to stop data loss from occurring.

If split-brain is detected, vCSHB by design stops the vCenter service on the secondary server. It then marks both the primary and secondary vCenter servers as being "passive." The packet filter inside both primary and secondary stops all network communications. This enables the administrator to decide which node should be made active.
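The behaviour described above can be pictured as a small state model. What follows is only a conceptual sketch in Python, not vCSHB's actual implementation: the `Node` class, the `resolve_on_reconnect` and `promote` functions and all field names are hypothetical, invented purely to illustrate the sequence of stopping the service on the secondary, marking both nodes passive, filtering their network traffic and then waiting for a manual decision.

```python
from dataclasses import dataclass

ACTIVE = "active"
PASSIVE = "passive"


@dataclass
class Node:
    """One half of a hypothetical active/passive pair (illustrative only)."""
    name: str
    role: str
    service_running: bool = True
    network_filtered: bool = False


def resolve_on_reconnect(primary: Node, secondary: Node) -> bool:
    """Fence both nodes if each believes it is active; return True if fenced."""
    if primary.role == ACTIVE and secondary.role == ACTIVE:
        # The protected service is stopped on the secondary.
        secondary.service_running = False
        # Both nodes are marked passive and their packet filters block traffic.
        for node in (primary, secondary):
            node.role = PASSIVE
            node.network_filtered = True
        return True
    return False


def promote(node: Node, peer: Node) -> None:
    """Manual operator step: make exactly one node active again."""
    peer.role = PASSIVE
    node.role = ACTIVE
    node.service_running = True
    node.network_filtered = False


primary = Node("primary", ACTIVE)
secondary = Node("secondary", ACTIVE)  # active following failover

split_brain = resolve_on_reconnect(primary, secondary)
fenced_roles = (primary.role, secondary.role)

# The administrator decides that the secondary should carry on as active.
promote(secondary, primary)
```

The point of the model is that nothing becomes active again until `promote` is called, which mirrors the manual decision the administrator has to make.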

In a LAN configuration, it is probably less significant which server you indicate as active, although personally I would always go for the vCenter that was selected in the failover. In a WAN configuration, this scenario is a little bit more complicated, as there may well be a built-in latency between the active and passive roles. If the administrator made the role active on the vCenter server that was holding the "older" data, the replication engine could theoretically overwrite a more recent version of the vCenter system.

In other words, getting this process wrong might mean losing data from the vCenter database. For this reason, the system makes recovery a manual rather than an automatic process, so a human operator can intercede and take the correct steps. To do this, you need to stop the vCSHB service on the server you want to set as being passive and then run the Configure Server utility, which allows you to manually set the correct state. In the screen grab below, I'm telling the primary vCenter that the secondary is the current active server.

Figure 4

Once set, I would then start the Neverfail R2 service on the primary, confirm it had started correctly and restore network connectivity. This would prevent the split-brain scenario from taking place altogether.
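The reasoning behind choosing which node to make active can be sketched in a few lines of Python. This is an illustration of the decision only, under the assumption that the administrator can establish when each node last accepted changes; vCSHB is not claimed to expose such values. The function name `pick_active` and the timestamps are made up for the example.

```python
from datetime import datetime
from typing import Mapping


def pick_active(last_write_by_node: Mapping[str, datetime]) -> str:
    """Return the name of the node whose data was written most recently."""
    if not last_write_by_node:
        raise ValueError("no candidate nodes supplied")
    # The node with the newest data should become active, so that
    # replication flows from newer data to older data and not the reverse.
    return max(last_write_by_node, key=last_write_by_node.get)


# Hypothetical timestamps: the primary was isolated first, while the
# secondary carried on accepting changes after the failover.
candidates = {
    "primary": datetime(2011, 1, 10, 9, 0),
    "secondary": datetime(2011, 1, 10, 9, 45),
}
chosen = pick_active(candidates)
```

With these example values the secondary is chosen, which matches my own preference for the vCenter that was selected in the failover.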

If you consider tests 4, 5 and 6, it's perhaps salutary to think how easy it is to utterly disconnect a virtual machine (VM) from the network with a few simple clicks of the mouse. In the physical world, I would have to locate network cables and pull them out completely to achieve the same result. It's also worth saying that for many VMware admins in very large organisations, the network layer may well be outside of their control. You could find yourself vulnerable to vCenter outages when the service is working perfectly fine.

Additionally, let's say a network disconnection takes place because some person has been fiddling with network cables at the back of the server, and they mistakenly unplug the two cables that make up the team responsible for the principal public network or the VMware channel. The natural thing for them to do, having realised how stupid they have been, would be to plug them back in. That could be quite dangerous. In the time they were unplugged, a failover could well have occurred. By plugging those cables back in, they could accidentally create a split-brain situation. I imagine such an individual wouldn't be the most popular guy in the server room.

I've no doubt that vCSHB delivered greater uptimes than I previously enjoyed (or in some cases endured), as until then I had only protected my virtual vCenter by putting it in a VMware HA cluster. With VMware HA, you are only protecting the VM from an ESX host failure. It does nothing to protect you from failure of the services within your operating system. So as I went from doing nothing with protecting vCenter to vCSHB, it was certainly a step up! And of course, if you still insist on running vCenter on a physical machine, VMware HA offers no protection whatsoever.

From my discussions with customers who have used vCSHB for some time, it's clear that the new vCSHB 6.3 offers improvements in setup and configuration over the previous release. But it's clear I benefited from being able to easily create my secondary vCenter by using VM cloning. My life would have been much harder had my vCenter been a physical box.

Some customers do have very high expectations of vCSHB. They perhaps naively think that the setup and post-configuration will be a simple next-next-finish exercise. As this article has shown, some work needs to be done to validate the configuration and test it properly, so you are aware of the different outcomes.

Additionally, recovering from various vCenter outages once vCSHB has been implemented needs an administrator who properly understands the relationship between the active/passive roles and the primary/secondary servers, in order to avoid split-brain situations and to prevent accidental outages caused by operator error.

As ever with all availability software, it's the abilities of the administrator that will affect the quality of service it provides. I would have to admit that sometimes I caused problems for myself by accidentally carrying out the right task on the wrong vCenter Server!

At the moment, vCSHB is the only supported way of protecting the vCenter service. However, it's regarded by some VMware customers as an expensive add-on to the core vSphere purchase, and I think more customers would be willing to adopt the technology if it were rolled into one of the top-level vSphere SKUs.

Another factor is whether your environment has a significant number of high-level services that depend on vCenter availability. If you mainly run vCenter as a management service without other service dependencies, it becomes hard to construct a use case that justifies the cost of vCSHB. As ever with availability software, whether customers feel a pressing need for technology like vCSHB depends very much on the perceived cost of the outage.

It's perhaps important to remember that everyone backs up their VMs, and most VM backups depend on vCenter being available for VMs to be accessed during the backup window. The equation is simple for a lot of VM backups. No vCenter? No backup! The same applies to VMware's new vCloud Director. No vCenter? No vCloud!

I think this article demonstrates a couple of key "takeaways." Firstly, if you are considering implementing vCSHB, then you need to correctly set the expectations in the minds of the users who might be affected by the downtime caused if no such solution is in place.

So it's unreasonable to expect that if vCenter fails for whatever reason, the failover process will be instantaneous. A failure of the primary vCenter is going to be noticeable to anyone using the vSphere client. As I've said before, the focus shouldn't be on the vSphere client impact; the real reason for including vCSHB is to protect the vCenter service dependencies. As we saw with the View 4.5 example, there is no absolute cast-iron guarantee that if vCenter fails, it will be restarted quickly enough to remove the need for remedial action elsewhere.

Finally, with this in mind, I think the correct way to view vCSHB is that it offers better availability to services that previously had none at all. In the absence of any built-in resiliency in the VMware management platform, vCSHB is certainly worth considering.

Secondly, if you intend to deploy vCSHB, you need to ask yourself some key questions. The answers to these will then determine your deployment steps. These include:

1. What machine type are you going to use? Two physical vCenters (P2P)? One physical vCenter with a secondary virtual vCenter (P2V)? Two virtual vCenters (V2V)?
This question is important because each type comes with its own prerequisites, and a P2P solution will involve more work in building up the two identical vCenter servers if they run on a physical system.

The model you adopt will also determine your cloning options. With the V2V model, you can use the simple clone feature in vCenter. With the P2V model, you can use VMware Converter to carry out a P2V of the existing physical vCenter to create your secondary vCenter.

Finally, with the P2P model you can use Microsoft Backup and vCSHB's own internal cloning process. By definition, the P2P model is a much harder configuration to get right.

2. Are your vCenter and database held separately or in the same Windows instance?
If vCenter and its database are held separately, you will have to install vCSHB twice -- once to protect the primary vCenter and again to protect the back-end database. To some degree, having vCenter and the database together simplifies the setup of vCSHB, but it could introduce scalability issues with vCenter in doing so.

3. LAN or WAN configuration?
If it's a LAN configuration, the primary and secondary vCenter reside on the same network and you are creating an availability solution for vCenter. Once you stretch the vCSHB product to include more than one geographical location, you are then trying to bend and twist the technology to be more of a disaster recovery (DR) solution.

Care needs to be taken with the IP data used in the DR mode of vCSHB, and you may need to consider factors such as different IP ranges in use at the two sites and dynamic DNS updates to ensure the service is advertised correctly at failover.
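As a rough way of reasoning about the IP question, the following Python sketch checks whether the primary's public address falls inside the network in use at the secondary site. It rests on an assumption of mine: if the address is not valid at the second site, the secondary needs a different address and a DNS update at failover. The function names and the example addresses are hypothetical; only the standard library `ipaddress` module is used.

```python
import ipaddress


def same_subnet(primary_ip: str, secondary_site_network: str) -> bool:
    """True if the primary's public IP lies inside the secondary site's network."""
    address = ipaddress.ip_address(primary_ip)
    network = ipaddress.ip_network(secondary_site_network)
    return address in network


def needs_dns_update(primary_ip: str, secondary_site_network: str) -> bool:
    """Assumption: a DNS change is needed when the IP cannot move with the service."""
    return not same_subnet(primary_ip, secondary_site_network)


# Hypothetical LAN case: both servers sit on the same network.
lan_case = needs_dns_update("192.168.3.10", "192.168.3.0/24")

# Hypothetical WAN case: the DR site uses a different IP range.
wan_case = needs_dns_update("192.168.3.10", "10.20.0.0/16")
```

In the LAN example no DNS change is flagged, whereas the WAN example flags one. It is the second situation that needs the extra planning.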

Mike Laverick

ABOUT THE AUTHOR: Mike Laverick is a professional instructor with 15 years of experience with technologies such as Novell, Windows and Citrix, and has been involved with the VMware community since 2003. Laverick is a VMware forum moderator and member of the London VMware User Group Steering Committee. In addition to teaching, Laverick is the owner and author of the virtualisation website and blog RTFM Education, where he publishes free guides and utilities aimed at VMware ESX/VirtualCenter users. In 2009, Laverick received the VMware vExpert award and helped found the Irish and Scottish user groups. Laverick has had books published on VMware Virtual Infrastructure 3, VMware vSphere 4 and VMware Site Recovery Manager.
