vCSHB failover tests: Avoiding split-brain

Our VMware expert experiments with vCenter Server Heartbeat and explains how the service can help with failover and unexpected vCenter outages.

Caveat: With any failure, there is always a possibility of creating a "split-brain" situation if you fail to follow the right post-failover procedures. This is especially true in the case of network outages.

Split-brain is a situation in availability solutions where the primary and secondary nodes both erroneously believe they are the "active" server. It's best practice to investigate why the outage took place and take the steps to make sure that split-brain doesn't occur. It's not a case of simply plugging the network back in or rebooting and naively expecting vCenter Server Heartbeat (vCSHB) to fix itself.

vCSHB's first job is to protect the integrity of your data. If at any stage it thinks split-brain could occur, it will shut down both the primary and secondary services and set both vCenters to be passive.

The correct way to deal with any failure is to take the problem server offline, perhaps isolating it from the network, and stop the Neverfail R2 service (if it hasn't stopped already). Then use the Configuration Manager utility to make sure the failed server knows that its partner is now the "active" node. As a result, once the failed server is reconnected to the network and the Neverfail R2 service is restarted, it joins as the "passive" server.

Once I had vCSHB configured, it was time to see what level of availability it could really deliver. I decided on a series of tests designed to put the product through its paces, each progressively more aggressive and critical. After each test, I would fail back the vCenter service to make sure the primary was the "active" server. This was mainly for consistency in the tests and screen grabs.

Before I begin, I think it's perhaps best if we ask ourselves what reasonable expectations we should have of vCSHB. The product makes no claims of fault tolerance, but it should provide a reasonably seamless restart of the vCenter services. Remember, vCSHB is an availability service that is designed to restart the vCenter service if it fails on the primary vCenter, or restart it on the secondary vCenter if it detects the primary has failed completely.

So just like VMware HA offers a restart of virtual machines (VMs) rather than continuous availability of the VM, the same applies to vCSHB. Of course, with vCSHB the secondary vCenter is already booted and waiting for the primary vCenter to fail, so the failover time is much quicker than a restart of a VM with VMware HA. Whilst vCSHB undoubtedly adds greater availability to a service that previously had none, it can't defy the laws of physics. As an illustration of this, here are the steps that take place when failover occurs (sketched in code after the list):


  1. The "make active" trigger is sent.
  2. Applications are shut down gracefully on the primary vCenter.
  3. The Neverfail packet filter is applied to prevent further data reaching the primary vCenter.
  4. Any existing data in the primary vCenter queue is transferred to the secondary passive vCenter.
  5. The primary vCenter is then marked as being passive.
  6. Any data in the secondary vCenter queue is applied to disk.
  7. The Neverfail packet filter on the secondary vCenter is disabled, exposing the server to the network.
  8. The vCenter services are then started.
  9. The secondary server is now active.
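
To make that ordering easier to see in one place, here is a minimal Python sketch that models the sequence above. It is purely illustrative: the class and method names are invented for this example and are not Neverfail's actual internals.

class Node:
    """A toy stand-in for a vCSHB-protected vCenter node (illustration only)."""
    def __init__(self, name):
        self.name = name
        self.active = False

    def log(self, message):
        print(f"[{self.name}] {message}")

def switchover(primary, secondary):
    primary.log("step 1: make-active trigger received")
    primary.log("step 2: protected applications shut down gracefully")
    primary.log("step 3: packet filter applied, no new data accepted")
    primary.log("step 4: remaining send queue transferred to the partner")
    primary.active = False
    primary.log("step 5: marked passive")
    secondary.log("step 6: queued data applied to disk")
    secondary.log("step 7: packet filter removed, server revealed on the network")
    secondary.log("step 8: vCenter services started")
    secondary.active = True
    secondary.log("step 9: now active")

if __name__ == "__main__":
    switchover(Node("vc4nyc (primary)"), Node("vc4nyc (secondary)"))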

Test 1: Planned manual failover -- maintenance mode for vCenter?
My first test would be a planned failover from the primary vCenter to the secondary vCenter. I assume this would be a common occurrence caused by patching and update procedures, such as Windows updates. One use of active/passive clustering solutions like vCSHB is to drive service uptime up to the point that no one notices when the service is taken offline for essential maintenance or upgrade tasks. Whilst a ping -t to the vCenter server did consistently report replies, the failover process is, as you would expect, not completely unobtrusive.

If you have any vSphere clients open, by default they will be disconnected as the service on the primary vCenter is gracefully stopped and control is handed over to the secondary vCenter. With that said, you would be unlikely to deploy vCSHB merely to protect vSphere Client connectivity. In my experiments, I did try increasing the timeout value in the vSphere Client's settings menu; however, I still found the client was disconnected.


Figure 1

During an unplanned failure, I found the vSphere client allowed more time to reconnect to the lost vCenter. However, even once the vCenter services had restarted on the secondary vCenter, I still had to log in again to get back into vCenter.

With hindsight, this method of testing is perhaps not entirely fair. The vSphere client is, after all, not "cluster" aware, in the sense that any lengthy disconnect is likely to cause it to fail. Additionally, all connections between vCenter and the client are secured with SSL, and this is likely to be the cause of the dropped connections I observed.


I also think it's a reasonable expectation that if a failure did occur in the real world, it would be likely to affect users of the vSphere client, just as it would other client applications. At the end of the day, the vSphere Client is likely very "stateful" and not very receptive to dropped sessions, so it may be something we have to live with until the vSphere Client's networking changes. In fairness, the real reason to test vCSHB is to verify that it gives better availability to the services that are dependent on it, such as VMware View, vCloud Director, Site Recovery Manager and so on.

As for carrying out this maintenance task, I personally think the manual failover process is best initiated on the secondary vCenter. There is an important caveat to remember if you attempt this whilst using Microsoft RDP to connect to vCenter.

Remember that in vCSHB, both the primary and secondary vCenter share the same IP addresses and the same DNS host name. This can make it difficult to connect to the right server using the Microsoft RDP method. The server that responds to any inbound RDP requests is whichever one holds the "active" status at any given time.

There are two ways around this issue. If you are working in a V2V configuration like me, you might prefer to connect the vSphere client directly to the ESX host and then open the VMware Remote Console. As this approach totally bypasses vCenter, you are safe from any disconnect created by the failover process, and you are guaranteed to be connected to the right VM for your management.

If you were running vCenter on a physical server and wanted to run the vCSHB tools from its desktop, you might be better off using the server's ILO/DRAC card. To be honest, this issue is perhaps best resolved by installing the vCSHB tools on your management PC alongside the vSphere client.

The screen grab below shows the progress of transferring the active status to the secondary vCenter, a process that most folks would call a failover and that Neverfail calls a "switchover." It took just under one minute and 30 seconds to complete, from step 0 to step 7. However, the vCenter service was available much sooner than this: at around step 4, when vCSHB carried out the "Reveal secondary on the network" task.


Figure 2

Test 2: Unplanned physical failure
For my next test, I wanted to try a hard failure by crashing the ESX host where the primary vCenter was currently running. To ensure that the primary and secondary would never reside on the same ESX host, I used VMware DRS anti-affinity rules to keep the two VMs apart. On top of DRS being enabled, I also had VMware HA enabled so I could watch the primary vCenter server being restarted by the VMware HA process.


Figure 3
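
As an aside, the anti-affinity rule above can also be scripted rather than created in the vSphere client. Below is a rough Python sketch using the pyVmomi library; it assumes you already have a connected ServiceInstance ("si"), and the cluster, VM and rule names are placeholders I have made up for the example, so treat it as a sketch rather than a finished tool.

from pyVmomi import vim

def find_by_name(content, vimtype, name):
    # Walk the inventory and return the first managed object with a matching name.
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(obj for obj in view.view if obj.name == name)

def add_anti_affinity_rule(si, cluster_name, vm_names, rule_name):
    # Build a DRS anti-affinity rule so the listed VMs are kept on separate hosts.
    content = si.RetrieveContent()
    cluster = find_by_name(content, vim.ClusterComputeResource, cluster_name)
    vms = [find_by_name(content, vim.VirtualMachine, n) for n in vm_names]
    rule = vim.cluster.AntiAffinityRuleSpec(
        name=rule_name, enabled=True, mandatory=True, vm=vms)
    spec = vim.cluster.ConfigSpecEx(
        rulesSpec=[vim.cluster.RuleSpec(operation="add", info=rule)])
    # modify=True merges the rule into the existing cluster configuration.
    return cluster.ReconfigureComputeResource_Task(spec, modify=True)

# Example call (placeholder names):
# task = add_anti_affinity_rule(si, "GoldCluster01",
#                               ["vc4nyc", "vc4nyc-secondary"], "keep-vcshb-apart")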

Then, using the ILO/RAC feature available onboard most modern servers, I crashed the ESX host holding the primary "vc4nyc" VM. Of course, I could have just powered off the primary vCenter virtual machine, but I believe in rigorous testing (as I hope this article demonstrates). I strive to make my tests as realistic as possible, and I wanted to see how VMware HA interacted with vCSHB in a fully virtualised model.

In this case, the vCenter service was started on the secondary server. The primary vCenter VM was successfully powered back on by VMware HA, and DRS once again separated the primary and secondary vCenter servers within the VMware cluster. Once the primary had rebooted and was back online, it checked its configuration and joined with the secondary once again. Because the primary wasn't cleanly shut down, vCSHB by default marked it as the "passive" node, so I was safe from any split-brain situation.

Test 3: Unplanned failure of vCenter during VMware View deployment
For my third test, I tried a much more realistic approach. I decided to crash the vCenter server at the very point that VMware View needs to speak to vCenter, which is during the creation of virtual desktop pools. I created a new virtual desktop pool, started the provisioning process, and then crashed the ESX host and, with it, the primary vCenter. Of course, it was going to be difficult to monitor the success of the deployment process without the immediate availability of the vSphere Client. Sadly, the vCSHB failover did not engage quickly enough to stop the View deployment process from throwing an error.


Figure 4

Fortunately, with VMware View it is possible to restart the provisioning process once the vCenter server is up and running again. What I would like to see is these high-level management tools becoming more "vCSHB aware." For example, I think it would be great if services like VMware View and VMware SRM had an option to mark their respective vCenter configurations as "vCSHB" enabled. So if a job like provisioning new virtual desktops failed because of a vCenter outage, that job could be restarted, or cleaned up and a new job begun.

Test 4: Unplanned loss of the principal public network
My next test was to see how vCSHB would react to the primary vCenter becoming "orphaned" from the network, which would happen if there were a critical failure of its NIC or switch connectivity.

Remember, the principal public network is the channel that receives inbound requests from the vSphere client and other services that have vCenter dependencies. To monitor the outage, I set up a ping -t to the primary vCenter server. To simulate the network failure itself, I had the vSphere client open directly on the ESX host where the primary vCenter was located, and I disconnected its network interface from the "Edit Settings" dialog box.


Figure 5

By causing a network disconnect, I was able to watch the "automatic switchover" process. It took precisely 10 pings for this to be triggered, just as I had configured earlier, and from the secondary vCenter I was able to observe the process using the vCSHB console.
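
To illustrate the general principle behind that threshold -- declaring an outage only after a run of consecutive missed heartbeats, rather than on the first dropped packet -- here is a minimal Python sketch. It is not vCSHB's own monitoring code; the host name, threshold and interval are placeholders.

import platform
import subprocess
import time

HOST = "vc4nyc"        # placeholder: the protected vCenter's DNS name
THRESHOLD = 10         # consecutive failures before declaring an outage
INTERVAL_SECONDS = 1

def ping_once(host):
    # Send a single echo request; return True if we got a reply.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    result = subprocess.run(["ping", count_flag, "1", host],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

def monitor(host):
    misses = 0
    while True:
        if ping_once(host):
            misses = 0          # a single success resets the counter
        else:
            misses += 1
            print(f"missed ping {misses}/{THRESHOLD} to {host}")
            if misses >= THRESHOLD:
                print("threshold reached: this is where a switchover would fire")
                return
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    monitor(HOST)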


Figure 6


The failover to the secondary vCenter was successful, but it did take a long time to complete. This is because the first step taken by vCSHB is an attempt to stop the services on the active vCenter that has become orphaned from the network, and that step takes a while: vCenter is configured with a lengthy timeout as it attempts to reconnect to its back-end SQL database. This "feature" is something I discovered whilst working with the Neverfail guys during the writing of this article.

The database downtime timeout defaults to 30 retry attempts at an interval of 60 seconds. That can mean a wait of up to 30 minutes for the vCenter services to eventually stop. Fortunately, it is possible to edit the vCenter configuration file (vpxd.cfg) and decrease this value to a more acceptable wait time in the context of using vCSHB. The vpxd.cfg file is held in C:\Users\All Users\VMware\VMware VirtualCenter; this path is actually an alias to C:\ProgramData, which is a hidden folder in Windows. Two parameters can be added to the vpxd.cfg file under the ODBC settings -- the first is called "maxDatabaseDowntime" and the second "retryInterval" -- like so:

<config>
  <vpxd>
    <odbc>
      <maxDatabaseDowntime>120</maxDatabaseDowntime>
      <retryInterval>10000</retryInterval>
    </odbc>
    <das>
      <serializeadds>true</serializeadds>
      <slotMemMinMB>256</slotMemMinMB>
      <slotCpuMinMHz>256</slotCpuMinMHz>
    </das>
    <filterOverheadLimitIssues>true</filterOverheadLimitIssues>
  </vpxd>
  <vmacore>
    <threadPool>
      <TaskMax>30</TaskMax>
    </threadPool>
  </vmacore>
</config>
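
One practical note on the file above: vpxd.cfg is only read when the VMware VirtualCenter Server service starts, so the service needs a restart before the new values take effect. Going by the parameter names, maxDatabaseDowntime appears to be a value in seconds and retryInterval a value in milliseconds, which would cut the worst-case wait from 30 minutes to roughly two minutes; treat that as my reading of it and check the KB article mentioned below for the definitive values.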

I worked quite closely with Neverfail and VMware during the writing of this article, and I'm pleased to see that a KB article on this issue has been recently released.

The next step in the recovery process would be to fully investigate the cause of the network outage to the primary vCenter and fix the failed component. In my case this was very easy: I simply needed to reconnect the VM's NIC that corresponded to the principal public network and restart the Neverfail R2 service. Before doing this, it is recommended to run the vCSHB "Configure Server" utility to confirm that the orphaned server is aware of which server is currently "active" and that it is set to be the "passive" server.

Editor's Note: For the rest of this series, check out part one, part two and the final part.

Mike Laverick

ABOUT THE AUTHOR: Mike Laverick is a professional instructor with 15 years of experience with technologies such as Novell, Windows and Citrix, and has been involved with the VMware community since 2003. Laverick is a VMware forum moderator and member of the London VMware User Group Steering Committee. In addition to teaching, Laverick is the owner and author of the virtualisation website and blog RTFM Education, where he publishes free guides and utilities aimed at VMware ESX/VirtualCenter users. In 2009, Laverick received the VMware vExpert award and helped found the Irish and Scottish user groups. Laverick has had books published on VMware Virtual Infrastructure 3, VMware vSphere 4 and VMware Site Recovery Manager.
