Achieving high availability in a virtualised environment

High availability (HA) in virtual environments now demands tools like spare hardware provisioning and system health checks for rapid fault detection.

Networking infrastructure equipment achieves greater reliability when it is architected for high availability (HA) and built from a mix of commercial off-the-shelf (COTS) hardware and commercial and open source software components. Systems at the core and the edge of the network, once highly dependent upon custom and proprietary platforms, today build on standards-based carrier-grade OSes, Service Availability Forum APIs and AdvancedTCA hardware, and boast five and six nines of availability.

By combining key HA technologies and practices with virtualisation, data centres can also realise the benefits of higher availability for existing mainstream data centre hardware and software platforms. This tip explains the essentials of HA and how to use high availability methods to increase data centre availability.

High availability defined and measured

Availability is commonly expressed as the ratio of acceptable system uptime to the total time in a given period, most often one year. So, if your installation can tolerate one day of downtime in the course of 365, then your required availability equals 364/365, or about 99.73%.

Systems offering high degrees of availability promote themselves in terms of the number of nines supported. Highly available systems boast four, five or six nines.

Nines  Application                       Uptime %    Downtime per year
2      Office equipment                  99%         3 days, 15.6 hours
3      Most IT infrastructure            99.9%       8.76 hours
4      Internet infrastructure           99.99%      52 minutes, 34 seconds
5      PSTN and other business critical  99.999%     5 minutes, 15 seconds
6      Carrier class core/edge           99.9999%    31.5 seconds
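
The downtime figures in the table follow directly from the availability percentage. As a minimal sketch, assuming a 365-day year (the function name is illustrative and not taken from any HA product), the yearly downtime budget for a given number of nines can be computed like this:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def downtime_seconds_per_year(nines: int) -> float:
    """Return the yearly downtime budget, in seconds, for N nines."""
    unavailability = 10.0 ** -nines  # e.g. 3 nines -> 0.001
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for nines in range(2, 7):
        seconds = downtime_seconds_per_year(nines)
        uptime_pct = 100.0 * (1.0 - 10.0 ** -nines)
        print(f"{nines} nines  {uptime_pct:.4f}%  "
              f"{seconds:12.2f} s  ({seconds / 3600:.4f} h)")
```

Each additional nine cuts the downtime budget by a factor of ten, which is why moving from three to five nines is a much larger engineering step than the percentages suggest.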

In the real world, downtime is estimated from statistically obtained values for mean time to failure (MTTF). Just as important is the time needed to repair a fault: the mean time to repair (MTTR).

Availability, then, is calculated as:

Availability = MTTF / (MTTF + MTTR)

If a system or component offers 50,000 hours MTTF, and it takes on average 15 minutes to repair or replace it (e.g., to find and swap out a disk or a blade), then availability for that system would equal 99.9995%, or five nines.
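
The worked example can be checked with a few lines of code. This is a sketch only: `availability` implements the formula above, while `max_mttr_hours` is a hypothetical helper that rearranges the same formula to ask how fast repairs must be to reach a target availability.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR), with both values in hours."""
    return mttf_hours / (mttf_hours + mttr_hours)


def max_mttr_hours(mttf_hours: float, target: float) -> float:
    """Largest MTTR (hours) that still meets the target availability.

    Rearranged from the formula above: MTTR = MTTF * (1 - A) / A.
    """
    return mttf_hours * (1.0 - target) / target


if __name__ == "__main__":
    # 50,000 hours MTTF and a 15-minute (0.25 hour) repair, as in the text.
    a = availability(50_000, 15 / 60)
    print(f"availability: {a:.6%}")

    # Repair-time budget for five nines with the same MTTF.
    budget = max_mttr_hours(50_000, 0.99999)
    print(f"MTTR budget for five nines: {budget * 60:.1f} minutes")
```

With the article's numbers the result agrees with the 99.9995% figure quoted above, and the second function shows that the same component stays within five nines as long as repairs average roughly half an hour or less.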

Using this formula, it is easy to see how architects can enhance total availability by using more reliable hardware and software components - thereby increasing MTTF - and/or by reducing the duration and impact of faults - by decreasing MTTR.

HA: Not one size fits all

Laypeople tend to think about catastrophic IT equipment failures lasting hours or days. By contrast, networked data and voice infrastructure systems are optimised to tolerate many frequent short outages, each often less than one second, and to recover quickly and gracefully.

In datacom and telecom, HA capability builds on a mix of specialised and COTS hardware and software. Today that mix includes AdvancedTCA blades, redundant Ethernet, RAID, Carrier Grade Linux, journaling file systems and HA middleware. Data centres and other enterprise IT locales can also improve availability with more conventional hardware and software.

Deploying these and other technologies helps achieve greater availability by:

  • Eliminating single points of failure - CPUs, storage, interfaces, programs, etc.
  • Accelerating fault detection, isolation and resolution

HA system architects achieve the first design goal primarily through redundancy, in particular by provisioning spare hardware and software in varying states of readiness:

  • Hot spares -- Extra instance(s) of running hardware or software with states closely or precisely tracking resources in actual use. A hot spare blade server would mirror or checkpoint transactions and state data of its active counterpart, minimising time and disruption for fail-over should the active instance fail.
  • Warm spares -- Available instances of hardware or software that have been powered up or initialised but which do not closely track the state of active resources. At failover, a warm spare must reconstitute active state information or restart previously active transactions or sessions.
  • Cold spares -- Instances of comparable hardware or software program images to substitute for failed active instances, but which must first power-on, boot, load or otherwise initialise and reconstitute all state information before failover can occur.

In general, the hotter the spare, the more expensive the solution.

The second design goal - accelerating fault detection, isolation and resolution - can build on existing fault detection mechanisms, like device driver time-outs and protocol retry. The following technologies increase availability by streamlining failover, periodically polling the state of running applications, and backing up and synchronising state information for running hardware and software:

  • Health monitoring -- Software APIs and hardware interfaces to monitor status of programs, interfaces, drivers and hardware itself
  • Heartbeating -- Healthy applications or nodes periodically check in with heartbeat monitor software. Failure to check in invokes remedial action by the monitor to restart or fail over the node
  • High/low watermarks -- Setting and resetting alarm conditions when resources like available memory, buffers, bandwidth, etc. reach near-critical and nominal states
  • Watchdogs -- System-wide timers that restart or reset applications and entire OSes upon timeout. Healthy nodes periodically reset the timers as they run; run-away or frozen systems let watchdogs time out
  • Checkpointing -- Applications and OSes, either themselves or through external daemons, periodically log or back up key data structures, entire data segments or memory images. Checkpoint data can reside off-line or be used to update warm/hot spares dynamically
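
The high/low watermark mechanism in the list above is essentially an alarm with hysteresis. The following is a minimal sketch, assuming resource usage is reported as a percentage; the class name and thresholds are illustrative and do not come from any particular HA middleware.

```python
class WatermarkAlarm:
    """Raise an alarm at the high watermark, clear it at the low watermark."""

    def __init__(self, low: float, high: float) -> None:
        if low >= high:
            raise ValueError("low watermark must be below high watermark")
        self.low = low
        self.high = high
        self.raised = False

    def update(self, usage: float) -> bool:
        """Feed one usage sample and return the current alarm state."""
        if not self.raised and usage >= self.high:
            self.raised = True   # near-critical: set the alarm
        elif self.raised and usage <= self.low:
            self.raised = False  # back to nominal: reset the alarm
        return self.raised


if __name__ == "__main__":
    alarm = WatermarkAlarm(low=70.0, high=90.0)
    for sample in (50, 91, 80, 69, 85):
        print(sample, alarm.update(sample))
```

Using two thresholds rather than one keeps the alarm from flapping when usage hovers around a single limit: once raised, it stays raised until usage falls all the way back to the low watermark.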

Leveraging virtualisation for high availability

The traditional locus of increasing availability in enterprise IT has been clustering, in which multiple systems or blades are loosely coupled together to act as a single system. Clustering solutions, unfortunately, have suffered from highly proprietary and intrusive implementation, and from conflicting design goals.

Clustering paradigms tend to force both independent software vendors and end-users to customise deployments to fit the architectures and APIs specific to vendors and their particular solutions. While unmodified production and legacy code do benefit from simple rehosting on clustered environments, the greatest benefits are realised through more thoroughgoing, intrusive and costly migration. Moreover, most clustering solutions tend to focus first on performance and load balancing, and second on enhancing availability; those that start with availability as a design goal usually offer lacklustre performance.

As an alternative, virtualisation can provide an economical platform for higher availability, hosting multiple redundant virtual instances of critical systems and resources rather than provisioning additional hardware. IT managers can gain availability explicitly, from redundant deployment of systems and applications in virtual machines, or implicitly, as pointed out by Fadi Nasser of embedded virtualisation supplier VirtualLogix: "Virtualisation lets enterprise appliances achieve higher availability with software techniques that inexpensively mimic traditional dedicated hardware-centric HA systems."

With minimal, incremental investments, IT managers can use virtualisation as an HA platform through:

  • Elimination of cold (physical) spares by maintaining snapshots of stable virtual machines.
  • Fast failover to warm spare virtual images.
  • Clustering with both virtual and physical machines or with virtual clusters spread across physical machines.
  • Isolation, monitoring and fast restart of unreliable applications and systems.
  • Improved availability of legacy code without re-architecting or adding HA application wrappers.
  • OS-level watchdogs and heartbeat monitors, implemented with simple scripts, timers and assertions, to force restart of virtual machines.
  • Teaming or fusing of physical and virtual network interfaces.
  • Virtualised MAC and IP addresses to ease load sparing and failover migration of network interfaces without huge networking configuration and routing impact.
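
A heartbeat monitor of the kind mentioned in the list above can be very small. The sketch below makes several assumptions: each virtual machine calls `beat()` periodically by some means not shown here, and the `restart` callback stands in for whatever command your virtualisation platform uses to restart a guest. All names are hypothetical.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Track last check-in times and flag VMs that miss their deadline."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tested
        self.last_seen: Dict[str, float] = {}

    def beat(self, vm_name: str) -> None:
        """Record a heartbeat from a healthy VM."""
        self.last_seen[vm_name] = self.clock()

    def overdue(self) -> List[str]:
        """Return the names of VMs whose last heartbeat is too old."""
        now = self.clock()
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)

    def check(self, restart: Callable[[str], None]) -> List[str]:
        """Invoke the restart callback for every overdue VM."""
        late = self.overdue()
        for name in late:
            restart(name)
            # Give the restarted VM a fresh window before flagging it again.
            self.last_seen[name] = self.clock()
        return late
```

A monitoring loop would call `check()` on a timer, passing a function that restarts or fails over the named guest. A watchdog is the same idea inverted: the timer fires the reset unless the healthy system keeps pushing the deadline back.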

Virtualisation and a little scripting can be used to implement traditional HA constructs:

  • Use local spare virtual machine instances for faster failover, but also force spares to run in virtual machines on remote systems to limit impact of hardware faults.
  • Checkpoint using virtual machine snapshot functionality.
  • Set and check high and low watermarks with shell commands like df and free and entries in the Linux /proc and /sys file systems on both host and virtual machine file systems.
  • Use local alarms and signals to implement watchdogs or build simple daemons to act as watchdogs across networks.
  • Invoke virtual machine instances via scripts that catch SIGCHLD signals.
  • Use readily available mechanisms like MIBs, BIOS calls, /proc and /sys entries for basic health monitoring.
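
As one illustration of the scripting approach above, here is a minimal supervisor sketch that launches a child process and restarts it when it exits abnormally. Rather than installing a signal handler for child termination directly, it blocks in `wait()`, which returns when the child exits. The command that starts a virtual machine is platform-specific, so the demo uses a placeholder Python child process; `supervise` and its parameters are hypothetical names.

```python
import subprocess
import sys
import time
from typing import List, Sequence


def supervise(cmd: Sequence[str], max_restarts: int = 3,
              backoff_s: float = 1.0) -> List[int]:
    """Run cmd, restarting it after a non-zero exit.

    Returns the exit code of every attempt, in order.
    """
    exit_codes: List[int] = []
    for attempt in range(max_restarts + 1):
        child = subprocess.Popen(cmd)
        exit_codes.append(child.wait())  # blocks until the child exits
        if exit_codes[-1] == 0:
            break  # clean shutdown: do not restart
        if attempt < max_restarts:
            time.sleep(backoff_s)  # brief pause before the next attempt
    return exit_codes


if __name__ == "__main__":
    # Placeholder for a real "start VM" command: a child that always fails.
    failing_child = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(supervise(failing_child, max_restarts=2, backoff_s=0.1))
```

Capping the number of restarts matters in practice: a guest that fails immediately on every start is better escalated to a spare on another host than restarted in a tight loop.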

The gotchas

Some HA techniques and technologies, however, outstrip the capabilities of virtualisation platforms:

  • Extremely rapid fault detection and failover (~50-100 milliseconds, à la telecom blades).
  • Elimination of all single points of failure without redundant hardware.
  • Application checkpointing and state synchronisation without additional software.
  • Comprehensive application and node health monitoring and heartbeating without additional software.
  • Fault tolerant and multi-path storage without dedicated hardware.


IT managers and architects can look to a rich and varied toolbox containing both commercial and community resources for enhancing availability. They gain new tools by combining COTS virtualisation with HA techniques, platforms and middleware. Enterprise virtualisation platform suppliers like VMware are starting to offer basic HA functionality in their product lines, with more aggressive approaches coming from embedded virtualisation suppliers that cater to networking infrastructure. You can also leverage commercial and open source middleware for health monitoring, heartbeating and failover, where the managed objects are no longer physical blades or interfaces but virtual machines, guest OSes and the applications running on them.

A good place to start is your own installation's history of faults and costly downtime. Make incremental investments, such as redundant provisioning across virtual machines and abstraction/virtualisation of key network interfaces, to protect your most critical resources.

Ultimately, virtualisation is just another tool to use to enhance availability and reliability. The heuristics and mechanisms described in this article will not themselves guarantee be
