Hyper-converged infrastructure and disaster recovery

Hyper-converged infrastructure products merge server, storage and hypervisor in scale-out nodes, so how can functionality in HCI products help deliver disaster recovery?

Hyper-converged infrastructure delivers cost savings through a combination of slimmed-down hardware and operational improvements.

The operational benefits of hyper-converged allow organisations to focus on the needs of the application rather than feed the infrastructure beast.

Disaster recovery is a major component of any infrastructure design, including hyper-converged infrastructure.

So, can hyper-converged architecture reduce the impact of implementing disaster recovery?

If so, what features in hyper-converged enable it?

Hyper-converged data protection

Hyper-converged infrastructure provides many capabilities through the scale-out node-based architecture of its products.

Under the hood, storage features like data deduplication and snapshots enable the copying of data in a cluster or between sites. Meanwhile, the implementation of storage is generally hidden from the customer and actions happen at the virtual machine or instance level.

And suppliers are starting to use the public cloud as a target for their systems. VMware, Nutanix and Scale Computing all have cloud operations. This makes sense, as third-party suppliers have supported virtual machine disaster recovery in the public cloud for some time.

Let’s dig into the detail of exactly what disaster recovery should mean and how hyper-converged infrastructure helps.

Disaster recovery requirements

Disaster recovery describes the process of returning to normal operations after a major incident that can result in downtime for one or more applications. Most disaster recovery incidents occur as a result of the following scenarios:

  • Site loss: A single datacentre suffers a major problem, such as a fire or flood.
  • Equipment loss: Some equipment in a datacentre is damaged.
  • Component failure: A single server or set of devices fails due to hardware issues or power problems.
  • Connectivity failure: Systems are inaccessible to the outside world due to networking problems.

Disaster recovery is typically part of a wider strategy of business continuity, with specific recovery times and recovery points dictated by cost and application requirements.

Historically, implementing disaster recovery has meant deploying redundant equipment, usually at a backup site. For fast recovery, this means incurring the expense of secondary storage arrays and standby servers.

Hyper-converged disaster recovery features

As we look at how hyper-converged can help in disaster recovery, it’s worth recapping some of the features hyper-converged solutions offer.

Typically, hyper-converged is implemented across multiple servers or nodes that collapse traditionally separate storage, server and virtualisation software into a single scale-out architecture. 

This scale-out capability enables IT departments to grow the infrastructure with much finer granularity; in many cases a single node at a time. It also allows workloads to be distributed across nodes, providing resiliency in the event of hardware failure.

Hyper-converged implements application abstraction, so users and administrators don’t have to know exactly where the data for an application resides. Applications are spread across a cluster of nodes, with individual node or storage device failure managed automatically through the hyper-converged infrastructure software.

The definitions for any single virtual machine are a combination of the metadata retained by the system, plus the data physically on disk. Recreating that virtual instance elsewhere only needs those two components.

We’ll see later that with features like data deduplication, applications can be recovered very easily to another location, without shipping lots of data.

Scale-out benefits

As most hyper-converged infrastructure solutions are scale-out, distributing data (and applications) across many nodes provides resiliency.

If these nodes can be safely geographically dispersed (using a metro cluster), then a hyper-converged infrastructure can protect against the failure of a single rack or a datacentre.

As data is distributed across multiple nodes, there is little or no impact in moving a virtual machine between locations.

Software-defined benefits

Although they are in many cases delivered as hardware products, hyper-converged infrastructure solutions are by their nature software-defined.

Applications are virtualised using a hypervisor like vSphere, Hyper-V or KVM. And storage is virtualised and presented as a logical pool of capacity, irrespective of the underlying hardware capabilities.

What this means is that an application on hyper-converged infrastructure is effectively hardware agnostic, and can be moved to another cluster or server with minimal performance impact.

It’s true that traditional virtualisation can do this, but replicating a virtual machine in traditional virtual solutions (such as vSphere) means either implementing shared storage (with multiple arrays) or using hypervisor-native replication features, which are generally much less efficient than the hyper-converged infrastructure storage layer.

The hypervisor and storage components of hyper-converged infrastructure offer additional features to make disaster recovery easier. Both vSphere and Hyper-V offer interfaces to manage virtual machine snapshots, taking only delta changes after the initial copy.

Hyper-converged infrastructure solutions have built-in snapshot capabilities that work in conjunction with the hypervisor or as an additional feature.

Finally, most hyper-converged infrastructure solutions implement data deduplication, making it easy to move applications around once the initial copy between locations has taken place.
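To illustrate why deduplication cuts the cost of moving an application between sites, here is a minimal Python sketch. It is not any supplier's implementation: the fixed 4KB chunk size, the SHA-256 fingerprints and the function names (chunk_digests, replicate, rebuild) are all assumptions made for the example, and the remote site is modelled as a simple dictionary.

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative chunk size, not taken from any product


def chunk_digests(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[tuple[str, bytes]]:
    """Split data into fixed-size chunks and fingerprint each one."""
    return [
        (hashlib.sha256(data[i:i + chunk_size]).hexdigest(), data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def replicate(data: bytes, remote_store: dict[str, bytes]) -> tuple[list[str], int]:
    """Copy data to remote_store, sending only chunks it does not already hold.

    Returns the recipe (ordered fingerprints needed to rebuild the data)
    and the number of bytes that actually had to cross the link.
    """
    recipe = []
    bytes_sent = 0
    for digest, chunk in chunk_digests(data):
        if digest not in remote_store:
            remote_store[digest] = chunk
            bytes_sent += len(chunk)
        recipe.append(digest)
    return recipe, bytes_sent


def rebuild(recipe: list[str], store: dict[str, bytes]) -> bytes:
    """Reassemble the original data from a recipe and a chunk store."""
    return b"".join(store[digest] for digest in recipe)


# Hypothetical remote site, modelled as an in-memory chunk store.
remote: dict[str, bytes] = {}

# Initial copy: two identical chunks plus one different chunk.
vm_image = b"A" * 8192 + b"B" * 4096
recipe_first, sent_first = replicate(vm_image, remote)

# Later copy: only the last chunk has changed.
vm_image_changed = b"A" * 8192 + b"C" * 4096
recipe_second, sent_second = replicate(vm_image_changed, remote)
```

In this sketch the initial copy sends two of the three chunks, because the repeated chunk is stored only once, and the later copy sends just the one chunk that changed. Real products typically use more sophisticated chunking and metadata handling.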

Cloud disaster recovery

Being software-defined means the public cloud can be used as a replication target. Many hyper-converged suppliers offer replication to the public cloud, either as a backup target or to fully deploy the application as a cloud instance. These options are discussed in more detail below.

Hyper-converged pitfalls

Are there any potential problems with hyper-converged infrastructure and disaster recovery?

Obviously, with no dedicated storage component there’s no offload for replication, which can make implementing synchronous replication harder.

Replication between hyper-converged infrastructure clusters is usually implemented by shipping snapshots rather than replicating individual blocks, because block-level replication carries a performance impact and system overhead.
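One consequence of shipping snapshots on a schedule is that the recovery point can never be zero. As a rough rule of thumb, and as an assumption of this sketch rather than a figure from any supplier, the worst-case data loss is about one snapshot interval plus the time taken to transfer the snapshot. The function names and example figures below are hypothetical.

```python
def worst_case_rpo_seconds(snapshot_interval_s: float,
                           changed_bytes: float,
                           link_bytes_per_s: float) -> float:
    """Estimate worst-case data loss for periodic snapshot shipping.

    Assumes a failure just before the next snapshot finishes arriving at
    the remote site: everything written since the previous snapshot was
    taken, plus the time the transfer itself takes, is at risk.
    """
    transfer_s = changed_bytes / link_bytes_per_s
    return snapshot_interval_s + transfer_s


def meets_target(rpo_target_s: float, estimated_rpo_s: float) -> bool:
    """Check an estimate against a recovery point objective."""
    return estimated_rpo_s <= rpo_target_s


# Example figures (assumptions): a snapshot every 15 minutes, 9GB of
# changed data per snapshot, and a 100MB/s link between sites.
estimate = worst_case_rpo_seconds(900, 9_000_000_000, 100_000_000)
ok_for_one_hour = meets_target(3600, estimate)
ok_for_five_minutes = meets_target(300, estimate)
```

With these figures the estimate is 990 seconds, which would satisfy a one-hour recovery point objective but not a five-minute one. Applications that need a tighter recovery point generally need synchronous replication or a stretched cluster.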

Where data protection is distributed and implemented with erasure coding, there could be a performance penalty similar to that for synchronous replication solutions.
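To show where that penalty comes from, here is a deliberately simplified Python sketch of the idea behind erasure coding, using a single XOR parity fragment. This is an assumption chosen for brevity: real hyper-converged products generally use more capable codes that tolerate multiple failures, and the function names here are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(fragments: list[bytes]) -> bytes:
    """Compute one parity fragment over equal-length data fragments."""
    return reduce(xor_bytes, fragments)


def recover(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing fragment from the survivors and parity."""
    return reduce(xor_bytes, surviving, parity)


# Three data fragments, each imagined to live on a different node.
data = [b"node-one", b"node-two", b"node-3!!"]
parity = encode(data)

# Simulate losing the second node and rebuilding its fragment.
rebuilt = recover([data[0], data[2]], parity)
```

Every write has to update the parity fragment as well as the data, and every rebuild has to read all the surviving fragments, which may sit on other nodes or other sites. That extra computation and network traffic is the source of the overhead.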

Disaster recovery replication to a secondary location and cluster means deploying a minimum hardware footprint. With hyper-converged infrastructure, this footprint doesn’t have to be a similar configuration to the primary environment, which allows a minimal configuration to be used for only the most critical apps in a disaster recovery scenario.

Alternatively, clusters can be run in multiple locations, failing over to one main site in case of a disaster.

Hyper-converged product options

VMware Virtual SAN, used in Dell EMC VxRail products, provides the capability to build out a stretched cluster between geographic locations. Stretched clusters are typically built across metro locations where latency impacts are minimal. A vSAN stretched cluster can provide zero data loss and near-zero downtime (other than a VM restart), supporting rack, node or site failure. VMware has also announced the ability to replicate data into VMware Cloud on AWS. However, at this stage, the process seems aimed more at workload migration than disaster recovery.

HPE SimpliVity builds disaster recovery into the platform with the capability to migrate virtual machines between clusters of nodes in geographically dispersed locations. The RapidDR feature automates fast recovery from site failure, including managing internal VM settings such as IP addresses. The ability to implement rapid failover from one site to another is based on the deduplication engine in each SimpliVity OmniStack node, which ensures only unique data has to be transferred between locations in order to manage a VM recovery.

Nutanix uses a number of features in its Enterprise Cloud Platform that facilitate disaster recovery. Availability Domains provide the capability to survive node and rack failures by intelligently distributing data across multiple nodes in a cluster. Clusters can be stretched to implement metro-level availability.

Replication copies data between nodes/clusters in a range of options, from traditional point-to-point designs up to full-mesh configurations. Data deduplication ensures only unique data is moved between clusters. Cloud Connect enables Nutanix VMs to be replicated into the public cloud (AWS or Azure). A cloud deployment consists of a single logical Nutanix node running in a cloud instance backed by S3 storage.

Scale Computing recently announced the ability to run HC3 (the Scale operating system) on Google Cloud Platform, providing a failover capability from on-premises deployments. This service is expected to replace the current offering where Scale Computing itself runs the disaster recovery environment. Customers can already create multi-site deployments of HC3 clusters and replicate instances between clusters to implement disaster recovery. Google Cloud Platform support provides the ability to implement disaster recovery without the cost of an additional site.

Cisco HyperFlex implements disaster recovery through the ability to replicate virtual machine images. VM replicas are based on zero-cost snapshots (metadata replicas) with WAN traffic optimised using techniques such as compression and delta (changed block) shipping.
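The general technique of shipping compressed changed blocks can be sketched in a few lines of Python. This is not Cisco's implementation: the 4KB block size, the use of zlib and the function names are assumptions for illustration, and the sketch assumes the VM image stays the same size between snapshots.

```python
import zlib

BLOCK_SIZE = 4096  # illustrative block size, not taken from any product


def changed_blocks(previous: bytes, current: bytes,
                   block_size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Return {block_index: compressed_block} for blocks that differ."""
    delta = {}
    for offset in range(0, len(current), block_size):
        new_block = current[offset:offset + block_size]
        if previous[offset:offset + block_size] != new_block:
            delta[offset // block_size] = zlib.compress(new_block)
    return delta


def apply_delta(replica: bytes, delta: dict[int, bytes],
                block_size: int = BLOCK_SIZE) -> bytes:
    """Apply shipped blocks to the replica held at the recovery site."""
    image = bytearray(replica)
    for index, payload in delta.items():
        block = zlib.decompress(payload)
        start = index * block_size
        image[start:start + len(block)] = block
    return bytes(image)


# Previous snapshot: a 16KB image of zeros, already held at both sites.
previous_snapshot = bytes(16384)

# Current snapshot: a small edit inside the second block.
working_copy = bytearray(previous_snapshot)
working_copy[4096:4100] = b"edit"
current_snapshot = bytes(working_copy)

delta = changed_blocks(previous_snapshot, current_snapshot)
wire_bytes = sum(len(payload) for payload in delta.values())
replica = apply_delta(previous_snapshot, delta)
```

Only the one block containing the edit is shipped, and compression shrinks it further before it crosses the WAN. The replica at the recovery site ends up identical to the current snapshot.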
