How to plan and manage datacentre redundancy

Provisioning a datacentre to survive a range of failure scenarios has become critical for many businesses. Here's how to approach a datacentre redundancy plan

In 2008, author Nicholas Carr argued that the IT department would not survive in the form most people were used to, as organisations moved the bulk of their IT out of owned facilities to the cloud.

Five years on, this has not happened quite as Carr imagined – and that brings us to two different ways of looking at a “redundant” datacentre.

Redundancy in IT is a system design in which a component is duplicated so that, if it fails, there is a backup. In a datacentre, redundant components – such as servers, power supplies, fans, hard disk drives, operating systems and telecommunication links – may be installed to back up primary resources in case they fail.

If Carr had been right, then organisations would be looking at what to do with excess – or redundant – datacentre capacity. Redundancy has a negative connotation when the duplication is seen by the business as unnecessary.

Yes, for some businesses this datacentre capacity excess is an issue, but for the majority, the other form of redundancy – the provisioning of a datacentre to survive a range of failure scenarios – has become even more of an issue.

IT infrastructure is part of an organisation’s DNA. If an organisation’s IT service were cut off, it would not be a small snag but a corporate catastrophe for its operations: business processes would halt, customers would be left stranded, suppliers would not know what they were required to deliver, the organisation would struggle to pay its employees what they are owed, and communication and collaboration would be severely impaired.

When it comes to the overall availability of an IT platform, running a single application on a single physical server with dedicated storage and a single dedicated network connection is a strategy for oblivion. It is incumbent on IT to ensure that the IT platform can continue to operate through failures – as long as the cost of doing so fits the organisation’s own cost/risk profile.
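As a rough illustration of why, consider how the availabilities of a non-redundant stack multiply together. The short Python sketch below uses purely assumed availability figures, not measured data, but shows how quickly the expected downtime mounts up:

```python
# Illustration of a single-path stack: with no redundancy, the platform is only
# as available as the product of all its parts. All figures are assumptions.

layers = {
    "server": 0.999,        # single physical server
    "storage": 0.999,       # dedicated, non-mirrored storage
    "network_link": 0.995,  # one dedicated network connection
    "power_feed": 0.998,    # single power distribution path
}

availability = 1.0
for layer_availability in layers.values():
    availability *= layer_availability

downtime_hours = (1 - availability) * 365 * 24
print(f"End-to-end availability: {availability:.4%}")
print(f"Expected downtime: roughly {downtime_hours:.0f} hours a year")
```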

When considering just how redundant a datacentre should be, it is best to consider failure scenarios as a scale.

Such an approach will help datacentre professionals to assess the cost of each outage and take it to the business – the business stakeholders can then decide at what point the cost of managing the failure (moving to a disaster recovery plan) becomes lower than the cost of surviving the failure (a business continuity plan).
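To see how that trade-off plays out across the scale, the sketch below compares the expected annual cost of an outage against the annual cost of the redundancy needed to survive it. Every scenario, probability and cost figure here is a made-up assumption for illustration only – the point is the shape of the decision, not the numbers:

```python
# For each failure scenario on the scale, compare the expected cost of managing
# the failure (accepting the outage under a disaster recovery plan) with the
# cost of surviving it (business continuity spend). All figures are assumptions.

scenarios = [
    # (name, annual probability, outage hours if it happens,
    #  cost per outage hour, annual cost of redundancy that survives it)
    ("Component failure", 0.90,   1, 20_000,  15_000),
    ("Assembly failure",  0.30,   8, 20_000,  40_000),
    ("Site failure",      0.02,  48, 20_000, 250_000),
    ("Regional failure",  0.002, 96, 20_000, 600_000),
]

for name, probability, hours, cost_per_hour, redundancy_cost in scenarios:
    expected_outage_cost = probability * hours * cost_per_hour
    decision = ("survive it (build redundancy)"
                if redundancy_cost < expected_outage_cost
                else "manage it (disaster recovery plan)")
    print(f"{name:18s} expected outage cost {expected_outage_cost:>9,.0f} "
          f"vs redundancy {redundancy_cost:>9,.0f} -> {decision}")
```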

Failure scenarios for datacentre redundancy

Analyst firm Quocirca’s scale of failure scenarios for datacentre redundancy includes the following:

1. Component failure – for example, where a power supply or a disk drive fails

The use of an “N+1” approach (having one more component than is strictly needed) can generally see an IT platform through this. For example, using two power supplies in a server, or a RAID array for storage, will generally provide enough time for the failed component to be replaced.

For systems where failure is simply not acceptable or affordable, an “N+M” approach (having more than one extra component in place) may be used.
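As a simple illustration of what the extra components buy, the following Python sketch models a resource that needs N working units with M spares, assuming each unit is independently available 99% of the time (an assumed figure, not a vendor specification):

```python
from math import comb

def platform_availability(n_required: int, spares: int, unit_availability: float) -> float:
    """Probability that at least n_required of (n_required + spares) identical,
    independent units are working at any given moment (a simple binomial model)."""
    total = n_required + spares
    return sum(
        comb(total, k) * unit_availability**k * (1 - unit_availability)**(total - k)
        for k in range(n_required, total + 1)
    )

a = 0.99  # assumed availability of a single power supply, fan or drive
print(f"N alone (no spare): {platform_availability(1, 0, a):.4%}")
print(f"N+1:                {platform_availability(1, 1, a):.4%}")
print(f"N+M (M=2):          {platform_availability(1, 2, a):.4%}")
```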

Within the facility itself, modular uninterruptible power supplies (UPSs), generators and chillers with built-in N+1 redundant power supplies, batteries and so on can be used.

Monolithic facilities equipment does not lend itself easily to this approach.

2. Assembly failure – for example, the failure of a complete server or a storage system

With virtualisation, greater levels of availability can be provided through mirroring live images within the same system. Where physical platforms are still in use, clustering, storage mirroring and multiple network interface cards (NICs) will provide resilience to failure. 

Again, within the facility, the key is to move to modular systems. For example, if the UPS consists of five sub-modules, an N+1 approach will require six modules – or 120% of the actual requirement. If a monolithic approach is taken, an N+1 approach will result in 200% of the actual requirement, with the associated higher capital and maintenance costs.
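That arithmetic can be laid out explicitly. The 100kW load below is an illustrative assumption; the module counts follow the example in the text:

```python
# N+1 UPS capacity: modular sub-modules versus one monolithic unit.
# The 100kW requirement is an assumed figure for illustration.

required_capacity_kw = 100

# Modular: five 20kW sub-modules meet the load, so one more gives N+1.
module_kw = required_capacity_kw / 5
modular_installed_kw = 6 * module_kw                # 120kW

# Monolithic: N+1 means installing a complete second unit.
monolithic_installed_kw = 2 * required_capacity_kw  # 200kW

print(f"Modular N+1:    {modular_installed_kw:.0f}kW installed "
      f"({modular_installed_kw / required_capacity_kw:.0%} of requirement)")
print(f"Monolithic N+1: {monolithic_installed_kw:.0f}kW installed "
      f"({monolithic_installed_kw / required_capacity_kw:.0%} of requirement)")
```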

3. Room failure – for example, through power distribution failure

This would require two datacentres to be built within the same building, with the facility services mirrored across each – N+1 power distribution networks, UPSs, cooling systems and so on. For most organisations this is, by its very nature, far too expensive, and so it would tend to be dealt with as a site failure (scenario 5 below).

The key strategy for room (and building) failure is to avoid it wherever possible. The use of N+1 strategies at the equipment level can help, along with environmental monitoring systems to give early warning of hot spots developing, smoke appearing or moisture levels increasing.
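A minimal sketch of that kind of environmental check is shown below. The sensor names, readings and thresholds are all hypothetical; a real deployment would feed readings from the facility's monitoring system rather than hard-coded values:

```python
# Compare hypothetical room-sensor readings against alert thresholds to flag
# developing hot spots, smoke or rising moisture early. All values are made up.

THRESHOLDS = {
    "temperature_c": 27.0,          # assumed upper bound for cold-aisle inlet temperature
    "relative_humidity_pct": 60.0,  # assumed upper bound for moisture
}

readings = [
    {"sensor": "rack-a1-inlet", "temperature_c": 24.5, "relative_humidity_pct": 45.0, "smoke": False},
    {"sensor": "rack-c7-inlet", "temperature_c": 29.1, "relative_humidity_pct": 52.0, "smoke": False},
]

for reading in readings:
    alerts = []
    if reading["temperature_c"] > THRESHOLDS["temperature_c"]:
        alerts.append(f"possible hot spot ({reading['temperature_c']}C)")
    if reading["relative_humidity_pct"] > THRESHOLDS["relative_humidity_pct"]:
        alerts.append(f"humidity rising ({reading['relative_humidity_pct']}%)")
    if reading["smoke"]:
        alerts.append("smoke detected")
    print(f"{reading['sensor']}: {'; '.join(alerts) if alerts else 'ok'}")
```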

4. Building failure – for example, through fire or flood

This would need the datacentre to be mirrored to another, which could be within the same campus. Even with the use of virtualisation and cloud, this is probably too expensive for the majority, so it is best to regard this as a site failure as well.


5. Site failure – for example, caused through a local power failure or a break in connectivity through cable/fibre fracture

This is where longer-distance mirroring comes in. The use of a separate facility, with cold or hot standby resources to switch over to, is the only real way to maintain business capability. Data management starts to become more of an issue, as latency between sites introduces the possibility of losing in-flight transactions.
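A rough sketch of that concern, assuming asynchronous replication to the standby site and a hypothetical recovery point objective (RPO) agreed with the business, might look like this:

```python
# If replication to the standby site runs asynchronously, any transactions
# committed inside the current lag window are lost on failover. The RPO and
# transaction rate below are assumed figures for illustration.

from datetime import datetime, timedelta, timezone

RPO = timedelta(seconds=30)  # assumed recovery point objective

def failover_exposure(last_applied_at_standby: datetime, now: datetime,
                      transactions_per_second: float) -> tuple[timedelta, float]:
    """Return the current replication lag and a rough estimate of how many
    committed transactions would be lost if the primary site failed right now."""
    lag = now - last_applied_at_standby
    return lag, lag.total_seconds() * transactions_per_second

now = datetime.now(timezone.utc)
lag, at_risk = failover_exposure(now - timedelta(seconds=45), now, transactions_per_second=120)
status = "breaches" if lag > RPO else "is within"
print(f"Replication lag of {lag.total_seconds():.0f}s {status} the "
      f"{RPO.total_seconds():.0f}s RPO; roughly {at_risk:.0f} transactions at risk")
```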

6. City failure – for example, due to major disruption such as terrorism activity, storm or power grid failure 

From this point on, full mirroring of capabilities at the IT level will be required. At the facility level, the organisation has the choice to mirror the facility, or to use an external infrastructure as a service (IaaS) or platform as a service (PaaS) provider to enable a suitable platform to be immediately or very rapidly provisioned.

7. Regional failure – for example, due to major natural disaster such as earthquake or tsunami  

Again, here an organisation will be looking at total mirroring, and Quocirca recommends looking to move away from facilities mirroring for cost reasons.

8. Country failure – for example, due to civil war or epidemic outbreak

Here, much longer distances for mirroring are required. However, many colocation companies (such as Equinix or Interxion) can provide long-distance, low-latency dedicated connections between their facilities, enabling long-distance business continuity without the need for complex data management to deal with latency.

This leaves us with two additional redundancy scenarios:

1. Geographic failure

Replicating and mirroring between continents is not as hard as it once was. Again, many colocation providers may be able to help here.

2. World failure

At this level, IT and the business may have other matters to worry about. However, if you really want to be able to put in place a possible disaster recovery plan for this, send your data out from the planet as a maser (a microwave-emitting version of the laser) data stream. Provided you can then get ahead of it at some future time, you can recapture all your data for use from another planet.


Clive Longbottom is service director at analyst Quocirca. The datacentre consultancy firm has three papers covering IT lifecycle management and IT financing available for free download: Using ICT financing for strategic gain; Don’t sweat assets, liberate them; and De-risking IT lifecycle management.

 
