IT service continuity costs - not for the faint-hearted?

Is IT service continuity an overly ambitious quest, laughable for any but those with pockets as deep as those in the high-rolling financial industries?  Is it possible for an organisation to aim for an IT system that is always available, without it costing more than the organisation’s revenues?

I believe that we are getting closer – but maybe we’re not quite there yet.

To understand what total IT service continuity needs, it is necessary to understand the dependencies involved.

Firstly, there is the hardware – without a hardware layer, nothing else above it can run.  The hardware consists of servers, storage and networking equipment, and may also include specialised appliances such as firewalls.  Then, there is a set of software layers, from hypervisors through operating systems and application servers to applications and functional services themselves.

For total IT service continuity, everything has to be guaranteed to stay running – no matter what happens.  Pretty unlikely, eh?

This is where the business comes in.  Although you are looking at IT continuity, the board has to consider business continuity.  IT is only one part of this – but it is a growing part, as more and more of an organisation’s processes are facilitated by IT. The business has to decide what is of primary importance to it – and what isn’t so important.

For example, keeping the main retail web site running for a pure eCommerce company is pretty much essential, whereas maintaining an email server may not be quite so important.  For a financial services company, keeping those parts of the IT platform that keep the applications and data to do with customer accounts running will be pretty important, whereas a file server for internal documents may not be.

Now, we have a starting point.  The business has set down its priorities – IT can now see if it is possible to provide full continuity for these services.

If a mission critical application is still running in a physical instance on a single server, you have no chance.  This is a disaster waiting to happen.  The very least that needs doing is moving to a clustered environment to provide resilience if one server goes down.  Same with storage – data must be mirrored (or at least run over a redundant array, preferably based on erasure code redundancy, but at least RAID 1; RAID 0 only stripes data and gives no protection against a drive failure). Network paths also need redundancy – so dual network interface cards (NICs) should also be used.
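To put rough numbers on why doubling up helps, here is a minimal Python sketch.  It assumes component failures are independent, and the availability figures (SERVER, STORAGE and NIC) are made up for illustration rather than measured.  Real failures are often correlated, so treat the result as an optimistic estimate.

```python
HOURS_PER_YEAR = 24 * 365


def series(*availabilities: float) -> float:
    """Every component must be up, so the availabilities multiply."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total


def redundant(availability: float, copies: int = 2) -> float:
    """Any one of `copies` identical components is enough to keep going."""
    return 1.0 - (1.0 - availability) ** copies


def downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Illustrative figures only (assumptions, not vendor or measured data).
SERVER, STORAGE, NIC = 0.99, 0.995, 0.999

single_path = series(SERVER, STORAGE, NIC)
doubled_up = series(redundant(SERVER), redundant(STORAGE), redundant(NIC))

print(f"single path: {single_path:.4%}, {downtime_hours(single_path):.1f} h/year down")
print(f"doubled up:  {doubled_up:.4%}, {downtime_hours(doubled_up):.1f} h/year down")
```

The point of the sketch is the shape of the calculation: components in a chain drag availability down, while duplicating each link pulls it back up.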

Is this enough?  Not really.  You have put in place a base level of availability that can cope with the failure of a single critical item – a server, a disk drive or a NIC can fail, and continuity will still be there.  How about a general power failure in the data centre?  Is your uninterruptible power supply (UPS) up to supporting all those mission critical workloads – and is the auxiliary generator up to running such loads for an extended period of time if necessary?  What happens if the UPS or generator fails – are they configured in a redundant manner as well?
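The redundancy question for power can be framed as a simple capacity check: if the largest single unit fails, do the remaining units still carry the critical load?  The sketch below is one way to express that in Python.  The function name and the kW figures are hypothetical, chosen only to illustrate the idea.

```python
from typing import Sequence


def survives_unit_failure(unit_capacities_kw: Sequence[float],
                          critical_load_kw: float) -> bool:
    """True if the critical load is still covered after losing the largest unit."""
    if not unit_capacities_kw:
        return False
    remaining = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return remaining >= critical_load_kw


# Hypothetical figures: a 75 kW critical load.
CRITICAL_LOAD_KW = 75.0

one_big_ups = [100.0]               # plenty of capacity, but a single point of failure
three_modules = [40.0, 40.0, 40.0]  # lose one module and 80 kW is still available

print("single UPS survives a failure:", survives_unit_failure(one_big_ups, CRITICAL_LOAD_KW))
print("modular UPS survives a failure:", survives_unit_failure(three_modules, CRITICAL_LOAD_KW))
```

The same check applies to generators; it says nothing about how long the units can run, which still has to be sized separately.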

Let’s go up a step: rather than relying on a simple physical layer, let’s put in a hypervisor and go virtual.  Do this across all the resources we have – servers, storage and network – and a greater level of availability is there for us. The failure of any single item should have very little impact on the overall platform – provided that it has been architected correctly.  To get that architecture optimised, it really should be cloud.  Why? Because a true cloud provides flexibility and elasticity of resources – the failure of a physical system on which a virtual workload depends can be rapidly (and, hopefully, automatically) dealt with by reallocating resource from a less critical workload.  Support all of this with modular UPSs and generators, and systems availability (and therefore business continuity) is climbing.

Getting better – but still not there.  Why?  Well – an application can crash due to poor coding – memory leaks, a sudden trip down a badly coded path that has never been used before, whatever.  Even on a cloud platform, such a crash will leave you with no availability – unless you are using virtual machines (VMs).  A VM contains a copy of the working application that can be held on disk or in memory, and so can be spun up to get back to a working situation rapidly.

Even better are containers, which can hold more than just the application, or less.  A container can be everything that is required by a service above the hypervisor, or it can be just a function that sits on top of a virtualised IT platform. Again, these can be brought back up and made live very rapidly, working against mirrored data as necessary.

Wonderful.  However, the kids on the back seat are still yelling “are we there yet” – and the answer has to be “no”.

What happens if your datacentre is flooded, or there is a fire, an earthquake or some other disaster that takes out the datacentre?  All that hard work carried out to give high availability comes tumbling down – there is zero continuity.

Now we need to start looking at remote mirroring – and this is what has tended to scare off too many organisations in the past.  Let’s assume that we have decided that cloud is the way to go, with container-based applications and functions.  We know that the data, being live, cannot be containerised, so it needs to be mirrored on a live, as-synchronous-as-possible basis.  Yes, this comes at a cost – it is down to the business to decide if it can carry that expense, or carry the risk of losing continuity. Bear in mind that redundancy of network connections will also be required.

With mirrored data, it then comes down to whether the business has demanded immediate continuity, or whether a few minutes of downtime is OK.  If immediate, then ‘hot’ spinning images of the applications and servers will be required, with elegant failover from the disaster site to the remote site.  This is expensive – so may not be what is actually required.

Storing containers on disk is cheap – they are taking up no resources other than a bit of storage.  Spinning them up in a cloud-based environment can be very quick – a matter of a few minutes.  Therefore, if the business is happy with a short break, this is the affordable IT service management approach – mirrored live data, with ‘cold’ containers stored in the same location as that data, and an agreement with the service provider that when a disaster happens, they will spin the containers up and place them against the mirrored data to provide an operating backup site.

For the majority, this will be ideal – some will still need full systems availability for maximum business continuity.  For us mere mortals, a few minutes of downtime will often be acceptable – or at least, downtime for most of our systems, with maybe one or two mission critical systems being run as ‘hot’ services to keep everyone happy.
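The hot-or-cold decision can be written down as a simple rule: a service gets a ‘hot’ standby only if the business cannot tolerate the time it takes to spin up a cold container.  The Python sketch below illustrates that rule.  The service names echo the examples earlier in this piece, but the tolerated downtimes and the five-minute spin-up time are assumptions made purely for illustration.

```python
from dataclasses import dataclass

# Assumed time to spin up a cold container at the remote site (illustrative).
COLD_SPIN_UP_MINUTES = 5.0


@dataclass(frozen=True)
class Service:
    name: str
    tolerated_downtime_minutes: float  # set by the business, not by IT


def standby_mode(service: Service,
                 cold_spin_up_minutes: float = COLD_SPIN_UP_MINUTES) -> str:
    """Pick the cheaper 'cold' option whenever the business can wait for spin-up."""
    if service.tolerated_downtime_minutes >= cold_spin_up_minutes:
        return "cold"
    return "hot"


# Hypothetical priorities as the business might set them.
services = [
    Service("retail web site", 0.0),
    Service("customer accounts", 0.0),
    Service("email", 60.0),
    Service("internal file server", 240.0),
]

plan = {s.name: standby_mode(s) for s in services}
for name, mode in plan.items():
    print(f"{name}: {mode}")
```

In practice the numbers come from the business, as discussed above; the sketch only shows that once they are set, most services fall naturally into the cheaper ‘cold’ group.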