No matter how comprehensively a business prepares for unexpected events, no disaster recovery plan is going to cater for every eventuality. As analyst firm Gartner notes: "It is not possible to completely protect against every threat."
But some downtime can be avoided or the impact minimised if there is a clear action plan. At Computer Weekly's latest CW500 Club event, IT leaders discussed the real-life experiences of their peers in dealing with disasters.
"We connect hungry people to their food and it would be a disaster for us if people cannot place an order," says Amarpal Attwal, technology manager at online takeaway service Just Eat.
"The first disaster I was involved in was when the website for all our countries went down due to overloading of servers in a datacentre in Denmark," he says. The problem was due to capacity planning. Attwal admits the company had not been as agile as it could have been in its datacentre operations.
Another form of disaster for Just Eat would be staff being unable to connect to internal tools. A second incident, a power failure in the company's UK head office, had a much wider impact. "The UK office was the hub for all other country offices," he adds. At the time, the firm did not have a plan for dealing with disasters.
Attwal says the company had to take a look at its infrastructure and at the business to create a framework that outlined how it would respond to a disaster scenario and the cost impact. He says: "We needed to understand what was important for us from a disaster recovery point of view and we ended up architecting our entire ecosystem."
The result is that while the company still has its Danish datacentre, Just Eat is now cloud native. Attwal says: "We have moved everything over to Amazon Web Services, not only our commerce platform but also the corporate infrastructure as well."
Thanks to deploying the website and business systems on Amazon Web Services (AWS), Just Eat has removed physical servers from company offices. The result is a central core that the offices connect to. "We are also a great believer in SaaS [software as a service] - we want to secure our data and build to mitigate against failure," Attwal adds.
Just Eat adopts a policy of expecting failure and architects its cloud systems in that way. The IT team uses an open-source tool called Chaos Monkey, first developed by Netflix, that intentionally creates failures in AWS system components to test responses and learn how to prevent them bringing down the whole operation.
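The idea behind this kind of chaos testing can be illustrated with a simplified sketch - this is not Just Eat's or Netflix's actual tooling, and the class and function names here are hypothetical - in which one healthy instance is terminated at random each round and the test checks whether the service as a whole can still serve traffic:

```python
import random

class Instance:
    """Hypothetical service instance; the API here is illustrative only."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def terminate(self):
        self.healthy = False

def chaos_round(instances, rng=random):
    """Chaos Monkey-style round: kill one healthy instance at random,
    then report whether the service can still handle traffic."""
    healthy = [i for i in instances if i.healthy]
    if healthy:
        victim = rng.choice(healthy)
        victim.terminate()
    # The service survives if at least one healthy instance remains
    return any(i.healthy for i in instances)

pool = [Instance(f"web-{n}") for n in range(3)]
assert chaos_round(pool)      # two of three instances still up
assert chaos_round(pool)      # one of three still up
assert not chaos_round(pool)  # last instance gone: a real outage
```

Running such rounds continuously against a redundant pool surfaces single points of failure before a genuine outage does.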
In terms of best practice for handling datacentre disasters, Attwal says: "Practice is everything. We have this concept of war games where we lock ourselves in a room and pick something out of a hat such as, nobody in the organisation can log on. What do we do? We go over [handling the scenario] theoretically and then run it practically." Such scenarios should be run regularly rather than once a year.
Taking people into account
Attwal says people are often overlooked in disaster recovery (DR) planning: "We haven’t paid a lot of attention to succession planning, such as if 20 people leave the company." To tackle this Just Eat runs sessions where groups sit around a table and share their expertise.
To avoid having all the know-how locked in one person’s brain, Just Eat organises teams around project components. Attwal says: "We are more focused on objectives and have organised [teams] around component ownership groups, which splits the risks."
Kirk Langley, head of business continuity at investment management firm Brewin Dolphin, agrees that the people elements of disaster management are often overlooked.
"It's an uphill task to get collaboration across departments. I have worked in many organisations and you do get silos of business continuity people and IT people," he says.
Langley argues that business continuity professionals need to understand a reasonable amount about IT: “In any type of organisation you will find someone will pinch your team. If you lose a key team of financial planners and investment managers in financial services, it’s a business problem but it is also an IT problem because you have to use IT to assign a new investment manager to all the clients."
Flexible work teams can enable a business to carry on if the head office is unavailable. But as Just Eat's Attwal points out: "Nothing beats face-to-face interaction." Not having everyone in the same room can also hinder planning, particularly in the early stages of a disaster when key people need to coordinate the step-by-step process they need to invoke in the company's DR strategy.
As firms make increasing use of purpose-built datacentres and the cloud, having key people in the right place at the right time becomes essential. Adrian Moir, senior manager, systems consulting at Dell Software, says: "It is very hard to get 20 people in a datacentre around four racks and try to do everything together. Your disaster recovery [centre] is actually your office, where your talent and secondary hardware is located."
Moir believes datacentre DR plans often fail to take into account key IT personnel: "A lot of people forget about providing the service to the business and making sure people are active and productive."
While companies often operate active-active resilient datacentres, he says IT often forgets about who needs to access the datacentre: "Think about how much access to the equipment and applications the teams that need to execute the DR plan will need."
No one worries when systems are up and running - they only notice when the system fails. Then the business will ask: "How soon will we be up and running again?" This is the critical question, says James Lodge, head of IT disaster recovery at Nationwide Building Society.
"Many organisations can struggle to confirm how long a critical system will take to recover at any given moment in time," he says. The time from when the system goes down to when it is up and running again cannot always be estimated accurately.
A typical bank will run three types of systems, says Lodge: those that interact with the customer; business systems such as sales processing; and datacentre systems. Clearly some systems have greater visibility within the business at different times.
"If a system in sales falls over when all other systems are green, it has a higher priority than if that happened in the middle of a wider datacentre failure,” Lodge says. Another factor business continuity experts need to take into account is that criticality of a system will change through the day - for example, email is often much more important during office hours than overnight.
In the model he uses for disaster recovery (see diagram, below), Lodge sets out a timeframe which specifies how long it would take for any given business system to return to normal operations if it falls over.
From a DR perspective, core datacentre components like networks or Active Directory software are just as critical, since they will affect the ability of the business to operate properly. Unfortunately, it has often been difficult to accurately estimate how long a whole datacentre will take to get back online.
Traditionally, it has only been possible to give a 24-to-48-hour estimate for everything returning to normal operations after a datacentre outage. But by breaking the datacentre down into its constituent components, Lodge says it is possible to give the business a more accurate assessment of the recovery time. "This is a way of building up a more granular recovery for a datacentre," he says.
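The component-based approach can be sketched as a dependency model - the components, timings and dependencies below are hypothetical, for illustration only - where each component can only start recovering once everything it depends on is back, so its total recovery time is its own restore time plus the longest recovery path among its dependencies:

```python
# Hypothetical component model: (recovery hours, dependencies).
COMPONENTS = {
    "network":          (2, []),
    "active_directory": (3, ["network"]),
    "storage":          (4, ["network"]),
    "sales_app":        (1, ["active_directory", "storage"]),
}

def recovery_time(component, model=COMPONENTS):
    """Time to restore a component: its own recovery time plus the
    longest recovery path among its dependencies."""
    own, deps = model[component]
    return own + max((recovery_time(d, model) for d in deps), default=0)

print(recovery_time("sales_app"))  # 1 + max(2+3, 2+4) = 7 hours
```

Summing along the critical path like this turns a vague "24 to 48 hours" into a per-system estimate the business can plan around.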
According to analyst Gartner, problems with a DR strategy arise because DR planning is not built top-down from an overall strategy, with appropriate priorities and objectives. Just Eat, for instance, needed to develop a framework for handling future DR events.
This strategy needs to stipulate where key people have to be – particularly if they are required for rebooting datacentre systems. Clearly these people need access to the business continuity site and access rights for the systems they are required to restart. This involves business continuity managers understanding which key IT people are required.
And while system dashboards will alert IT teams to a problem, other parts of the organisation may need to be alerted quickly, especially now that customers will instantly take to social media if they have problems with websites or mobile apps, for example. "It is often the case that customers may become aware of an outage and start contacting the media within as little as eight minutes," Lodge says.
As such, he recommends active and positive management of Twitter and other social media feeds. Depending on the type of organisation, social media monitoring and a team to manage the media is a must-have, according to Lodge.
No one can fully protect against a major systems outage. But, as Lodge notes: "Reputational damage can be far greater than the actual financial loss", and as such, modern disaster recovery strategies need to include more than just the task of getting IT systems back online.