British Airways IT outage: What went wrong with its datacentre?

BA has blamed “human error” for its bank holiday datacentre outage, but the Uptime Institute suggests there may be more to it than that

The explanation offered by British Airways as to the cause of its bank holiday datacentre meltdown is insufficient, say experts, who slam the airline for putting the incident down to “human error”.

Speaking to Computer Weekly, Uptime Institute president Lee Kirby said the phrase is all too often used by firms to hide a multitude of datacentre design and training flaws, caused by years of underinvestment in their server farms.

“We have collected incident data and conducted root cause analysis for more than 20 years and have the largest database of incidents from which to draw industry-level trends,” he said. “One thing we have noticed is that ‘human error’ is an overarching label that describes the outcomes of poor management decisions.”

More than two decades have passed since the Uptime Institute published its Tier Standards Topology classification system, which gives operators a steer on how to build redundancy into their datacentres, but it seems the message is still not getting through to some, said Kirby.

“From a high-level point of view, the thing that is troubling me is that we’re still having major datacentre outages when we solved this problem 20 or more years ago with the introduction of the Tier Standards,” he said.

“If you had a Tier 3 datacentre with redundant distribution paths and equipment, you wouldn’t be running into these problems.”
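The redundancy argument can be made concrete with basic availability arithmetic: duplicating a distribution path multiplies out the chance of both paths failing at once. The figures below are illustrative assumptions only, not BA's or the Uptime Institute's numbers:

```python
# Illustrative availability arithmetic for redundant power paths.
# All component availabilities are made-up example figures.

def series(*avail):
    """Availability of components that must ALL work (a single path)."""
    p = 1.0
    for a in avail:
        p *= a
    return p

def parallel(a, b):
    """Availability of two independent redundant paths (either can carry the load)."""
    return 1 - (1 - a) * (1 - b)

ups, pdu = 0.999, 0.9995                         # assumed per-component availabilities
single_path = series(ups, pdu)                   # one UPS feeding one distribution path
dual_path = parallel(single_path, single_path)   # concurrently maintainable pair

print(f"single path: {single_path:.6f}")
print(f"dual path:   {dual_path:.6f}")
```

Even with these modest assumed figures, the dual-path arrangement cuts expected unavailability by roughly three orders of magnitude, which is the essence of the Tier 3 "redundant distribution paths" requirement.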

Willie Walsh, CEO of BA’s parent company, IAG, confirmed this week that the May bank holiday outage was caused by an engineer disconnecting the power supply to one of its datacentres, before reinstating it incorrectly.

Major damage to servers

It is understood this led to a power surge, which caused major damage to the servers the airline uses to run its online check-in, baggage handling and customer contact systems, resulting in flights from Heathrow and Gatwick being grounded for the best part of two days.

If the system is properly designed, a mishap of this nature should not cause an outage as severe as the one BA suffered, but much depends on when the site in question was built, Uptime Institute CTO Chris Brown told Computer Weekly.

“When they were built, some of the industry-accepted norms may have been a single UPS system and single distribution because most of the equipment in use at the time was single-ported, for instance,” he said.

“Management decisions about budget and cost and spending have not allowed these facilities to be upgraded over time to keep up with the demand and criticality of these systems.”

In the airline industry in particular, flight operators are under mounting pressure to cut costs in the face of growing competition from budget carriers, said Kirby, and the upkeep of their IT estates can be the first thing to suffer.

“Reducing the redundancy of builds is one of the first places they look and when they do that, they put themselves at risk. When something like this happens, the first thing they look for is a tech or sub-contractor to blame, when it’s really [down to] management decisions early on not to prop up the infrastructure, not to put up the training programmes to run 24/7,” he said.

Market forces have also conspired to change the way airlines use and depend on their IT assets, which brings its own pressures and problems, said Brown.

“A lot of the systems the airlines use have been around since the late 1970s and they weren’t really [designed] for client-facing systems. They were for internal use,” he said.

“Throughout the years, the systems have been updated and modified, but not holistically, because going in to rewrite all the systems from the ground up to use multiple datacentres and a lot of redundancy of IT assets is going to incur a lot of costs.

“A lot of the big carriers are being pressured by the more budget airline model to reduce their costs on the shorter-haul flights to keep customers, and the same applies to datacentres.”

Thorough review

For this reason, Kirby and Brown are urging BA to conduct a thorough review into how its datacentres are designed and managed, to prevent a repeat of such problems in the future.

“What they need to do is step back and get a holistic view of the entire situation,” said Brown. “What is the status of their IT systems and the facilities housing them, and what is the status and condition of their operations staff and team and programme?

“Then they’re going to need to create a plan to address that. It won’t be addressed in a short amount of time – it will take time, money and investment.”

Computer Weekly raised the points made by Kirby and Brown with BA, and was told that the company is in the process of conducting a thorough review.

“It was not an IT issue – it was a power issue,” a BA spokesperson said. “We know what happened, and we are investigating why it happened.”

Join the conversation



“One thing that we have noticed is that ‘human error’ is an overarching label that describes the outcomes of poor management decisions.”

This quote most accurately describes modern society. Management depends on recommendations from those who report to it. Management receives both good and bad recommendations, and tends to choose the visually attractive presentation. Most often, the thought put into the visualization was not put into the actual content of the recommendation. Often, the visualization is a marketing presentation created by a marketing team that is ignorant of the subject matter. This is true in software as well as hardware, and in non-IT as well as IT.

Management needs to be trained in how to choose the best recommendation from the many it receives.
Well, spintreebob, that's a rather old-school way of talking about management which will inevitably continue to fail - how exactly do you "train someone to make better decisions"? That sentence is easy to write but pretty meaningless in reality - these people have been on plenty of management courses and have plenty of experience.

Architecture is key. If it is produced well, management can clearly see for itself what is required and whether it is satisfied, to the point where it can even direct subordinates to point to the part of the architecture that is or is not satisfied by implementations.

It's not difficult but most business leaders aren't terribly interested.
Willie Walsh is doing PR. He can't be accused of lying, but if one "plug" brings down his entire IT operation then his team have no clue what they are doing, and neither did he.

Fine the company £25m-£50m for the crap it put its customers through unnecessarily, and you can be sure that something will get done, at least for this issue.
"It was not an IT issue - it was a power issue" is incorrect. It was a people issue, to do with the training and competence of a subcontractor after the operation had been outsourced. If the individual concerned had indeed come into the UK on a Tier 2 visa to replace a previous long-serving BA employee ...
Without spelling out more on the architecture, the technology used and a few more details, it is difficult to provide constructive suggestions...
Of course it was "human error". Remember that the CEO is a human too.

"it was not an IT issue – it was a power issue" shows a fundamental problem. The power sources for a computer system are an "IT issue". If the problem was that the lights or coffee machine went off in a call centre, then that's "not an IT issue", but if the power fails to one of your servers (for whatever reason) then that IS an "IT issue".
I am in agreement with the main idea here that blaming it on a generic "human error" is almost the equivalent of an I/O error... we know something went wrong after the fact, but nobody really knows how it came to that... except that shortcuts that should not have been taken... have been taken.

Blaming it on "human error" is a generic scapegoating phrase, the only spin the PR gurus behind BA's management have been able to come up with, not only to avoid revealing the company's shortcomings but also, probably, to avoid more public introspection into their business and the way they run the IT behind it, which could quite possibly lead to [massive] fines from one of the UK watchdogs...

In one word, 'lame'... on every aspect: human, IT, management, foresight and forecasting... and simple common sense.
...the worrying thing about the state of IT is that you have had six comments that show a total lack of understanding of how IT operations work in practice. We have been told that a power issue caused the problem... (of course Uptime jump on the bandwagon; after all, their one great input was to make the case for dual-corded devices). The action the operator did wrongly was to switch the servers back on... this did not "damage" the servers, but if they were switched on in the incorrect order then that will screw the entire system up. To explain this simply to the uninitiated: if you enabled, say, a flight booking or dispatch system to run before the filestore or communications system was up and running properly, then the system as a whole will fail.
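The ordering problem described here - services failing if brought up before the systems they depend on - is essentially a dependency graph, and a safe power-on sequence falls out of a topological sort. A minimal sketch (the service names are invented for illustration, not BA's actual systems):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical service dependency map: each service lists what must
# already be running BEFORE it can start. Names are illustrative only.
deps = {
    "filestore": [],
    "comms": [],
    "database": ["filestore"],
    "flight-booking": ["database", "comms"],
    "dispatch": ["database", "comms"],
    "check-in": ["flight-booking"],
}

# static_order() yields every service after all of its prerequisites,
# i.e. a valid power-on sequence for the whole estate.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

A startup runbook built this way makes the "wrong order" mistake much harder to commit, since the sequence is derived from the declared dependencies rather than remembered by an engineer.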
Most IT operations are outsourced these days... should BA concentrate on running air transport ops or IT? Outsourcing is not inherently bad, but the management of this activity is important - operatives need to be trained and have their competency verified. Lack of training, sure... nothing to do with management decisions or implied immigrant workers displacing "British" workers.
If the system is so badly designed that switching things on in the wrong order (a) is possible, (b) doesn't work, and (c) causes things to go so wrong that it takes DAYS to recover, then whoever designed it shouldn't be working in IT. At the very most, it should have required turning off again, a bit of tweaking, and restarting in the correct order. A few hours' downtime would have been "human error". A few days, and it's much more systemic than that.

They obviously hadn't done enough disaster recovery planning.

We run servers in a data centre and if they are turned on in the wrong order, things may not work immediately, but (if they don't recover automatically) up to 30 minutes' work will get things going again. Yes, we (probably) have fewer servers than BA, but we also have fewer people looking after them.

Well, Paul42 - in an enterprise datacentre with a lot of servers and systems, I can assure you, from my own experience, that turning things on in the correct order can take up to eight hours. It can take a further 24 or more hours to synchronise the backup site - but then maybe your business is not large enough, or mission-critical enough, to have a remote backup site. You clearly have little understanding of large DC operations. These complex systems were designed some 15 or more years ago, and in a fast-moving business environment you really cannot afford the downtime and business risk to continuously redesign. How incredibly lucky you are to just do a little tweak and get it working again.

Modern management (read: MBAs) are trained in finance and business; they know next to nothing about engineering, facility design or redundancy. When presented with two differing design solutions, they will tend to accept the least costly choice (without being able to evaluate the cost-benefit). The author of the lower-cost solution is the "hero" for saving money, and 10, 15 or 20 years later, when something goes wrong... well, they are both long gone and forgotten. But they have both gotten promotions in the meantime and moved on to screw something else up.

A retired consultant and professional engineer