British Airways IT outage: What went wrong with its datacentre?

BA has blamed “human error” for its bank holiday datacentre outage, but the Uptime Institute suggests there may be more to it than that


The explanation offered by British Airways as to the cause of its bank holiday datacentre meltdown is insufficient, say experts, who slam the airline for putting the incident down to “human error”.

Speaking to Computer Weekly, Uptime Institute president Lee Kirby said the phrase is all too often used by firms to hide a multitude of datacentre design and training flaws, caused by years of underinvestment in their server farms.

“We have collected incident data and conducted root cause analysis for more than 20 years and have the largest database of incidents from which to draw industry-level trends,” he said. “One thing we have noticed is that ‘human error’ is an overarching label that describes the outcomes of poor management decisions.”

More than two decades have passed since the Uptime Institute published its Tier Standards Topology classification system, which gives operators a steer on how to build redundancy into their datacentres, but it seems the message is still not getting through to some, said Kirby.

“From a high-level point of view, the thing that is troubling me is that we’re still having major datacentre outages when we solved this problem 20 or more years ago with the introduction of the Tier Standards,” he said.

“If you had a Tier 3 datacentre with redundant distribution paths and equipment, you wouldn’t be running into these problems.”
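To make the redundancy argument concrete, the back-of-the-envelope sketch below models how adding an independent power distribution path changes the odds that at least one path stays live. It is purely illustrative: the per-path failure probability is a made-up placeholder, not a figure from BA, the Uptime Institute or any real facility.

```python
# Illustrative only: how redundant, independent distribution paths
# change the probability that a datacentre loses all power feeds.
# P_FAIL is a hypothetical placeholder, not a real reliability figure.

def prob_total_loss(path_failure_prob: float, num_paths: int) -> float:
    """Probability that every independent path fails at once."""
    return path_failure_prob ** num_paths

P_FAIL = 0.01  # hypothetical chance a single path fails in a given window

for paths in (1, 2):
    print(f"{paths} path(s): total-loss probability = "
          f"{prob_total_loss(P_FAIL, paths):.6f}")

# With one path, a single mistake (or maintenance job) drops the load.
# With two concurrently maintainable paths, one can be taken offline
# while the other carries the load -- the idea behind Tier 3 topology.
```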

Willie Walsh, CEO of BA’s parent company, IAG, confirmed this week that the May bank holiday outage was caused by an engineer disconnecting the power supply to one of its datacentres, before reinstating it incorrectly.

Major damage to servers

It is understood this led to a power surge, which caused major damage to the servers the airline uses to run its online check-in, baggage handling and customer contact systems, resulting in flights from Heathrow and Gatwick being grounded for the best part of two days.

If the system is properly designed, an event of this nature should not cause an outage as severe as the one BA suffered, but that largely depends on when the site in question was built, Uptime Institute CTO Chris Brown told Computer Weekly.

“When they were built, some of the industry-accepted norms may have been a single UPS system and single distribution because most of the equipment in use at the time was single-ported, for instance,” he said.

“Management decisions about budget and cost and spending have not allowed these facilities to be upgraded over time to keep up with the demand and criticality of these systems.”
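Brown's point about single-ported equipment can be sketched in a few lines. In the hypothetical scenario below (the feed names and failure are my assumptions, not details from the article), a dual-corded server survives the loss of one feed, while a single-corded one falls with it.

```python
# Illustrative sketch of single- vs dual-corded IT equipment.
# Feed names and the failure scenario are hypothetical.

def server_up(live_feeds: set[str], cords: set[str]) -> bool:
    """A server stays up if at least one of its cords is on a live feed."""
    return bool(live_feeds & cords)

single_corded = {"feed_A"}
dual_corded = {"feed_A", "feed_B"}

# Scenario: feed_A is taken offline, by maintenance or by mistake.
live = {"feed_B"}

print("single-corded up:", server_up(live, single_corded))  # False
print("dual-corded up:  ", server_up(live, dual_corded))    # True
```

This is why retrofitting redundancy is costly: it is not just a second UPS, but dual distribution paths and equipment with two power inputs to use them.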

In the airline industry in particular, flight operators are under mounting pressure to cut costs in the face of growing competition from budget carriers, said Kirby, and the upkeep of their IT estates can be the first thing to suffer.

“Reducing the redundancy of builds is one of the first places they look and when they do that, they put themselves at risk. When something like this happens, the first thing they look for is a tech or sub-contractor to blame, when it’s really [down to] management decisions early on not to prop up the infrastructure, not to put up the training programmes to run 24/7,” he said.


Market forces have also conspired to change the way airlines use and depend on their IT assets, which brings its own pressures and problems, said Brown.

“A lot of the systems the airlines use have been around since the late 1970s and they weren’t really [designed] for client-facing systems. They were for internal use,” he said.

“Throughout the years, the systems have been updated and modified, but not holistically, because going in to rewrite all the systems from the ground up to use multiple datacentres and a lot of redundancy of IT assets is going to incur a lot of costs.

“A lot of the big carriers are being pressured by the more budget airline model to reduce their costs on the shorter-haul flights to keep customers, and the same applies to datacentres.”

Thorough review

For this reason, Kirby and Brown are urging BA to conduct a thorough review into how its datacentres are designed and managed, to prevent a repeat of such problems in the future.

“What they need to do is step back and get a holistic view of the entire situation,” said Brown. “What is the status of their IT systems and the facilities housing them, and what is the status and condition of their operations staff and team and programme?

“Then they’re going to need to create a plan to address that. It won’t be addressed in a short amount of time – it will take time, money and investment.”

Computer Weekly raised the points made by Kirby and Brown with BA, and was told that the company is in the process of conducting a thorough review.

“It was not an IT issue – it was a power issue,” a BA spokesperson said. “We know what happened, and we are investigating why it happened.”
