British Airways IT outage: What went wrong with its datacentre?

BA has blamed “human error” for its bank holiday datacentre outage, but the Uptime Institute suggests there may be more to it than that

Caroline Donnelly, Senior Editor, UK

Published: 09 Jun 2017 16:05

The explanation offered by British Airways as to the cause of its bank holiday datacentre meltdown is insufficient, say experts, who slam the airline for putting the incident down to “human error”.

Speaking to Computer Weekly, Uptime Institute president Lee Kirby said the phrase is all too often used by firms to hide a multitude of datacentre design and training flaws, caused by years of underinvestment in their server farms.

“We have collected incident data and conducted root cause analysis for more than 20 years and have the largest database of incidents from which to draw industry-level trends,” he said. “One thing we have noticed is that ‘human error’ is an overarching label that describes the outcomes of poor management decisions.”

More than two decades have passed since the Uptime Institute published its Tier Standards Topology classification system, which gives operators a steer on how to build redundancy into their datacentres, but it seems the message is still not getting through to some, said Kirby.

“From a high-level point of view, the thing that is troubling me is that we’re still having major datacentre outages when we solved this problem 20 or more years ago with the introduction of the Tier Standards,” he said.

“If you had a Tier 3 datacentre with redundant distribution paths and equipment, you wouldn’t be running into these problems.”

Willie Walsh, CEO of BA’s parent company, IAG, confirmed this week that the May bank holiday outage was caused by an engineer disconnecting the power supply to one of its datacentres, before reinstating it incorrectly.

Major damage to servers

It is understood this led to a power surge, which caused major damage to the servers the airline uses to run its online check-in, baggage handling and customer contact systems, resulting in flights from Heathrow and Gatwick being grounded for the best part of two days.

If the system is properly designed, a mistake of this nature should not cause an outage as severe as the one BA suffered, but that largely depends on when the site in question was built, Uptime Institute CTO Chris Brown told Computer Weekly.

“When they were built, some of the industry-accepted norms may have been a single UPS system and single distribution because most of the equipment in use at the time was single-ported, for instance,” he said.

“Management decisions about budget and cost and spending have not allowed these facilities to be upgraded over time to keep up with the demand and criticality of these systems.”

In the airline industry in particular, flight operators are under mounting pressure to cut costs in the face of growing competition from budget carriers, said Kirby, and the upkeep of their IT estates can be the first thing to suffer.

“Reducing the redundancy of builds is one of the first places they look and when they do that, they put themselves at risk. When something like this happens, the first thing they look for is a tech or sub-contractor to blame, when it’s really [down to] management decisions early on not to prop up the infrastructure, not to put up the training programmes to run 24/7,” he said.

Thorough review

For this reason, Kirby and Brown are urging BA to conduct a thorough review into how its datacentres are designed and managed, to prevent a repeat of such problems in the future.

“What they need to do is step back and get a holistic view of the entire situation,” said Brown. “What is the status of their IT systems and the facilities housing them, and what is the status and condition of their operations staff and team and programme?

“Then they’re going to need to create a plan to address that. It won’t be addressed in a short amount of time – it will take time, money and investment.”

Computer Weekly raised the points made by Kirby and Brown with BA, and was told that the company is in the process of conducting a thorough review.

“It was not an IT issue – it was a power issue,” a BA spokesperson said. “We know what happened, and we are investigating why it happened.”
