
British Airways confirms investigation under way into bank holiday datacentre power failure

Airline says its investigation will focus on ensuring the IT problems of the bank holiday weekend are never repeated

British Airways (BA) is embarking on an “exhaustive investigation” to establish the root cause of the datacentre power failure that grounded flights at Heathrow and Gatwick over the bank holiday weekend.

As previously reported by Computer Weekly, the power failure caused the airline’s check-in, baggage handling, booking and contact centre systems to fail on Saturday 27 May, resulting in the majority of flights from both airports being cancelled for two days.  

In a statement to Computer Weekly, a BA spokesperson said the company knows what happened, but is now in the process of establishing why.  

“There was a loss of power to the UK datacentre which was compounded by the uncontrolled return of power, leading to a power surge taking out our IT systems,” the statement said.

“We are undertaking an exhaustive investigation to find out the exact circumstances and, most importantly, ensure this can never happen again.”

In the days that followed the disruption, the GMB union issued a statement of its own, citing the airline’s decision to outsource some of its IT function to India in 2016 as a causal factor.

The BA statement rejected the GMB’s claims, saying its decision to outsource its IT requirements played no part in the incident.

“It was not an IT failure and had nothing to do with outsourcing of IT. It was an electrical power supply which was interrupted,” the statement said.

A report in The Telegraph, featuring input from unnamed sources, shed a little more light on the situation. It cited the failure of a defective uninterruptible power supply (UPS) within one of the airline’s two Heathrow-based datacentres as being to blame.

Power to the site was initially lost at 8.30am on Saturday, and should have been restored – if the UPS had been working correctly – in a controlled fashion.


But, as confirmed in the BA statement, quite the opposite occurred, leading to – as The Telegraph’s sources called it – “catastrophic physical damage” being done to the airline’s servers.

Speaking to Computer Weekly, Andy Lawrence, vice-president of research for datacentre technologies and eco-efficient IT, said what makes the case all the more puzzling is that most datacentres are designed to cope with problems of this nature.

“Some systems in the power chain clearly failed to perform as expected,” he said.

What will be interesting to know, once BA’s investigation concludes, is why so many of the airline’s systems were affected, said Lawrence.

“It is clear that BA has been grappling with several problems, starting with the power supplies, but extending to the network/messaging systems, and to the database/application design,” he said.

“Recovering from all these issues, when they extend across multiple teams and involve multiple contractors, is challenging.”

The move away from monolithic application architectures could have been a factor, he said, with each of these systems developing multiple external dependencies as they have changed and evolved over time.

“All this calls for a distributed resiliency strategy that ensures applications can adequately deal with partial failures and incomplete data,” said Lawrence.
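The "partial failures" idea Lawrence describes can be illustrated with a minimal sketch: rather than failing an entire request when one dependency is unreachable, an application degrades gracefully to cached data. The function name, retry scheme and the hypothetical booking lookup below are illustrative assumptions, not a description of BA's actual systems.

```python
import time

def fetch_with_fallback(fetch_live, cached_value, retries=2, backoff=0.1):
    """Try a live dependency a few times, then degrade to cached data
    rather than failing the whole request (illustrative sketch only)."""
    for attempt in range(retries):
        try:
            return fetch_live(), "live"
        except ConnectionError:
            time.sleep(backoff * (attempt + 1))  # simple linear backoff
    # Dependency unavailable: serve possibly-stale data instead of an error.
    return cached_value, "cached"

def broken_lookup():
    # Simulates a dependency in an unreachable datacentre.
    raise ConnectionError("datacentre unreachable")

print(fetch_with_fallback(broken_lookup, {"flight": "BA123"}, backoff=0))
# → ({'flight': 'BA123'}, 'cached')
```

The design choice here is the essence of a distributed resiliency strategy: every remote call has an explicit answer to the question "what do we return if this fails?"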

“Equally, backup and critical interrelated systems may need to be both electrically and logically separate from each other.

“If they are run from the same datacentre, this datacentre needs to be extremely well planned and run, to decrease the risk from site-wide failures.”
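Lawrence's point about electrical and logical separation can be sketched in a few lines: traffic is routed to whichever of two independently run sites passes a health check, so a site-wide failure at one does not take out the service. The site names and health-check functions are hypothetical.

```python
def choose_site(health_checks):
    """Return the first site whose health check passes, or None if all
    are down. Each site is assumed to be an independent failure domain."""
    for site, is_healthy in health_checks.items():
        if is_healthy():
            return site
    return None

sites = {
    "datacentre-a": lambda: False,  # simulate the failed primary site
    "datacentre-b": lambda: True,   # independent, still-healthy backup
}
print(choose_site(sites))  # → datacentre-b
```

The scheme only delivers resilience if the two sites genuinely share nothing: separate power chains, separate networks, and no common software dependency that can fail in both at once.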
