Visa reveals 'rare' datacentre switch fault as root cause of June 2018 outage

Visa has offered a retrospective analysis of what went wrong in its datacentre during its UK-wide outage on Friday 1 June, in response to a request from the Treasury Select Committee for more detail about the downtime

Visa has revealed, in a letter to the Treasury Select Committee, that a “rare defect” in a datacentre switch was what stopped millions of credit card transactions from being processed during its UK-wide outage on Friday 1 June.

The Committee is understood to have contacted the credit card payments firm, seeking both clarification over the cause of the outage and assurances about the action Visa is taking to prevent a recurrence.

Over the course of the 11-page missive, Visa expands on its previous explanation that a “hardware failure” caused the 10-hour outage, laying the blame on a defective switch in its primary UK datacentre, which in turn delayed its secondary datacentre from taking over the load.

The primary and secondary datacentres are set up so that either one has sufficient redundant capacity to process all the Visa transactions that take place across Europe should a fault occur, and the two systems are tightly synchronised to ensure this can happen at a moment’s notice.

“Each datacentre includes two core switches – a primary switch and a secondary switch. If the primary switch fails, in normal operation the backup switch would take over,” the letter reads.

“In this instance, a component within a switch in our primary data centre suffered a very rare partial failure which prevented the backup switch from activating.”

This, in turn, meant it took longer than intended to isolate the primary datacentre and activate the backup systems that should allow its secondary site to assume responsibility for handling all of the credit card transactions taking place at that time.
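Visa’s letter does not spell out how the switchover is meant to be triggered, but the failure mode it describes can be sketched in general terms. The following is a minimal, hypothetical illustration, assuming a simple health-check-driven failover; names such as Switch, is_healthy and route_transaction are invented for the example and do not reflect Visa’s actual systems. It shows how a partial failure, where a device still answers health checks but mishandles a share of traffic, can prevent an automatic backup from ever taking over.

```python
"""Hypothetical sketch of health-check-driven failover between two core switches.

This is illustrative only: Visa's letter does not disclose how its switches
signal health, and none of these names reflect its actual systems. The point
is to show how a partial failure (a device that still answers health checks
but mishandles some traffic) can stop an automatic backup from taking over.
"""

import random
from dataclasses import dataclass


@dataclass
class Switch:
    name: str
    responds_to_health_check: bool = True   # control plane still answers
    drop_rate: float = 0.0                  # fraction of traffic mishandled

    def is_healthy(self) -> bool:
        # A naive check only verifies that the switch responds,
        # not whether traffic is actually being forwarded correctly.
        return self.responds_to_health_check

    def forward(self) -> bool:
        # Returns True if the transaction gets through this switch.
        return random.random() >= self.drop_rate


def route_transaction(primary: Switch, backup: Switch) -> bool:
    """Fail over only when the primary is detected as unhealthy."""
    active = primary if primary.is_healthy() else backup
    return active.forward()


if __name__ == "__main__":
    # Partial failure: the primary still passes health checks,
    # but drops roughly a third of the traffic it handles.
    primary = Switch("primary", drop_rate=0.35)
    backup = Switch("backup")

    attempts = 100_000
    ok = sum(route_transaction(primary, backup) for _ in range(attempts))
    print(f"processed successfully: {ok / attempts:.1%}")  # backup never activates
```

In this toy model the backup never activates because the naive check looks only at whether the primary responds, not at whether traffic is actually getting through, which mirrors the kind of gap Visa says it took several hours of manual intervention to close.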

The firm’s UK datacentre operations team were alerted to the faulty switch at 2.35pm on Friday 1 June, after noting a “partial degradation” in the performance of the company’s processing system, and initiated its “critical incident” response protocols, the letter continues.

“It took until approximately 19:10 to fully deactivate the system causing the transaction failures at the primary datacentre,” the letter states.

“By that time, the secondary data centre had begun processing almost all transactions normally. The impact was largely resolved by 20:15, and we were processing at normal service levels in both datacentres by Saturday morning at 00:45, and have been since that time.”

Visa is also quick to point out that at no point during the incident did a “full system outage” occur, but it admits the percentage of transactions processed successfully did fluctuate, with the peak periods of disruption occurring between 3.05pm and 3.15pm, and again between 5.40pm and 6.30pm.

During these windows, around 35% of attempted card transactions failed; outside them, the failure rate dropped to 7%.

“Over the course of the entire incident, 91% of transactions of UK cardholders processed normally; approximately 9% of those transactions failed to process on the cardholders’ first attempt,” the letter continues.

Failed transactions

In total, 51.2m Visa transactions were initiated during the outage, and 5.2m failed to go through.

Since the outage was resolved, Visa said it has focused its efforts on preventing a repeat of the events of 1 June, but it admits it is still not clear why the offending switch failed when it did.

“We removed components of the switch that malfunctioned and replaced them with new components provided to us by the manufacturer,” the company said.

It is also working with its hardware manufacturer to conduct a “forensics analysis” of the faulty switch, Visa added, and is undertaking a “rigorous” internal review of its processes.

“We are working internally to develop and install other new capabilities that would allow us to isolate and remove a failing component from the processing environment in a more automated and timely manner,” it said.

Visa is also “bringing in an independent third party to ensure we fully understand and embrace lessons to be learned from this incident”, the letter adds.
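The letter does not describe what form the new isolation capabilities would take. As a purely hypothetical sketch of the general pattern Visa alludes to, the example below removes a component from the active pool once its observed error rate crosses a threshold, rather than waiting for a manual decision; every name and number in it is an assumption made for illustration.

```python
"""Hypothetical sketch of automatically isolating a failing component.

Visa's letter only says it is developing capabilities to remove a failing
component "in a more automated and timely manner"; it gives no detail. This
shows one common pattern: components whose observed error rate crosses a
threshold are dropped from the active pool without waiting for a manual
decision. All names and thresholds here are assumptions.
"""

from dataclasses import dataclass


@dataclass
class Component:
    name: str
    successes: int = 0
    failures: int = 0

    def record(self, ok: bool) -> None:
        # Called by the monitoring layer for each observed transaction.
        if ok:
            self.successes += 1
        else:
            self.failures += 1

    @property
    def error_rate(self) -> float:
        total = self.successes + self.failures
        return self.failures / total if total else 0.0


def isolate_failing(pool: list[Component],
                    threshold: float = 0.05,
                    min_samples: int = 1_000) -> list[Component]:
    """Return the components kept in service; drop any over the error threshold."""
    kept = []
    for component in pool:
        samples = component.successes + component.failures
        if samples >= min_samples and component.error_rate > threshold:
            print(f"isolating {component.name}: error rate {component.error_rate:.1%}")
            continue
        kept.append(component)
    return kept


if __name__ == "__main__":
    healthy = Component("switch-a", successes=9_950, failures=50)
    degraded = Component("switch-b", successes=6_500, failures=3_500)

    active = isolate_failing([healthy, degraded])
    print("still in service:", [c.name for c in active])
```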
