
AWS apologises for 14-hour outage and sets out causes of US datacentre region downtime
Days after its largest US datacentre region experienced a lengthy outage, causing service disruption to web users across the globe, Amazon Web Services issues an explanation and apology regarding what happened
Amazon Web Services (AWS) has apologised to customers inconvenienced by the 14-hour outage that hit its largest US datacentre region on 20 October, in a blog post detailing the precise nature of the technical difficulties its services experienced.
As previously reported by Computer Weekly, the outage originated in the public cloud giant’s US-East-1 datacentre region in North Virginia, and caused large-scale disruption to a host of companies across the world, including in the UK.
Social media and communications services such as Snapchat and Signal suffered disruption, as did Amazon-owned services including its retail site, Ring doorbells and Alexa.
Financial services provider Lloyds Banking Group, along with its Halifax and Bank of Scotland subsidiaries, and the government tax collection agency HM Revenue and Customs, were also affected by the outage in the UK.
As a result, HM Treasury is now facing calls to explain why AWS, given its role as a major supplier of cloud services to the UK financial services sector, has not been brought into the scope of its Critical Third Parties (CTP) regime before now.
The initiative gives HM Treasury powers to designate suppliers to the financial services sector as critical third parties, meaning their activities can be brought into the supervisory scope of the UK’s various financial regulators.
The intention is that doing so will help to better manage any potential risks to the stability and resilience of the UK financial system that might arise when a third-party supplier suffers service disruption, as happened with AWS this week.
The company has now published an extensive post-event summary document, which confirms the outage unfolded in three distinct phases, as a result of issues within several parts of its infrastructure.
In it, the company said that just before 8am UK time on 20 October, its fully managed, serverless NoSQL database offering, Amazon DynamoDB, began to experience increased application programming interface (API) error rates, which lasted for just under three hours.
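For context, customers calling DynamoDB during a period of elevated API error rates would typically lean on client-side retries with backoff. The sketch below is purely illustrative and is not drawn from AWS’s summary document; the table name and key it uses are hypothetical, and it relies on the standard retry configuration exposed by the boto3 SDK.

```python
# Illustrative sketch only: client-side retry configuration for a period of
# elevated DynamoDB API error rates. The table name and key are hypothetical.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# Adaptive retry mode backs off automatically when the service returns errors
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
dynamodb = boto3.client("dynamodb", region_name="us-east-1", config=retry_config)

try:
    response = dynamodb.get_item(
        TableName="example-table",          # hypothetical table
        Key={"pk": {"S": "example-key"}},   # hypothetical partition key
    )
    print(response.get("Item"))
except ClientError as err:
    # Surface the error code (for example, throttling or internal errors)
    print(f"DynamoDB request failed: {err.response['Error']['Code']}")
```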
Then, from around 1pm UK time on 20 October, some of the network load balancers (NLB) within its US-East-1 region started to experience increased connection errors, which persisted until around 10pm the same day. “This was caused by health check failures in the NLB fleet, which resulted in increased connection errors,” the summary document stated.
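Health check failures of this kind are normally visible to customers through the load balancer’s target health API. The following sketch is a general illustration rather than anything described in AWS’s summary; the target group ARN is hypothetical.

```python
# Illustrative sketch only: inspecting NLB target health via the ELBv2 API.
# The target group ARN below is hypothetical.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

response = elbv2.describe_target_health(
    TargetGroupArn=(
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
        "targetgroup/example-tg/0123456789abcdef"  # hypothetical ARN
    )
)

for desc in response["TargetHealthDescriptions"]:
    target = desc["Target"]["Id"]
    state = desc["TargetHealth"]["State"]          # e.g. healthy / unhealthy
    reason = desc["TargetHealth"].get("Reason", "")
    print(f"{target}: {state} {reason}")
```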
In addition to this, AWS said issues occurred when attempts were made to launch instances of its Elastic Compute Cloud (EC2) virtual servers, an issue that persisted from around 10.30am UK time on 20 October until 6.30pm.
“New EC2 instance launches failed and, while instance launches began to succeed from 10:37 AM PDT [6.37pm UK time], some newly launched instances experienced connectivity issues which were resolved by 1:50 PM [9.50pm UK time],” the summary document continued.
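Customers hit by failed instance launches during the window described above would generally retry with backoff rather than fail outright. The sketch below illustrates that pattern in general terms; it is not taken from AWS’s summary, and the AMI ID and instance type shown are hypothetical.

```python
# Illustrative sketch only: retrying EC2 instance launches with backoff during
# a period of elevated launch failures. AMI ID and instance type are hypothetical.
import time
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_with_retries(max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        try:
            result = ec2.run_instances(
                ImageId="ami-0123456789abcdef0",  # hypothetical AMI
                InstanceType="t3.micro",
                MinCount=1,
                MaxCount=1,
            )
            return result["Instances"][0]["InstanceId"]
        except ClientError as err:
            code = err.response["Error"]["Code"]
            print(f"Launch attempt {attempt} failed: {code}")
            time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError("Instance launch did not succeed within retry budget")

print(launch_with_retries())
```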
It also confirmed that other AWS services hosted within US-East-1 suffered knock-on effects as a result of the issues experienced by DynamoDB, EC2 and its network load balancing setup.
“We are making several changes as a result of this operational event,” the company said. “As we continue to work through the details of this event across all AWS services, we will look for additional ways to avoid impact from a similar event in the future, and how to further reduce time to recovery.”
The company then concluded the summary document with an apology to any customers affected by the outage.
“While we have a strong track record of operating our services with the highest levels of availability, we know how critical our services are to our customers, their applications and end users, and their businesses,” said the summary document. “We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.”
Read more about the AWS outage
- Amazon Web Services (AWS) users have experienced an outage after the public cloud giant's North Virginia datacentre region became beset with technical difficulties.
- The government is facing questions about why a multi-hour outage originating in a US-based cloud region belonging to Amazon Web Services caused service disruption to UK banking giants and HM Revenue and Customs.