AWS claims human error to blame for US cloud storage outage

Cloud services giant says an input error by an engineer left large numbers of users unable to use its cloud storage services for several hours on Tuesday 28 February

Caroline Donnelly, Senior Editor, UK

Published: 03 Mar 2017 12:15

Amazon Web Services (AWS) says human error caused the cloud storage system outage, which lasted several hours and affected thousands of customers earlier this week.

Amazon’s Simple Storage Service (S3), which provides backend support for websites, applications and other cloud services, ran into technical difficulties on the morning of Tuesday 28 February in the US, returning error messages to those trying to use it.

The cloud service giant revealed the cause in a post-mortem-style blog post, and explained the issue can be traced back to some exploratory work its engineers were doing to establish why the S3 billing system was performing so slowly.

During this process, a number of servers – providing underlying support for two S3 subsystems – were accidentally removed, requiring a full restart, which caused the problems.

“An authorised S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” said the blog post.

“Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”

This affected instances of S3 run out of the firm’s US-East-1 datacentre region in Virginia, US, causing havoc for a number of high-profile websites and service providers, including the cloud-based collaboration platform Box and the instant and group messaging site Slack.

AWS platforms built on resilience

AWS goes on to say its platforms are built to be highly resilient, but that the full-scale restart of S3 took much longer than anticipated.

“We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes,” said the post.

“While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.

“S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected,” it added.

The incident has prompted AWS to re-evaluate the setup of its S3 infrastructure, the blog post continues, to prevent similar incidents from occurring in future.

“We want to apologise for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further,” it concluded.
