Amazon Web Services (AWS) says human error caused the cloud storage outage that lasted several hours and affected thousands of customers earlier this week.
Amazon’s Simple Storage Service (S3), which provides backend support for websites, applications and other cloud services, ran into technical difficulties on the morning of Tuesday 28 February in the US, returning error messages to those trying to use it.
The cloud service giant revealed the cause in a post-mortem-style blog post, explaining that the issue can be traced back to exploratory work its engineers were doing to establish why the S3 billing system was performing so slowly.
During this process, a number of servers – providing underlying support for two S3 subsystems – were accidentally removed, requiring a full restart, which caused the problems.
“An authorised S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” said the blog post.
“Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
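A common defence against this class of mistake is to have the tooling itself refuse any single command that would remove more than a small fraction of a fleet. The sketch below is purely illustrative – the function name, threshold and fleet sizes are hypothetical, not AWS's actual tooling – but it shows the kind of guard that can turn a fat-fingered input into a rejected command rather than an outage.

```python
# Hypothetical sketch of an input guard for a capacity-removal command.
# MAX_REMOVAL_FRACTION and the function below are illustrative assumptions,
# not a description of AWS's real playbook tooling.

MAX_REMOVAL_FRACTION = 0.05  # refuse to remove more than 5% of a fleet at once


def validate_removal(fleet_size: int, requested: int) -> int:
    """Return the number of servers safe to remove, or raise ValueError
    if the request exceeds the per-command safety threshold."""
    if requested <= 0:
        raise ValueError("must request at least one server")
    limit = max(1, int(fleet_size * MAX_REMOVAL_FRACTION))
    if requested > limit:
        raise ValueError(
            f"refusing to remove {requested} of {fleet_size} servers; "
            f"limit is {limit} per command"
        )
    return requested
```

With a 10,000-server fleet, `validate_removal(10_000, 5)` proceeds, while a mistyped `validate_removal(10_000, 5_000)` raises a `ValueError` instead of silently taking out half the fleet.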
This affected instances of S3 run out of the firm’s US East-1 datacentre region in Virginia, US, causing havoc for a number of high-profile websites and service providers, including cloud-based collaboration platform Box and group messaging service Slack.
The outage also had a knock-on impact on a number of AWS services, hosted from US East-1, that rely on S3 for backend support, including Amazon Elastic Compute Cloud (EC2), AWS Elastic Block Store, and AWS Lambda.
It also took down the AWS service status page, causing problems for users keen to find out when the firm’s systems would be back up and running again.
The downtime has prompted numerous industry commentators to speak up about the risks involved with running a business off the infrastructure of a single cloud provider, while others have seized on it to reinforce the importance of having a robust business continuity strategy in place.
AWS platforms built on resilience
AWS goes on to say its platforms are built to be highly resilient, but admits the full-scale restart of S3 took much longer than anticipated.
“We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes,” said the post.
“While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.
“S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected,” it added.
The incident has prompted AWS to re-evaluate the setup of its S3 infrastructure, the blog post continues, to prevent similar incidents from occurring in future.
“We want to apologise for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further,” it concluded.