Amazon S3 outage: Acknowledging the role humans play in keeping the cloud going

The Amazon cloud storage outage provides a neat reminder about the role humans continue to play in the delivery of online services, but – when things go wrong – end-user sympathy for the plight of the engineers involved is often in short supply, writes Caroline Donnelly.

The internet age has massively inflated end-user expectations around the uptime and availability of online services. So much so that, when the platforms we rely on to stream music, send emails or collaborate on work projects fall over, consumer patience quickly wears thin.

Evidence of this can be found by looking at Twitter during an outage and seeing what users have to say when a service they need is not available when they expect it to be.

Depending on the nature of the service that has gone down, the tone and content of messages can vary considerably from resigned acceptance to all-out fury, with a few snark-filled, meme-laced barbs often thrown in for good measure.

A couple of years ago, Ahead In the Clouds (AITC) sat through a DevOps presentation at the AWS user conference in Las Vegas called A Day in the Life of a Netflix Engineer.

During the session, Dave Hahn, a senior engineer at Netflix, touched upon the histrionic online outbursts its user base are prone to indulging in whenever the streaming service runs into technical difficulties.

“If any of you have ever monitored social media when there is an occasional Netflix outage, you’ll notice some people believe they’re going to die. I want to let you know, we checked and no-one has actually died,” he said.

While Hahn’s comments were made in jest, they serve as a handy reminder that, while it is annoying when services we rely on fall over, the disruption is usually relatively short-lived and rarely the end of the world.

Prolonged and widespread

The exception to that, of course, is when the downtime is prolonged, as was the case with SSP Worldwide’s two-week service outage in the summer of 2016, or when the failure of one service has far-reaching implications for many others.

The Amazon Web Services (AWS) cloud storage outage on 28 February is an example of the latter, with its multi-hour downtime drawing attention to just how many people rely on its Simple Storage Service (Amazon S3) to underpin their online services and systems.

According to AWS, the cause of the downtime was a typo, made by an engineer while inputting a command. This in turn contributed to a larger-than-expected number of servers (hosted within the firm’s US-East-1 datacentre region) falling offline.

During the course of the downtime, and for several days after, Twitter was full of people making light of the situation, and the fact a humble typo could prove so disruptive to the world’s biggest cloud provider.

It seems it is all too easy to forget, or simply overlook, the critical role humans play in the creation, development and delivery of online services, particularly in light of the column inches regularly devoted to how automation and robotics are changing the way lots of industries operate nowadays.

Whenever an errant server misbehaves, it is still the job of an engineer to respond to the system alert and get to work on solving the problem, possibly with the assistance (but sometimes not) of their colleagues.

If that call comes in the middle of the night, it is the engineer whose sleep gets disrupted or whose personal life gets put on hold so they are ready to respond to any incidents that may occur on their watch.

Human error in the Amazon S3 outage

In the case of a company the size of Amazon, the pressure to perform and rectify the problem as quickly as possible will be all the greater, given just how many organisations and people depend on its platforms.

Among all of the social media snark about the Amazon S3 outage was a sizeable number of tweets, indexed under the #HugOps hashtag, taking a whole lot more empathetic point of view on the situation and the plight of the people tasked with sorting it.

Rather than point fingers and make jokes, people were using the hashtag to wish the AWS engineering team well, and pass on their support for the engineer whose typo reportedly caused it all.

Someone has even created a GoFundMe page for the engineer to raise money for – as the post says – all the “alcohol or therapy, or both” the individual concerned will need to get over what occurred.

“This campaign is intended in the most light-hearted and supportive way possible. It’s not easy to be the root cause of an outage, and this was a big one,” the page reads.

A lot of the people making use of the #HugOps hashtag work in IT, and are sympathetic to the plight of the person involved as they’ve probably had first-hand experience of being in a similar situation themselves.

That is why so many of the posts sporting the hashtag have an air of “there but for the grace of God go I” about them. All most users see, however, is the inconvenience caused by not having “always-on” access to their favourite services.

As is the case with on-premises systems, sometimes things just fail or don’t perform the way we think they should, and it is time users grew to appreciate and understand that, because internet access is a privilege, not a human right.

And ranting and raving online about why something isn’t behaving the way it should is likely to exacerbate an already god-awful situation for someone, somewhere, tasked with repairing it.

While letting off some online steam might make you feel better, it’s not going to get what’s broken back up and running any quicker.

So next time an outage occurs, spare a thought for the engineers beavering away behind the scenes, trying to get things up and running again, before you go off on an extended rant at the company on social media.

Put yourself in their shoes. If you went to work and made a mistake that tens of thousands of people on the internet shouted at you about, how would that make you feel?