Amazon cloud crash forced Australian enterprises to take heads out of sand

Enterprises in Australia need to rethink their cloud backup strategies following the recent AWS outage in Sydney

This article can also be found in the Premium Editorial Download: CW ANZ: August 2016

In early June, Amazon Web Services’ (AWS) Sydney-based cloud was unavailable for up to 10 hours for some customers after power to the datacentre was cut during a storm.

The event demonstrated that even the world’s largest cloud computing platforms are vulnerable to periodic failure, which means enterprise cloud users must still consider business continuity planning – particularly for mission-critical applications.

The AWS cloud is used by Australian organisations such as the Commonwealth Bank, accounting software business MYOB and ad trading platform Brandscreen, and it hosts the popular consumer game Fruit Ninja.

While Fruit Ninja players might have been momentarily frustrated by being unable to blow up a banana, business service disruption is far more serious, and the incident served as a reminder that enterprises cannot ignore business continuity planning even if they have signed up for cloud.

Amazon’s service health website, which tracks the performance of its cloud services, shows that the EC2 service in the Sydney region was down for about two hours during the storm, with knock-on effects for other Amazon cloud services, such as Redshift, Elastic Beanstalk, Storage Gateway and CloudFormation. After 10 hours, most of the issues had been resolved.

Five days after the outage, Amazon released a post mortem of the incident, which said that the electricity substation feeding the datacentre was blacked out in the storm and AWS’s uninterruptible power supply failed.

Even after power was restored, a software bug in AWS’s instance management software meant that recovery was slower than predicted for customers.

AWS has apologised to customers for the inconvenience and is now overhauling its power supply infrastructure and software to reduce the chances of it happening again.

But it also noted: “For this event, customers that were running their applications across multiple availability zones in the region were able to maintain availability throughout the event. For customers that need the highest availability for their applications, we continue to recommend running applications with this architecture.”

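Spreading workloads across availability zones keeps data within the Sydney region. The next step up, failing over to a second region, would not have been an option for customers with data sovereignty concerns, because AWS does not have datacentres in multiple locations in Australia, only in the Sydney area.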
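For readers unfamiliar with the multi-zone architecture AWS refers to, the idea can be sketched in a few lines of Python. This is an illustration only, not AWS code: the `Zone` class, the `pick_zone` function and the zone names are hypothetical, and a real deployment would rely on a load balancer and health checks rather than a hand-written loop.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """One availability zone hosting a copy of the application (hypothetical)."""
    name: str
    healthy: bool = True


def pick_zone(zones, start=0):
    """Return the first healthy zone at or after index `start`, wrapping around.

    Returns None when every zone is down, which is the case a single-zone
    deployment hits as soon as its only zone fails.
    """
    count = len(zones)
    for offset in range(count):
        zone = zones[(start + offset) % count]
        if zone.healthy:
            return zone
    return None


if __name__ == "__main__":
    # Hypothetical zone names, not real AWS identifiers.
    zones = [Zone("sydney-zone-a"), Zone("sydney-zone-b"), Zone("sydney-zone-c")]
    zones[0].healthy = False  # simulate the zone that lost power
    survivor = pick_zone(zones)
    print(survivor.name if survivor else "no healthy zone")
```

With three zones, losing one still leaves somewhere to send traffic. With one zone, `pick_zone` has nothing to return.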

Ostrich approach

According to Gartner research director Olive Huang, too many companies take an “ostrich approach” to cloud business continuity, expecting their cloud suppliers to take care of that side of the house. “You can have redundancy, but it costs money,” she said. “People go to the public cloud very ill-prepared.”

Huang said that although IT departments running in-house systems might have business continuity and disaster recovery plans designed around the degree of systems failure a business could tolerate, such planning was often lacking when companies bought cloud.

The problem is compounded because cloud services are often bought not by IT, but by the business, so much less thought is put into business continuity, she said. “Only when these things happen, someone needs to clean up,” she added.

Alan Trefler, founder and CEO of Pegasystems, believes cloud computing has an important role to play in many enterprises – Pegacloud runs on the Amazon cloud – but warns that it is not a panacea for all business needs.

Speaking to Computer Weekly at Pegaworld in Las Vegas, he said: “People use the cloud in all sorts of bizarre and fantastic – not in a good way – ways.”

Trefler warned that although cloud had a role – particularly hybrid solutions blending private and public clouds with on-premises systems – it should not be treated as a substitute for an effective computing strategy.

He said that when determining where to locate computing workloads, “a lot of it depends on the consequence of failure”.

For Netflix, for example, a cloud failure might require people to restream videos, which is not an insurmountable problem. But for medical applications that link health monitoring devices to a care management platform, loss of cloud connectivity could be a life-or-death issue, said Trefler.

Pegasystems has 100 users of its Pegacloud internationally, 10 of which are based in Australia and include banks and governments, according to Scott Leader, managing director of Pegasystems ANZ. When the AWS Sydney cloud went down, “there was some impact, but it was resolved quickly”, he said.

Insurance giant QBE is currently piloting Pegacloud in some of its emerging markets. “We have had outages before, but generally the infrastructure is stable,” said Julian Anderson, the company’s head of digital innovation and IT strategy. “It would be more of a problem for a bank rather than an insurance company.

“One thing we would look to do is have a version of the app online so that people can continue to work without cloud connection, and then back up the transactions later.”
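The approach Anderson describes, working locally and sending transactions once the connection returns, is often called store-and-forward. A minimal Python sketch follows, assuming a hypothetical `send` function that raises `ConnectionError` when the cloud is unreachable. The `OfflineQueue` and `FakeCloud` names are invented for illustration and do not describe QBE’s or Pegasystems’ software.

```python
from collections import deque


class OfflineQueue:
    """Holds transactions locally and forwards them in order when possible."""

    def __init__(self, send):
        # `send` is an assumed callable that raises ConnectionError
        # when the cloud service cannot be reached.
        self.send = send
        self.pending = deque()

    def submit(self, transaction):
        """Record a transaction locally, then try to forward the backlog."""
        self.pending.append(transaction)
        return self.flush()

    def flush(self):
        """Forward queued transactions oldest first; stop at the first failure.

        Returns the number of transactions delivered in this attempt.
        """
        delivered = 0
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                break  # still offline: keep the backlog for the next attempt
            self.pending.popleft()
            delivered += 1
        return delivered


class FakeCloud:
    """Stand-in for a cloud endpoint whose availability can be toggled."""

    def __init__(self):
        self.online = True
        self.received = []

    def send(self, transaction):
        if not self.online:
            raise ConnectionError("cloud unreachable")
        self.received.append(transaction)


if __name__ == "__main__":
    cloud = FakeCloud()
    queue = OfflineQueue(cloud.send)
    cloud.online = False
    queue.submit("claim-1")  # held locally while the cloud is down
    queue.submit("claim-2")
    cloud.online = True
    print(queue.flush(), cloud.received)
```

A transaction is removed from the queue only after `send` succeeds, so an outage in the middle of a flush loses nothing. A production version would also need to persist the queue to disk and guard against sending the same transaction twice.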

This raises a second business continuity issue. Even if the underlying cloud platform is working, unless communications networks are available, consumers, partners and remote workers will not be able to access it. For example, Telstra’s recent network collapse left many people unable to connect to cloud services for hours.

Gartner’s Huang added: “If the ship cuts the cable, you are screwed – and that has happened.”

Paid less attention

The reason enterprises paid less attention to business continuity planning when using cloud rather than in-house systems was partly “because of the cloud supplier marketing machine and the thought that they are so big that if they do go down, they should be back up quickly”, Huang added.

For companies that are researching the reliability of public cloud, CloudHarmony offers a dashboard for various suppliers and regions around the world. The service gives a helicopter view of the quality of cloud currently on offer – but cannot predict one-off events such as Sydney’s storm and its knock-on effects for AWS users.

Amazon has promised to raise its game, however.

Its post mortem concluded with an apology and noted: “We know how critical our services are to our customers’ businesses. We are never satisfied with operational performance that is anything less than perfect, and we will do everything we can to learn from this event and use it to drive improvement across our services.” 
