
AWS outage: Downtime incident blights users of one of Amazon’s major US datacentre regions

Public cloud giant suffers prolonged service outage due to fault originating in one of its major US datacentre regions

Amazon Web Services (AWS) users are awaiting a full explanation from the public cloud giant about the cause of a prolonged outage at one of its major US datacentre regions that began on Wednesday 25 November 2020, US time.

The downtime incident is known to have originated within the company’s US-East-1 datacentre region, and to have been caused by a defect in the application programming interface (API) of its real-time data-streaming service, Kinesis Data Streams (KDS).

The issue is known to have blighted the usability of a number of high-profile internet services that rely on KDS, many of which used the social networking site Twitter to confirm they were affected by the downtime. One said:

“An Amazon AWS outage is currently impacting Adobe Spark, so you may be having issues accessing/editing your projects. We are actively working with AWS and will report when the issue has subsided. https://t.co/uoHPf44HjL for current Spark status. We apologize for any inconvenience! – Adobe Spark (@AdobeSpark) November 25, 2020.”

The outage has also served to highlight the interdependencies that exist within the wider AWS portfolio, as the issues encountered by the KDS API are known to have negatively affected the performance of a number of other AWS services that rely on it to work. 

The company’s cloud service status page makes reference to other “dependent services” being affected by the outage, which AWS first acknowledged at around 2am GMT on Thursday 26 November.

For example, respondents to the AWS Support Twitter feed reported issues with its code building and test offering, CodePipeline, and its infrastructure monitoring service, Amazon CloudWatch. At one point during the outage, the service status page was also unavailable.

At the time of writing, the AWS service status dashboard confirmed that the company had resolved the issue and restored service to all of the affected parts of the AWS portfolio, but no further details have been given about the circumstances that led to the outage.

“We have identified the root cause of the Kinesis Data Streams event, and have completed immediate actions to prevent recurrence. Kinesis and CloudWatch are operating normally,” said a statement on the AWS Service Status page, published just after 9am GMT today.


Liz Beavers, head geek at IT monitoring software provider SolarWinds, said the scale of the outage suggests AWS’s outage management strategies leave a lot to be desired.

“Without strong incident and problem management strategies in place, we see widespread outages with a high impact like the one today from AWS,” she said. “With many different units and customers interconnected through the AWS platform, it is crucial that Amazon partners have an IT service desk strategy for streamlining and resolving repeat incidents, which typically occur with a large IT outage like this one.

“Part of having a strategic service desk response to an outage is also equipping IT teams with a singular communication channel to publicise the known issue across the organisation. Not only does this help contextualise the full impact of the problem, it enables IT to troubleshoot more effectively and in some cases publish documentation for potential workarounds.”

Mike Kiersey, principal technologist at Dell Technologies-owned integration platform-as-a-service (PaaS) provider Boomi, said the incident highlights just how dependent large parts of the digital economy are on real-time streaming data.

“The issues affecting Kinesis underline the absolute need to be able to process and manage real-time data,” he said. “If the data stream stops functioning, the fallout can be huge, especially for cloud providers.

“Managing real-time data comes down to effective integration and monitoring, which allows for a seamless transition into a more modernised data fabric network. By having a responsive integrated platform, data points become more accessible, agile and transparent to understand how applications communicate.”

Kiersey added: “Organisations need to consider how they are architecting and integrating the streaming platform into the core fabric of their enterprise architecture, united by master data management which has the potential to cross departmental and geographic borders.”
