AWS DevOps Guru offers transcendental remediation

Amazon Web Services, Inc. (AWS), is, just in case you hadn’t noticed, a big company.

As such, it can pretty much name its products however it likes… and Amazon DevOps Guru is a case in point.

Not quite a human yogi, shaman or master of knowledge, this is in fact a fully managed operations service that uses machine learning (ML) so developers can improve application availability.

It works by automatically detecting operational issues and recommending specific actions for remediation.

This technology examines application metrics, logs, events and traces for behaviours that deviate from normal operating patterns.

When Amazon DevOps Guru identifies anomalous application behaviour that could cause outages or service disruptions, it alerts developers with details of the issue, its potential impact and its likely causes, along with specific recommendations for remediation.

Remediation situation

Developers can use remediation suggestions from Amazon DevOps Guru to reduce time to resolution when issues arise and improve application availability.

There is no manual setup or machine learning expertise required and customers pay only for the data Amazon DevOps Guru analyses.

The company says that as more organisations move to cloud-based application deployment and microservice architectures to scale their businesses, applications have become increasingly distributed and developers need more automated practices to maintain application availability and reduce the time and effort spent detecting, debugging and resolving operational issues.

But what causes those downtime operational issues?

Application downtime can be caused by faulty code or config changes, unbalanced container clusters, or resource exhaustion (e.g. CPU, memory or disk), all of which inevitably lead to bad experiences and lost revenue.

So what have we been doing traditionally?

Traditionally, companies have invested considerable developer resources, time and money in deploying multiple monitoring tools (which are often managed separately), and have then had to develop and maintain custom alerts for common issues like spikes in load balancer errors or drops in application request rates.

Setting thresholds to identify and alert when application resources are behaving abnormally is difficult to get right, involves manual setup, and requires thresholds that must be continually updated as application usage changes (e.g. an unusually large number of requests during a sales promotion).

If a threshold is set too high, developers don’t see alarms until operational performance is severely impacted. When a threshold is set too low, developers get too many false positives, which they are prone to ignore.
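To make that trade-off concrete, here is a minimal Python sketch of fixed-threshold alerting. The error counts and the three threshold values are invented for illustration, and the `alerts` helper is hypothetical rather than part of any AWS tool.

```python
# Hypothetical per-minute load balancer error counts (invented values).
# Minutes 8 and 9 represent a genuine incident; minute 5 is a harmless blip.
errors_per_minute = [2, 3, 2, 4, 3, 12, 3, 2, 40, 45, 3, 2]


def alerts(series, threshold):
    """Return the indices of samples that exceed a fixed threshold."""
    return [i for i, value in enumerate(series) if value > threshold]


too_low = alerts(errors_per_minute, 3)    # fires on normal noise as well
too_high = alerts(errors_per_minute, 50)  # never fires, even during the incident
tuned = alerts(errors_per_minute, 20)     # fires only during the incident

print("threshold 3: ", too_low)
print("threshold 50:", too_high)
print("threshold 20:", tuned)
```

The tuned value only works for this particular traffic pattern. If normal error counts rise, for example during a sales promotion, the same number starts producing false positives and has to be revisited by hand.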

Even when developers get alerted to a potential operational issue, the process of identifying the root cause can still prove difficult. Using existing tools, developers often have difficulty triangulating the root cause of an operational issue from graphs and alarms.

Even when they are able to find the root cause, they are often left without the right information to fix it.

Amazon DevOps Guru’s machine learning models leverage over 20 years of operational expertise in building, scaling, and maintaining highly available applications for Amazon.com.

Guru learnings

The ‘guru’ uses a pre-trained machine learning model to identify deviations from an established baseline (e.g. under-provisioned compute capacity, database I/O over-utilization or memory leaks).
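AWS has not published how its model works, so the following is only a rough Python sketch of the general idea of baseline deviation. It uses a simple rolling z-score as a stand-in for the pre-trained model; the `deviations` function, its parameters and the latency figures are all assumptions made for illustration.

```python
from statistics import mean, stdev


def deviations(series, window=5, z_limit=3.0):
    """Flag samples far from the mean of the preceding `window` samples.

    A sample is flagged when it sits more than `z_limit` standard
    deviations away from that rolling baseline.
    """
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change as a deviation.
            if series[i] != mu:
                flagged.append(i)
            continue
        if abs(series[i] - mu) / sigma > z_limit:
            flagged.append(i)
    return flagged


# Hypothetical latency samples in milliseconds (invented values).
latency_ms = [100, 101, 99, 100, 102, 101, 100, 180, 101, 100]
print(deviations(latency_ms))
```

Unlike a fixed threshold, the baseline here moves with the data, so nobody has to pick a magic number for each metric in advance. A production system would need something far more robust than this, not least because a single spike distorts the baseline for the samples that follow it.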

It also correlates and groups related application and infrastructure metrics (e.g. web application latency spikes, running out of disk space or bad code deployments) to reduce redundant alarms and help focus users on high-severity issues.
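How the service actually correlates signals is not described, but the principle of collapsing related alarms can be sketched in a few lines of Python. This example simply groups anomalies that occur close together in time; the `Anomaly` class, the `group_anomalies` function, the two-minute gap and the sample events are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Anomaly:
    metric: str
    timestamp: int  # seconds since an arbitrary start point


def group_anomalies(anomalies, gap_seconds=120):
    """Group anomalies that occur within `gap_seconds` of the previous one."""
    groups = []
    for anomaly in sorted(anomalies, key=lambda a: a.timestamp):
        if groups and anomaly.timestamp - groups[-1][-1].timestamp <= gap_seconds:
            groups[-1].append(anomaly)  # close in time: same incident
        else:
            groups.append([anomaly])    # start a new incident
    return groups


# Invented events: three related signals, then an unrelated one much later.
events = [
    Anomaly("web_latency", 1000),
    Anomaly("disk_free", 1030),
    Anomaly("deploy_errors", 1090),
    Anomaly("web_latency", 5000),
]

for group in group_anomalies(events):
    print([a.metric for a in group])
```

Four raw alarms become two incidents, which is the kind of reduction that lets an on-call developer look at one grouped issue rather than several separate pages.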

To help resolve issues, Amazon DevOps Guru provides recommendations with remediation steps and integrates with AWS Systems Manager for runbook and collaboration tooling, giving users the ability to more effectively maintain applications and manage infrastructure for their deployments.

Swami Sivasubramanian, vice president of Amazon Machine Learning at AWS, says that with a few clicks in the AWS Management Console, users can enable Amazon DevOps Guru to begin analyzing account and application activity within minutes.

Together with Amazon CodeGuru—a developer tool powered by machine learning that provides intelligent recommendations for improving code quality and identifying an application’s most expensive lines of code—Amazon DevOps Guru provides customers the automated benefits of machine learning for their operational data so that developers can more easily improve application availability and reliability.
