Sleep can become very hard to come by when you end up on the hook for after-hours maintenance on hundreds of microservices at a major consumer site.
Sleep deprivation was just one of the problems faced by Andrew Rampling, a DevOps engineer for Carsales.com.au, Australia’s leading website for selling cars, motorcycles, boats and caravans, when he began the on-call role at the firm.
At the time, Carsales, which lists more than 200,000 cars for sale and handles about four million unique users a month, did not have an incident management alert system.
The on-call role fell to one operations person who was responsible for ensuring the reliability, availability, security and performance of over 500 microservices-based applications, said Rampling.
But managing outages across all those services was tricky after hours.
“It really did require some heroics to pull that off night after night, up to seven nights in a row until it was the next on-call responder’s turn,” Rampling told a PagerDuty event in Sydney in June.
Peak user traffic on Carsales arrives after dinner between 7pm and 10pm. The site gets about 500,000 searches per hour.
“If something does go wrong, we need to be alerted immediately or Carsales starts to lose money,” said Rampling.
The problem for Carsales was that alerts for failing applications did not necessarily point to where the real fault lay. It would often take an on-call operator some time to work out whether they could handle the incident themselves, or whether it needed to be passed on to the team in charge of a particular service, such as “sign-in”.
Also, the sheer number of alerts and reporting systems hooked into the Carsales platform tended to create noise instead of clarity.
“We had so many alerting systems,” said Rampling. “You had to keep tabs on emails and SMSes, there were Slack notifications, phone calls from customer service, and for due diligence you should also be logging into the administration console every hour or so to make sure the queues are behaving and under control.”
Rampling described being woken up every hour by alerts, most of them false. It would take about fifteen minutes to discover whether an alert was real or not. “After being woken up nine times in a night, I felt like trash,” he added.
He discussed the nightly alert tsunami with his manager, who was already in discussions with incident response platform provider PagerDuty.
A PagerDuty trial within the operations team ensued. This went well, and Carsales’ DevOps managers approached the company CIO about a general roll-out, which was subsequently approved.
Rampling remembers well his first on-call night with PagerDuty in place: “There’s no on-call entry for that night. I didn’t get woken up at all. It was a great night’s sleep with zero false alarms.”
With PagerDuty on board, Carsales changed the way alerts were handled. Instead of every alert going straight to on-call operations staff, alerts could be sent directly to the teams responsible for a particular application.
“For example, if there was an alert for the sign-in application, that alert was going straight to the membership team – it wasn’t going to the ops team first. If it was a core services application, it was going directly to that team.”
This eliminated the bottleneck in which an on-call operator had to figure out whether a problem could be fixed on the spot or needed to be escalated to a specialist team.
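The routing change Rampling describes can be illustrated with a minimal Python sketch. It is not PagerDuty's API or Carsales' actual configuration: the `Alert` type, the `ROUTES` table, the service and team names, and the fallback to an operations team are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str   # name of the application that raised the alert
    summary: str   # short human-readable description


# Hypothetical service-to-team mapping; names are illustrative only.
ROUTES = {
    "sign-in": "membership",
    "core-services": "core-services",
}

# Assumption: alerts for unmapped services still go to on-call operations.
FALLBACK_TEAM = "operations"


def route(alert: Alert) -> str:
    """Return the team that should be paged for this alert."""
    return ROUTES.get(alert.service, FALLBACK_TEAM)


if __name__ == "__main__":
    print(route(Alert("sign-in", "login failures rising")))
    print(route(Alert("image-resizer", "queue depth high")))
```

The point of the lookup is that the owning team is decided before anyone is woken up, rather than by the on-call operator after the page arrives.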
Metrics showed the PagerDuty initiative was paying off. In the first month with PagerDuty switched on, Carsales’ mean time to acknowledge (MTTA) for a problem was two hours. “Over six months, it was dropping by about twenty minutes per month until it got to a MTTA of two minutes,” said Rampling.
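For readers unfamiliar with the metric, MTTA is the average gap between an incident being triggered and someone acknowledging it. The sketch below shows one way to compute it in Python; the function name and the sample timestamps are hypothetical and are not Carsales data.

```python
from datetime import datetime, timedelta


def mean_time_to_acknowledge(incidents):
    """Average of (acknowledged_at - triggered_at) over all incidents.

    `incidents` is a list of (triggered_at, acknowledged_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one acknowledged incident")
    total = sum((ack - trig for trig, ack in incidents), timedelta())
    return total / len(incidents)


if __name__ == "__main__":
    # Illustrative timestamps only.
    sample = [
        (datetime(2019, 5, 1, 21, 0), datetime(2019, 5, 1, 21, 1)),
        (datetime(2019, 5, 2, 22, 30), datetime(2019, 5, 2, 22, 33)),
    ]
    print(mean_time_to_acknowledge(sample))
```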
The number of incidents also fell sharply. In November 2018, the firm had 578 high-urgency incidents. By May 2019, this had dropped to 225, a reduction of about 61%.
PagerDuty is a software-as-a-service (SaaS) product that can be delivered through the cloud or integrated into a client’s systems. The company was founded in 2009 and listed on the New York Stock Exchange in April 2019.
“Delivering an always-on experience, no matter the industry, is getting harder and harder to do,” said Jonathan Rende, PagerDuty’s senior vice-president of products and marketing. “User expectations are at an all-time high.”
Rende said the hundreds, or even thousands, of services that underpin modern applications all have people behind them to keep them running. “Orchestrating them is an issue and understanding the upstream and downstream implications is impossible without some level of intelligence,” he said.
PagerDuty uses artificial intelligence (AI) and machine learning to aggregate and sort the telemetry streaming in from the many management and monitoring agents and tools in the modern software stack. The aim is to pin down exactly which piece is failing and who can fix it.
“By pulling that all together, we view ourselves as a central nervous system. Then, we can orchestrate these teams in a more efficient way with the context they need to be more effective at fixing the problems that are coming their way,” Rende said.
Some PagerDuty use cases are moving outside of traditional IT operations. San Francisco-based online grocery delivery firm Good Eggs uses PagerDuty to monitor its refrigeration systems. “When any kind of temperature issues emerge, the right teams are deployed to address that before there is any spoilage,” said Rende.
On what is next for PagerDuty, Rende said one of the big opportunities is to figure out how it can deliver more context, so people can understand the dependencies upstream and downstream from them.
“Is this change just in my world, my swim lane, or is it related to what other people are doing so we can act in a coordinated fashion? This notion of how do I know not only my world, but outside of my world so that I can be more effective – that is a problem that has not been solved in the world of DevOps and complex systems,” he added.