How Tokopedia is streamlining incident management

Indonesian e-commerce giant Tokopedia has improved incident management and developer productivity using a cloud-based incident management tool

At Tokopedia, one of Indonesia’s largest unicorns, digital transformation is more than just digitising manual processes and embracing newfangled technologies. After all, the e-commerce giant was born digital in 2009, and has since added digital logistics and finance services to its array of products.

As its product portfolio and customer base grew, it became clear that the company needed to rethink its technology stack to keep pace with a business that serves more than 9.2 million merchants and entrepreneurs.

For a start, it moved away from monolithic applications to a more scalable microservices-based architecture that delivers containerised applications. Its engineering teams, spread across Jakarta, Singapore and India, also built an in-house incident management tool to address IT issues as they occur.

But maintaining in-house management tools took engineering resources away from the company’s focus on developing applications that solve customer problems and improve customer experience. That’s when it decided to look outside for a solution.

It came across PagerDuty, a cloud-based incident response platform, and found that it “gets incident management implemented in a smooth way”, according to Rajesh Gopala Krishnan, associate vice-president for engineering productivity at Tokopedia.

Krishnan said that after rolling out PagerDuty for five of Tokopedia’s services in a proof-of-concept project, the company saw dramatic improvements in service performance indicators such as mean time to repair (MTTR) and decided to scale up the deployment to more than 300 services.
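For readers unfamiliar with the metric, MTTR is simply the average time between an incident being opened and being resolved. The short Python sketch below illustrates the arithmetic only; the function name and the sample timestamps are hypothetical and do not come from Tokopedia or PagerDuty.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average repair time over a list of (opened, resolved) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    # Sum the individual repair durations, starting from a zero timedelta.
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents, invented purely for illustration.
incidents = [
    (datetime(2021, 1, 4, 9, 0), datetime(2021, 1, 4, 9, 30)),    # 30 minutes
    (datetime(2021, 1, 5, 14, 0), datetime(2021, 1, 5, 15, 30)),  # 90 minutes
    (datetime(2021, 1, 6, 22, 0), datetime(2021, 1, 6, 23, 0)),   # 60 minutes
]

mttr = mean_time_to_repair(incidents)
print(mttr)  # 1:00:00
```

A falling MTTR across the same set of services is the kind of improvement the proof of concept would have been looking for.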

At the same time, Krishnan’s team was able to customise PagerDuty to Tokopedia’s incident management workflows, which were further streamlined to ensure integration with other tools the engineers were using.

Today, Tokopedia’s engineers are equipped with the PagerDuty app, which groups related alerts into a single incident, removing the need to make sense of a sea of alerts. “Rather than deal with scattered noise, we just go to one place to get the details of an incident,” said Krishnan.
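The idea of folding related alerts into one incident can be illustrated with a small Python sketch. It is not PagerDuty’s grouping logic, which the article does not describe. It assumes a deliberately simple rule: alerts from the same service that arrive within five minutes of an incident’s first alert belong to that incident. The `Alert` class, the `group_alerts` function and the service names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    timestamp: int  # seconds since some reference point
    message: str


def group_alerts(alerts, window_seconds=300):
    """Group alerts per service; assumed rule, not PagerDuty's actual algorithm."""
    incidents = []      # each incident is a list of alerts
    latest = {}         # service name -> index of its most recent incident
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        idx = latest.get(alert.service)
        # Join the existing incident if this alert falls inside its time window.
        if idx is not None and alert.timestamp - incidents[idx][0].timestamp <= window_seconds:
            incidents[idx].append(alert)
        else:
            incidents.append([alert])
            latest[alert.service] = len(incidents) - 1
    return incidents


# Hypothetical alerts, invented purely for illustration.
alerts = [
    Alert("checkout", 0, "latency high"),
    Alert("search", 30, "error rate high"),
    Alert("checkout", 60, "latency high"),
    Alert("checkout", 120, "timeouts"),
    Alert("checkout", 1000, "latency high"),
]

incidents = group_alerts(alerts)
print(len(incidents))  # 3
```

Here five alerts collapse into three incidents, so an engineer opens one checkout incident rather than three separate notifications.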

PagerDuty can also escalate an incident to the right person for resolution, based on Tokopedia’s escalation policy. With its in-house incident management system, Tokopedia engineers had to manually look up who that person was before reaching out to him or her.

And after an incident is resolved, the tool will collate all information about the incident, including when and where it occurred, which services were affected and the person who first looked at it.

“All of that had to be manually captured previously, but it is now automatically pushed into our ticketing system,” said Krishnan. “So, when we try to do a root cause analysis in the later stages, it’s a matter of looking up the ticket.”

To reduce the false positive rate, Krishnan’s team adjusts the tolerance level for alerts four to five times a week based on a slew of observability metrics, as well as the dependency between services.

Krishnan said the productivity of the engineering team has improved as they can now avoid working on the same IT issues repeatedly. And through root cause analyses of the incidents, Tokopedia has been able to improve the quality of its products.

PagerDuty is also being used to monitor the performance of new features that are being rolled out to a small group of users, said Krishnan, adding that this will help to identify anomalies and problems that need to be fixed before the features are deployed to Tokopedia’s entire user base.
