Downtime deterrent: Trustpilot SRE on using infrastructure as code to prevent site outages

Online reviews site Trustpilot has a huge global readership, and here its site reliability engineering manager shares details of how his team help to ensure its services are always on when its users need them

Caroline Donnelly, Senior Editor, UK

Published: 08 Jan 2019 10:30

With a user base that is not shy about telling companies what it thinks of how they operate, the technology team at business review website Trustpilot are only too aware that recurring and prolonged periods of downtime are unlikely to be tolerated.

Not by the consumers that rely on the site to guide their purchases, nor by the businesses that use the feedback within its published reviews to hone and improve the services they provide to the general public.

“Our end product is digital – it is a software-as-a-service (SaaS) product, so if we are down, our customers cannot use the site, so we have to be up at all times – that is crucial,” Morten Reinholdt Boelskifte, site reliability engineering (SRE) manager at Trustpilot, tells Computer Weekly.

According to the firm’s own user statistics, the site is home to 45 million user-generated reviews of 230,000 companies, which are read more than three billion times a month by consumers around the world.

“We are aiming for four nines of uptime and availability,” Boelskifte adds. “Do we hit it every single month? Most of the time, yes.”
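To put the four nines target in concrete terms, the short calculation below converts an availability percentage into a downtime allowance. It is an illustrative sketch only: the function name and the 30-day month are assumptions made for the example, not figures from Trustpilot.

```python
# Illustrative only: convert an availability target into a downtime allowance.
# The function name and the 30-day month are assumptions for this example.

def downtime_budget_minutes(availability: float, period_days: float) -> float:
    """Minutes of downtime permitted over the period at the given availability."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1.0 - availability)


FOUR_NINES = 0.9999  # "four nines" means 99.99% availability

monthly = downtime_budget_minutes(FOUR_NINES, 30)   # about 4.32 minutes
yearly = downtime_budget_minutes(FOUR_NINES, 365)   # about 52.56 minutes

print(f"30-day budget:  {monthly:.2f} minutes")
print(f"365-day budget: {yearly:.2f} minutes")
```

At 99.99%, a 30-day month leaves a little over four minutes of permitted downtime, which is why even a brief outage can cause a month to miss the target.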

Preparing for failure

As is often the case when mitigating downtime risks, preparation is key, and an example of that is how Trustpilot’s technology teams ready themselves for two of their biggest user traffic-generating events of the year – Black Friday and Cyber Monday.

“What we build for is current [demand] times ten,” says Boelskifte. “That’s the kind of ballpark [our infrastructure] has to be able to support.

“Every time it comes around, we do preparations for that within the team, and with all the feature teams by ensuring the individual site features they are responsible for will be ready to cope with the demand expected when Black Friday hits.

“Sometimes it can be in terms of double-checking that their architectures are set up correctly, and it might involve load balancing and testing services to pinpoint potential weak spots.”

Trustpilot operates a cloud-based, microservices-based architecture that comprises 600 individual services that, in turn, control 300 or so functions across the site.

“We are multi-cloud, because we are in both Amazon Web Services [AWS] and the Google Cloud Platform [GCP],” says Boelskifte. “We have multiple regions and environments.

“We are predominantly in AWS, but most of our big data is stored in GCP and also our machine learning models, because they use our big data to run queries on. Most of the consumer-facing part of Trustpilot is in AWS.”

Each site feature is taken care of by a distinct team of software engineers, who all operate on a “full stack ownership” basis, and are collectively responsible for deploying production code changes about 200 times a week.

“These can be bug fixes but also feature requests, and we can rapidly run through the changes, so that if we do a deployment that doesn’t succeed or work the way we thought, we can quickly implement a fix,” he says.

Introducing infrastructure as code

With such a high throughput of changes undertaken within an infrastructure containing so many moving parts, Boelskifte says there are tried and tested procedures in place to minimise the risk of disruption resulting from rogue code making it into production.

“We also have an unwritten rule in place that we are not deploying on a Friday, because it could be more of a hit and run, because if something goes wrong, it could affect the site over the weekend too,” he says.
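Trustpilot describes this rule as unwritten, but a team wanting to enforce something similar in a deployment pipeline could express it as a simple date check. The sketch below is hypothetical: the function name and the idea of calling it from a pipeline are assumptions, and it blocks Fridays only, as described above.

```python
# Hypothetical sketch of a "no Friday deployments" guard.
# Nothing here reflects Trustpilot's actual tooling.
from datetime import date

FRIDAY = 4  # date.weekday() counts Monday as 0, so Friday is 4


def deploys_allowed(day: date) -> bool:
    """Return False on Fridays, True on every other day."""
    return day.weekday() != FRIDAY


if __name__ == "__main__":
    today = date.today()
    if deploys_allowed(today):
        print(f"{today}: deployment may proceed")
    else:
        print(f"{today}: deployment blocked (Friday freeze)")
```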

The SRE team has also embraced automation and the principle of infrastructure as code (IaC) to help simplify and streamline certain processes – so much so that some of its feature teams are now following suit.

This means it has shifted away from manually provisioning and managing the technology stacks underlying its services, towards a more software-defined setup.

“When we started out, there was a bit of a learning curve, and the output of the team declined slightly,” says Boelskifte. “But with infrastructure as code, you can say the benefit you get is the sum of investment you add to it. So the more you invest in it, the more you can take back from it.”

As an example of the benefits IaC has brought to its SRE teams, Boelskifte cites the work that recently went into overhauling the firm’s escalation policies, which dictate how responsibility for resolving downtime incidents is assigned.

Following in the footsteps

For organisations with similarly high uptime expectations that are looking to follow in Trustpilot’s footsteps, Boelskifte says a move towards IaC should be a top priority.

“For us, it was something my team were really keen to do and we made the decision together to go down this route,” he says. “It was not easy to roll out, because there is a learning curve as you get to grips with the language and how it affects the infrastructure.

“It depends on how you as an organisation choose to embrace it. You can try to do everything yourself, relying on a programming language like Python, or you can opt into an already established setup like Terraform’s infrastructure as code software.

“There will still be a learning curve in figuring out how the system reacts when you do x, y and z. Here we treat IaC as regular software, so you have to test that what you say is happening, actually does.”
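One way to read the point about testing is as a check that the infrastructure actually running matches what the code declares. The sketch below shows that idea in its simplest form, comparing two plain dictionaries. The function name, setting names and values are all invented for illustration, and a real setup would read the observed state from the cloud provider or IaC tool rather than from a hard-coded dictionary.

```python
# Hypothetical sketch: compare declared infrastructure settings with observed ones.
# All names and values are invented for illustration.

def find_drift(declared: dict, observed: dict) -> dict:
    """Return {setting: (declared_value, observed_value)} for every mismatch."""
    drift = {}
    for key in declared.keys() | observed.keys():
        want = declared.get(key)   # None if the setting is not declared
        have = observed.get(key)   # None if the setting is not present
        if want != have:
            drift[key] = (want, have)
    return drift


declared = {"instance_count": 3, "region": "region-a", "tls_enabled": True}
observed = {"instance_count": 2, "region": "region-a", "tls_enabled": True}

drift = find_drift(declared, observed)
for setting, (want, have) in sorted(drift.items()):
    print(f"{setting}: declared {want!r}, observed {have!r}")
```

A check like this can run as an ordinary automated test, failing whenever the drift dictionary is not empty.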
