How Photobox keeps site reliability in the picture

Photobox’s site reliability head discusses how the photo book and personalised gifts site manages a complex microservices architecture

Cliff Saran, Managing Editor

Published: 23 Nov 2022 11:22

Over the past few years, Photobox has been on a journey to unify its e-commerce platform. At the start of 2022, the company merged with Albelli and, according to Alex Hibbitt, director of site reliability engineering at Photobox, it hopes to build out a solid base for the different brands in the group.

Photobox’s IT is based on a microservices architecture, running on the Amazon Web Services (AWS) public cloud. Over the Black Friday and Cyber Monday weekend each year, the company’s absolute peak of trading is five to six times its normal activity.

Peak shopping events are drawn out because of the nature of Photobox’s business. Customers wishing to buy personalised photo-based products, such as books, calendars, prints and gifts, upload digital images to the website, customise the layout of their chosen product over an extended period of time, then proceed to the checkout.

This puts significantly more strain on the back-end platforms that run Photobox’s business, compared with other retailers where the customer journey from product selection to checkout occurs in a matter of minutes.

Pulling together puzzle pieces

Monitoring every aspect of the platform is key, but when Hibbitt joined Photobox four years ago, each developer team used its own monitoring tools. “When I joined, we had 10 separate monitoring tools in place,” he says.

In terms of getting an overall view of the reliability of the platform, he says each tool covered an individual part of the full picture, which is one of the challenges of a microservices architecture. “You want to give teams the freedom to pick their tools, but this often can lead to tool proliferation across the organisation, which is what happened within Photobox,” he says.

According to Hibbitt, in isolation, an observability tool that is wrapped around a specific microservice can work perfectly well. “The challenge,” he says, “is when you cross boundaries between different microservices.” For instance, the customer experience journey at Photobox touches at least three different front-end services. It also requires another dozen or so back-end services.

Often in site reliability engineering, the team looks at the end-to-end customer experience. But, as Hibbitt points out, a customer’s journey on Photobox occurs over a protracted period of time.

“If you need to build a photo book, you dedicate your time to creating it,” he says. “You could do this within a couple of hours, but if you really want to create something special, where you’re putting a lot of love and effort into producing a photo book, it may take a week of working a couple of hours each night.”

This is the challenge Photobox faces when it comes to observability with teams using different tools. “It becomes impossible to track a customer journey like this, that runs over a long period of time across 10 different tools,” he says.

This was what Hibbitt faced when he experienced his first Black Friday at Photobox four years ago. “I was practically pulling my hair out because I couldn’t have enough windows open across our different tools,” he says.

Whenever he needed to investigate a particular problem, for example when a customer raised an issue with the site, Hibbitt found he had to use the monitoring tools the developers had originally deployed for observability of the microservices they had developed. This manual tracing of the customer journey would be impossible to scale, and is a problem that cannot be solved simply by hiring more site reliability engineers.

“You couldn’t expect a relatively new engineer to understand a customer journey when it’s so challenging to instrument across our stack,” he says. “You might have data coming in from one tool that is different to another tool, and you have no way of comparing this data. It’s an apples and oranges problem.”
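To make the “apples and oranges” problem concrete, here is a minimal sketch of what comparing data from two tools involves: each tool’s records have to be mapped into one shared shape before a single customer journey can be read end to end. Everything in it is an assumption for illustration. The two record formats, the field names, the `JourneyEvent` type and the helper functions are hypothetical and do not describe Photobox’s tools or data.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class JourneyEvent:
    """Common shape for one step in a customer journey (hypothetical)."""
    customer_id: str
    service: str
    timestamp: datetime
    ok: bool


def from_tool_a(record: dict) -> JourneyEvent:
    # Hypothetical tool A: epoch seconds and an HTTP status code.
    return JourneyEvent(
        customer_id=record["user"],
        service=record["svc"],
        timestamp=datetime.fromtimestamp(record["ts"], tz=timezone.utc),
        ok=record["status"] < 500,
    )


def from_tool_b(record: dict) -> JourneyEvent:
    # Hypothetical tool B: ISO 8601 timestamps and a boolean error flag.
    return JourneyEvent(
        customer_id=record["customerId"],
        service=record["service"],
        timestamp=datetime.fromisoformat(record["time"]),
        ok=not record["error"],
    )


def journey(events, customer_id):
    """Return one customer's events from all tools, oldest first."""
    return sorted(
        (e for e in events if e.customer_id == customer_id),
        key=lambda e: e.timestamp,
    )


events = [
    from_tool_a({"user": "c1", "svc": "editor", "ts": 1669407300, "status": 200}),
    from_tool_b({"customerId": "c1", "service": "upload",
                 "time": "2022-11-24T19:00:00+00:00", "error": True}),
]
for event in journey(events, "c1"):
    print(event.timestamp.isoformat(), event.service, "ok" if event.ok else "failed")
```

With 10 tools, this mapping would have to be written and maintained 10 times, which is the overhead a single observability platform is meant to remove.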

Looking at the big picture

Photobox has now introduced Dynatrace to provide standardisation for observability of its microservices. Hibbitt says the tool enables Photobox to have a common approach to looking at different microservices.

The company is also using the artificial intelligence (AI) in Dynatrace to automate alerts when a site reliability threshold is breached.

“We do not have to build out custom alerts and custom thresholds,” says Hibbitt. “Davis, the AI in Dynatrace, is very good at automatically understanding what our baseline for particular services looks like. It assesses error rates and the number of calls passing through different services to create a picture of the overall state of the Photobox platform.”
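The idea of alerting against a learned baseline rather than a hand-set threshold can be illustrated with a short sketch. This is not how Davis works internally, which the article does not describe. It is a simple stand-in that flags a value when it strays more than a few standard deviations from recent history. The function name, parameters and sample error rates are all assumptions made up for this example.

```python
from statistics import mean, stdev


def is_anomalous(history, current, k=3.0, min_samples=5):
    """Return True if `current` is more than k standard deviations
    away from the mean of `history` (a learned baseline)."""
    if len(history) < min_samples:
        # Not enough data to form a baseline, so do not alert.
        return False
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is a deviation.
        return current != baseline
    return abs(current - baseline) > k * spread


# Hypothetical per-minute error rates (percent) for one service.
error_rates = [1.0, 1.2, 0.9, 1.1, 1.0, 1.05, 0.95]

print(is_anomalous(error_rates, 1.1))  # within the normal range
print(is_anomalous(error_rates, 6.0))  # far outside the normal range
```

Because the threshold is derived from each service’s own history, the same function can be applied to a quiet service and a busy one without anyone choosing a number for either.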
