HumanOps calls for improved working conditions for infrastructure operations staff

The pressure of delivering 24/7 support for IT infrastructure is taking its toll on the people tasked with caring for it, warn HumanOps advocates

This article can also be found in the Premium Editorial Download: Computer Weekly: Apple lures open source developers with Swift

Pressure is growing on the technology industry to start paying closer attention to the mental and physical toll infrastructure management work takes on the health and well-being of on-call operations staff.

As people continue to become increasingly reliant on online services in the workplace and at home, the pressure on IT operations teams to keep these services up and running is growing accordingly.

Users have come to expect these services to be accessible at any time of day, which can mean antisocial working hours for IT operations staff and knock their work-life balance out of kilter whenever something goes wrong.

This has led to a groundswell of interest in the DevOps community in how the push towards continuous delivery of code and the expectation of round-the-clock support for IT infrastructure are affecting the health and well-being of the people responsible for delivering it.

The term “HumanOps” is used widely in the community to describe the human side of managing IT infrastructure. It has now given rise to its first standalone event on the issue, spearheaded by infrastructure monitoring company Server Density.

The company’s CEO and co-founder, David Mytton, spoke on the matter at the inaugural HumanOps Meetup in London on 19 May 2016. He said the IT industry is so preoccupied with the technological aspects of keeping its systems online that the welfare needs of staff are often overlooked.

“One of the things we look at every day is monitoring and the infrastructure of our customers. We make sure it’s up and running, we give you the data to fix problems, but we’re also waking people up when there is a problem, sending out alerts and getting people out of bed when things are broken,” said Mytton.

“We’ve noticed the industry focus on APIs [application programming interfaces], tools and automation and they are really cool, and we don’t want to stop people talking about those things, but there is not enough focus on the humans that are actually running these systems.”

The stress caused by antisocial working patterns and out-of-hours work calls is rarely considered by companies when they set expectations about the responsiveness of their IT operations staff, added Mytton.

“We’re not thinking about stress caused by being woken up at 3am, [and the impact it has on] your family and people in the same environment as you,” he said.

“Companies [need to] think about this when prioritising work, and building their tools and systems so the humans running them can also be running at maximum efficiency. All those metrics you see in the infrastructure should be applied to humans as well.”

Redistributing responsibility

The event’s aim is to draw attention to the HumanOps movement and kick-start a conversation in the IT industry about the physical and mental toll infrastructure management work can take on the people doing it.

Francesc Zacarias, a site reliability engineer at Spotify, spoke at the event about how the music streaming site’s decision to adopt a more DevOps-friendly approach to management has helped reduce the pressure its operations staff find themselves under.

Five years ago, Spotify had multiple teams of software developers, all passing their code to a single team who were responsible for carrying out all of the company’s operations tasks, he said.

“Every time the developers wanted to do a change in production, it had to be reviewed by operations people. Back then we had hundreds of services, meaning there were dozens of changes going through every day, and operations became a bottleneck for the deployment of services,” he said.

“All the operations people had to be on call for all of the services, and – with there being so many – every night something would break, so it was a very stressful role.”

The company has since moved to create a number of cross-functional teams, containing a mix of developers and operations staff, with the former now encouraged to take a more active role in the management and deployment of the code they produce.

“Developers will not only create services, but they will have to do monitoring and set backups,” he said. “It’s all the operations stuff and they take on-call.”

Operations staff are now only expected to be on call for the services their own cross-functional team is responsible for, rather than for Spotify’s entire infrastructure.

As a result, there is now no standalone operations team at Spotify, aside from a group responsible for providing the whole company with access to infrastructure resources.

“Everyone is doing operations tasks in one way or another, so there is no operations team any more. The development teams are completely independent and can do everything themselves,” he said. 

“Operations does not need to take care of the entire company, and we manage to scale and grow.”
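The ownership model Zacarias describes, in which each team is on call only for its own services, can be illustrated with a short routing sketch. This is not Spotify’s tooling: the `Team` class, the `route_alert` function and the team, service and staff names are all hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Team:
    """A hypothetical cross-functional team that supports the services it owns."""
    name: str
    services: set = field(default_factory=set)
    on_call: list = field(default_factory=list)  # rota; first entry is on call now


def route_alert(service: str, teams: list) -> Optional[str]:
    """Return the on-call person from the team that owns the failing service."""
    for team in teams:
        if service in team.services and team.on_call:
            return team.on_call[0]
    # No owner found: treat it as an ownership gap rather than paging everyone.
    return None


# Illustrative data only; none of these names come from the article.
teams = [
    Team("playlists", {"playlist-api", "playlist-store"}, ["dev-a", "ops-b"]),
    Team("search", {"search-api"}, ["ops-c", "dev-d"]),
]

search_responder = route_alert("search-api", teams)
unowned_responder = route_alert("billing-api", teams)
```

Under these assumptions an alert for `search-api` reaches only the search team’s rota, and an alert for a service nobody owns pages no one, which surfaces the gap instead of waking a central operations team.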

Prioritising on-call alerts

Bob Walker, head of web operations for GOV.UK, said the Government Digital Service (GDS) has systems in place to minimise the number of scenarios in which its staff could be called on to work out of hours.

A nine-strong team provides out-of-hours, on-call cover, working shifts in pairs, but there are just half a dozen scenarios that will result in them being paged.

These include updating a website with up-to-the-minute information in the event of a terrorist attack, or publishing an urgent piece of legislation online.

“We’re very careful about this to make sure it only happens when something user critical has occurred, because our focus is on user needs,” said Walker.

“In three years of being on call, I was only called out of hours five times and only once was that when I was asleep, and it turned out to be a false alarm.”
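The approach Walker describes amounts to an allowlist: only a short, agreed set of scenarios is allowed to page anyone out of hours. A minimal sketch of that idea follows. It is not GDS’s actual system; the scenario identifiers and the `out_of_hours_action` function are hypothetical, and only the two scenarios named in the article are listed.

```python
# Hypothetical identifiers for the pageable scenarios. The article says there
# are about half a dozen and names two, so only those two appear here.
PAGEABLE_SCENARIOS = {
    "emergency-information-update",
    "urgent-legislation-publish",
}


def out_of_hours_action(scenario: str) -> str:
    """Decide what an out-of-hours alert should do.

    Anything outside the allowlist waits for working hours instead of
    waking the on-call pair.
    """
    if scenario in PAGEABLE_SCENARIOS:
        return "page"
    return "queue-for-morning"


legislation_action = out_of_hours_action("urgent-legislation-publish")
disk_warning_action = out_of_hours_action("disk-usage-warning")
```

Keeping the list explicit and short means adding a new reason to wake someone is a deliberate decision rather than a side effect of adding another monitoring check.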

Whenever an out-of-hours alert occurs, a review is carried out the following day to establish whether the problem was worth waking someone up in the middle of the night to fix.

“We carry out a mini incident review each time because we need to see if this call was worth having, was it a false alarm, are we monitoring the wrong thing, and we want our people to be happy, and we all like sleep,” he added.
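The questions Walker lists (was the call worth having, was it a false alarm, is the wrong thing being monitored) could be captured as a small record per page and tallied over time. The sketch below assumes a hypothetical `PageReview` structure and verdict labels; none of it comes from GDS, and the sample reviews are invented purely to exercise the code.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PageReview:
    """Hypothetical record of one next-day mini incident review."""
    alert: str
    user_critical: bool  # did the problem actually affect users?
    false_alarm: bool    # did the alert fire with nothing wrong?


def verdict(review: PageReview) -> str:
    """Turn a review into a follow-up action."""
    if review.false_alarm:
        return "fix-monitoring"   # the check itself is wrong or too noisy
    if not review.user_critical:
        return "stop-paging"      # real, but not worth waking anyone for
    return "worth-the-call"


def summarise(reviews: list) -> Counter:
    """Count verdicts so a noisy alert shows up as a trend, not an anecdote."""
    return Counter(verdict(r) for r in reviews)


# Invented sample data, for illustration only.
reviews = [
    PageReview("site-unreachable", user_critical=True, false_alarm=False),
    PageReview("cert-expiry-check", user_critical=False, false_alarm=True),
    PageReview("batch-job-slow", user_critical=False, false_alarm=False),
]
summary = summarise(reviews)
```

A tally like this gives a team evidence for retiring or retuning alerts that repeatedly wake people for no user benefit.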