How to succeed at site reliability engineering

Adoption of site reliability engineering is growing in Asia-Pacific, but organisations need the right structure and culture to succeed

The adoption of site reliability engineering (SRE) is growing in the Asia-Pacific (APAC) region, but organisations that succeed at it are few and far between, according to an industry expert.

Michael Ewald, director of engineering for APAC at Contino, a DevOps and cloud transformation consultancy, said that in his experience, most organisations simply read Google’s seminal SRE playbooks without fully understanding the principles behind the practices.

Conceived by Google in 2003 to keep its websites running reliably, SRE is “what happens when you ask a software engineer to design an operations team”, according to Ben Treynor, vice-president of engineering at Google and founder of Google SRE.

SRE principles are encompassed in a set of practices that apply automation, performance monitoring and incident response capabilities to software development and operations, with a blameless post-mortem culture that discourages finger-pointing when things go wrong.

Ewald said getting SRE right requires organisations to understand the fundamentals of DevOps, from which SRE emerged.

“DevOps is really about cultural transformation that comes with a set of principles,” he said. “Google picked up those principles and put engineering standards and a practice to encompass how those principles should be manifested in day-to-day operations.”

Ewald said SRE often starts with two moves: forming cross-functional teams to break down the silos between operations and development, and raising engineering standards by hiring engineers who not only code well but also understand the full lifecycle of software development and deployment.

But that does not necessarily mean having a single DevOps team, which, in Ewald’s experience, does not work out in large organisations. “As you’re building more code, it’s not practical to maintain 30 to 50 code bases,” he said. “You need to hand it over to an operations team.”

Ewald said that in SRE, operations teams, guided by ITIL (IT Infrastructure Library) practices, need to evolve with the new cloud operating model. This means keeping track of specific performance metrics – such as latency and throughput – that matter to the business, rather than measuring every single indicator.
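As a rough illustration of tracking a small set of business-relevant metrics, the sketch below computes 95th-percentile latency and throughput for a window of requests. It is a minimal example under stated assumptions, not anything Ewald or Contino described: the `Request` record, the `summarise` helper, the 95th-percentile choice and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request; latency_ms is a hypothetical log field."""
    latency_ms: float


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarise(requests, window_seconds):
    """Report only the two metrics assumed to matter to the business."""
    latencies = [r.latency_ms for r in requests]
    return {
        "p95_latency_ms": percentile(latencies, 95),
        "throughput_rps": len(requests) / window_seconds,
    }


# Assumed sample: 100 requests with latencies of 1..100 ms over 50 seconds.
sample = [Request(latency_ms=float(ms)) for ms in range(1, 101)]
report = summarise(sample, window_seconds=50)
```

In practice these figures would usually come from a monitoring system rather than hand-rolled code; the point is only that a handful of well-chosen indicators can be derived from the same request data.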

One key characteristic of SRE teams is that they comprise versatile engineers who are proficient in both software development and operations, enabling them to move between the two functions easily.

But that is easier said than done. Recalling his experience at the Commonwealth Bank of Australia, where he took on a role to manage the bank’s IT operations, Ewald said he faced challenges in getting his software developers to be more operations-minded and his operations engineers to be more development-minded.

“They are very focused individuals and have unique skillsets, so your organisational structure needs to adapt to that,” he said. “You can get software engineers to be more operations-aware, but can you get to the point where you can transition them into operations engineers? Those are in the minority.”

As such, Ewald said building a good SRE practice is difficult from a talent perspective, because organisations that try to hire the cream of the crop must compete against the likes of Amazon, Google, Microsoft and Facebook for top talent. Instead, he advised organisations to build and train their own teams and play to their strengths.

Then there is also the challenge of securing executive buy-in. Ewald said that even if business leaders understand the value that SRE brings, they could be taken aback by the cost of introducing more automation and managing their technical debt, especially at a time when many are still grappling with the economic fallout from the Covid-19 pandemic.

“You need a huge investment around observability and automation, so when those costs go to the senior managers, they see big dollars to transform,” said Ewald. “It’s not about their desire to move, it’s more about whether they can afford to move.”

Ewald agreed that organisations that do a better job of aligning business and IT through key performance indicators, such as net promoter scores and service levels, will have a better chance of success with SRE.

“That’s the beauty of SRE – you have a business outcome which gets translated into service-level agreements [SLAs],” he said. “Then, that gets drilled down into service-level objectives and performance measurements to meet those SLAs.”
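The drill-down Ewald describes, from SLA to service-level objective to measurement, can be sketched in a few lines. This is an illustrative example only: the `ServiceLevels` type, the `assess` function and the 99.5% and 99.9% targets are assumptions, with the internal objective set tighter than the external agreement so that a missed objective gives warning before the agreement is breached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevels:
    sla_target: float  # commitment made to the business (assumed value)
    slo_target: float  # stricter internal objective (assumed value)


def availability(good_requests, total_requests):
    """Measured fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has failed
    return good_requests / total_requests


def assess(levels, measured):
    """Compare a measurement with the objective first, then the agreement."""
    if measured >= levels.slo_target:
        return "meeting SLO"
    if measured >= levels.sla_target:
        return "SLO missed, SLA still met"
    return "SLA breached"


levels = ServiceLevels(sla_target=0.995, slo_target=0.999)
status = assess(levels, availability(99_700, 100_000))
```

Here 99.7% measured availability falls short of the internal objective while still honouring the agreement, which is the kind of early signal the tighter target is meant to provide.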
