Google Cloud Next '18: What datacentre operators can learn from how Google SRE teams operate

To coincide with the first day of the Google Cloud Next 2018 conference (taking place from 24-26 July) in San Francisco, John Jainschigg, content strategy lead at enterprise systems monitoring software provider Opsview, shares his views on what datacentre operators can learn from the search giant’s site reliability engineers.

The noughties witnessed many experimental breakthroughs in technology, from the introduction of the iPod to the launch of YouTube. This era also saw a fresh-faced Google, embarking on a quest to expand its portfolio of services beyond search. Much like any highly ambitious, innovative technology initiative, the firm encountered a number of challenges along the way.

In response, Google began evolving a discipline called Site Reliability Engineering (SRE), about which it published a very useful and fascinating book in 2016. SRE and DevOps share a lot of conceptual DNA and an increasing amount of practical DNA. This is particularly true now that cloud software and tooling have evolved to let ambitious folks begin emulating parts of Google’s infrastructure using open source software like Kubernetes.

Google has used the statement “class SRE implements DevOps” to title a new (and growing) video playlist by Liz Fong-Jones and Seth Vargo of Google Cloud Platform, showing how and where these disciplines connect, while nudging DevOps practitioners to consider some key SRE insights, including the following.

  • The normality of failure: It is near impossible to deliver 100% uptime for a service. Expecting it is therefore expensive, and pointless too, because error rates among your service’s dependencies mask any availability you achieve beyond theirs.
  • Ensure your organisation agrees on its Service Level Indicators (SLIs) and Objectives (SLOs): Since failure is normal, you need to agree across your entire organisation what availability means; what specific metrics are relevant in determining availability (SLIs); and what acceptable availability looks like, numerically, in terms of these metrics (SLOs).
  • Create an ‘error budget’ using agreed-upon SLOs: SLOs are used to define what SREs call the “error budget”, which is a numeric line in the sand (such as minutes of service downtime acceptable per month). The error budget is used to encourage collective ownership of service availability and to resolve disputes about balancing risk and stability without blame. For example, if programmers are releasing risky new features too frequently, compromising availability, this will deplete the error budget. SREs can point to the at-risk error budget, and argue for halting releases and refocusing coders on efforts to improve system resilience.

The error budget point is important because it lets the organisation as a whole effectively balance speed/risk with stability. Paying attention to this economy encourages investment in strategies that accelerate the business while minimising risk: writing error- and chaos-tolerant apps, automating away pointless toil, advancing by means of small changes, and evaluating ‘canary’ deployments before proceeding with full releases.
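
As a rough illustration of the arithmetic behind an error budget, the sketch below converts an availability SLO into minutes of allowable downtime over a window, then checks how much of that allowance remains. The 30-day window, the 99.9% target in the usage note and all function names are assumptions chosen for the example; they are not figures or code from Google.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over the window.

    `slo` is a fraction, e.g. 0.999 for "99.9% available" (assumed form).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after subtracting downtime already recorded."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def releases_should_pause(slo: float, downtime_minutes: float,
                          window_days: int = 30) -> bool:
    """Hypothetical policy: pause releases once the budget is spent."""
    return budget_remaining(slo, downtime_minutes, window_days) <= 0
```

Under these assumptions, a 99.9% SLO over 30 days allows roughly 43 minutes of downtime, so `releases_should_pause(0.999, 50.0)` would signal that the budget is exhausted, while 10 minutes of recorded downtime would leave most of it intact. Real policies are usually more nuanced than a single threshold.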

Monitoring systems are key to making this whole, elegant tranche of DevOps/SRE discipline work. It’s important to note (because Google isn’t running your datacentre) that this has nothing to do with what kind of technologies you’re monitoring, with the processes you’re wrangling, or with the specific techniques you might apply to stay above your SLOs. In short, it makes just as much sense to apply SRE metrics discipline to conventional enterprise systems as it does to twelve-factor apps running on container orchestration.

So with that in mind, these are the main things that Google SRE can teach you about monitoring:

  • Do not over-alert the user: Alert exhaustion is a real thing, and paging a human is an expensive use of an employee’s time.
  • Be smart by efficiently deploying monitoring experts: Google SRE teams with a dozen or so members typically employ one or two monitoring specialists. But they don’t busy these experts by having them stare at real-time charts and graphs to spot problems: that’s a kind of work SREs call ‘toil’ — they think it’s ineffective and they know it doesn’t scale.
  • Clear, real-time analysis, with no smoke and mirrors: Google SREs like simple, fast monitoring systems that help them quickly figure out why problems occurred, after they occurred. They don’t trust magic solutions that try to automate root-cause analysis, and they try to keep alerting rules in general as simple as possible, without complex dependency hierarchies, except for (rare) parts of their systems that are in very stable, unambiguous states.
  • The value of far-reaching “white box” monitoring: Google likes to perform deeply introspective monitoring of target systems grouped by application. Viewing related metrics from all systems (e.g., databases, web servers) supporting an application lets them identify root causes with less ambiguity (for example, is the database really slow, or is there a problem on the network link between it and the web host?).
  • Latency, traffic/demand, errors, and saturation: Part of the point of monitoring is communication, and Google SREs strongly favour building SLOs (and SLAs) on small groups of related, easily-understood SLI metrics. As such, it is believed that measuring “four golden signals” – latency, traffic/demand, errors, and saturation – can help pinpoint most problems, even in complex systems such as container orchestrators with limited workload visibility. It’s important to note, however, that this austere schematic doesn’t automatically confer simplicity, as some monitoring makers have suggested. Google notes that ‘errors’ are intrinsically hugely diverse, and range from easy to almost impossible to trap; and they note that ‘saturation’ often depends on monitoring constrained resources (e.g., CPU capacity, RAM, etc.) and carefully testing hypotheses about the levels at which utilisation becomes problematic.
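
To make the four golden signals concrete, here is a minimal sketch that summarises a window of request records into latency, traffic, error rate and saturation. The `Request` and `GoldenSignals` types, the choice of a nearest-rank 99th percentile for latency, and passing saturation in as a pre-measured utilisation figure are all assumptions made for illustration; this is not how Google or any particular monitoring product computes them.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # False if the request ended in an error


@dataclass(frozen=True)
class GoldenSignals:
    latency_p99_ms: Optional[float]  # None when there were no requests
    traffic_rps: float               # requests per second in the window
    error_rate: float                # fraction of requests that failed
    saturation: float                # e.g. CPU utilisation, 0.0 to 1.0


def nearest_rank_percentile(sorted_values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile of a non-empty, ascending sequence."""
    rank = max(1, (p * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


def golden_signals(requests: Sequence[Request], window_seconds: float,
                   saturation: float) -> GoldenSignals:
    """Summarise one observation window into the four signals."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    count = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    errors = sum(1 for r in requests if not r.ok)
    return GoldenSignals(
        latency_p99_ms=nearest_rank_percentile(latencies, 99) if latencies else None,
        traffic_rps=count / window_seconds,
        error_rate=errors / count if count else 0.0,
        saturation=saturation,
    )
```

The sketch treats an error as a simple boolean per request, which glosses over exactly the diversity of failure modes noted above, and it leaves the hard part of saturation (deciding which resource to measure and at what level it becomes a problem) to the caller.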

Ultimately, an effective DevOps monitoring system must entail far more than do-it-yourself toolkits. While versatility and configurability are essential, more important is the ability of a mature monitoring solution to provide distilled operational intelligence about specific systems and services under observation, along with the ability to group and visualise these systems collectively, as business services.
