Powering the cloud: Unlocking the secrets of Google datacentres

At the Google Cloud Next conference in San Francisco, the internet search giant opened up about how it secures, operates and stress-tests its growing cloud datacentre fleet

This article can also be found in the Premium Editorial Download: Computer Weekly: Google power – how the internet giant builds datacentres

The sheer number of users and services the datacentres of the hyperscale cloud giants have to support has prompted many providers to rip up the design rulebook on how to kit out and connect the huge number of facilities they operate around the world.

Instead of building singular, standalone facilities that are backed up to a datacentre at another location, they favour the creation of multiple, huge, campus-like server farms that are devoid of any single points of failure to guard against downtime.

Hyperscale operators often opt for custom-made hardware designed with specific workloads in mind, which is bought in huge quantities to ensure – as more users flock to their services – they have seemingly infinite capacity to cope with the demand.

Over the course of several days at the Google Cloud Next conference in San Francisco, the internet search giant shared a number of candid insights about the work that goes into ensuring its own datacentres are run in a sustainable, efficient, resilient, secure and fast-performing way.

Google’s work around datacentre sustainability is well documented, with 2017 already pegged as the year it will hit its 100% renewable energy usage pledge for its datacentre estate.

As previously reported by Computer Weekly, the company has also recently opened up about how it is drawing on the artificial intelligence expertise of its DeepMind division to cut the Power Usage Effectiveness (PUE) scores of its datacentre fleets.

Another commitment it is in the throes of delivering on is a promise to open one new datacentre region a month throughout 2017. At Google Cloud Next, it also announced plans for additional builds in the Netherlands, Canada and California during this year and 2018.

By the time these are complete, the company will have 16 geographic datacentre regions in play across the globe, made up of around 50 availability zones, as well as more than 100 points of presence.

Hosting consumer-focused services

As well as underpinning the Google Cloud Platform (GCP) and its business productivity suite, G Suite, these datacentres host the company’s consumer-focused services, such as search and YouTube, which form the backbone of almost every web user’s internet experience.

For this reason, the company’s datacentre infrastructure is designed to ensure users can be as productive as possible at all times, said Urs Hölzle, senior vice-president for technology infrastructure at Google Cloud, during the event’s second-day keynote.

“We designed every element of our infrastructure so you could be uniquely productive and enjoy the performance we created,” he said.

“You have to optimise every single element. From efficient datacentres to custom servers, to custom networking gear to a software-defined global backbone, to specialised application-specific integrated circuits (ASICs) for machine learning.”

The company has invested $30bn over the last three years to build a resilient and responsive infrastructure, which is underpinned by huge networking capacity.

“Analysts put our network’s traffic at between 25-40% of global internet user traffic. As a GCP or G Suite customer, you benefit from this network because your traffic travels on our private, ultra-high speed backbone for minimum latency,” said Hölzle.

“To carry this traffic to pretty much everywhere in the world, we also need to cross oceans. Nine years ago, Google became the first non-telco to build an undersea cable. That was US to Japan, and since then we’ve built or acquired submarine fibre capacity pretty much anywhere in the world, so we have a redundant backbone to almost any place.”

Driving up hardware performance

Joe Kava, the Google vice-president for datacentres, presented a session on the penultimate day of the show that offered attendees a look behind the scenes at how the company builds its server farms.

While it would be logical to assume the company must take a one-size-fits-all approach to datacentre builds, the truth could not be any more different, with each datacentre location hugely influencing the design and setup.

“We’ve pioneered and developed advancements in water-based cooling systems, such as seawater cooling, recycled grey water cooling, storm water capture and reuse, rainwater harvesting, industrial canal water use and thermal energy storage,” said Kava.

“We’ve also designed datacentres that don’t require any water for their cooling at all. Instead, they’re cooled with 100% outside, fresh air. The point is there is no one-size-fits-all model here.

“All of our datacentre designs are custom made for their specific regions to get the best efficiency,” he added.

Like many other hyperscale cloud firms, the company favours the use of custom-built hardware for cost and performance reasons, with Kava alluding to the fact that without doing so the company would struggle to meet user demand for its services.

“Nearly all our infrastructure is custom-designed and purpose-built for our own computing needs, all working in conjunction and optimised to provide the highest performance, at the lowest total cost of ownership computing anywhere,” he said.

“Our servers don’t have any unnecessary components, such as video cards, chipsets or peripheral connectors, which can introduce vulnerabilities, and our production servers run a custom-designed and stripped down version of Linux. Our servers and operating system are designed for the sole purpose of powering Google services only.”

As touched upon during the second-day keynote, the company is also the first cloud provider in the world to deploy Intel’s Xeon processors codenamed Skylake in its infrastructure, with Hölzle hailing the move as a show of the company’s commitment to performance improvements.

“We’re pushing the envelope in so many directions on performance, which means we have to work very differently, and Skylake offers great performance for compute-intensive workloads,” he said.

Customised cloud infrastructures

When he first joined the company nine years ago, Kava admitted to feeling perplexed as to why the company needed such a high degree of customisation in its infrastructure to deliver its services.

“I soon learned we go through such extraordinary effort because what we needed at our scale didn’t exist when we started,” he said.

“In order to achieve the performance, efficiency and price targets, we had to build our own servers and develop and create the hardware, software and culture of reliability to make Google successful.”

Since March 2016, Google has been actively involved with the Facebook-backed Open Compute Project (OCP) initiative, and has contributed designs for the 48-volt rack systems it uses to kit out its sites.

“We also invest a lot in robotic innovation in our datacentres. Each of our datacentres has fully automated disk erase environments that allows for faster, higher throughput, more efficient and better inventory management,” he added.

That is not to say that human beings do not have a role to play in keeping things ticking over in the Google datacentre estate, said Kava, as the organisation has 24-hour support on hand at each one.

“We have our own team of Googlers, who have been intimately involved from the design, through construction, commissioning and operations. They’re the best and brightest engineers and operation professionals available anywhere,” he said.

“Many of them have come from mission-critical environments, like the navy nuclear submarine programme, where mistakes can be catastrophic. They understand mission critical.”

Given the proximity of the event to Amazon Web Services' (AWS) high-profile Simple Storage Service (S3) outage at the end of February 2017, the cause of which was an engineering input error, Kava was also keen to point out how impervious Google’s infrastructure is to human error.

“On the infrastructure side, the industry norm is that human error accounts for the overwhelming majority of incidents,” he said.

“Because of our designs and highly qualified staff, only a small fraction of issues are related to human errors, and of those, none of them has ever caused downtime in our datacentres.”

Locking down the datacentre

Whenever naysayers call into question the security offered by public cloud companies, providers commonly respond by comparing the financial and staffing resources at their disposal with those of a smaller, everyday enterprise organisation.

It is an approach Google has pursued in the past, and one Hölzle reinforced during the keynote, where he revealed that one datacentre campus the company operates has 175 security guards on duty 24 hours a day, seven days a week.

This, in turn, is backed by cameras, motion sensors, iris scanners and laser-based intrusion detection systems, which are all designed to keep out people who should not be there.

This commitment to keeping people out extends to the physical hardware too, with Hölzle using the keynote to debut Google’s Titan chip, which is fitted in all of the firm’s new datacentre servers.

“We put a security chip on all our new machines to serve as the basis of trust for that machine’s identity. This chip is designed by Google, and helps protect servers from tampering, even at the BIOS level,” he said.

“It helps us authenticate hardware, and on top of that, helps us authenticate the services, as when they call each other, they must mutually prove their ID to each other.”

The company also has a novel way of ensuring its security defences are up to the job, Kava revealed during a Q&A at the end of his session.

This sees the firm covertly recruit existing Google employees, and task them with breaking its datacentre security defences to ensure they can withstand insider threats.

“If anyone knows where the weaknesses are and how to exploit them, it’s your own employees. They don’t tell any of their colleagues they were recruited, and they try to do things you’re not supposed to be able to do,” he said.

“If ever there is a vulnerability exposed because of that, it’s corrected around the world,” said Kava. “If there are never any new exploits found, maybe enough is enough. We haven’t got to that point yet. There is always something more we can do.”