
Coronavirus: Microsoft offers behind-the-scenes look at datacentre-level Covid-19 response

Microsoft reveals how the pandemic forced it to adopt a 24-hour hardware deployment schedule in its datacentres to keep up with the surge in demand for services across its cloud portfolio

Microsoft claims the onset of the Covid-19 coronavirus pandemic required a 24-hour deployment schedule for new hardware racks in its hardest-hit datacentres, as it rushed to meet the surge in demand from enterprises and consumers for its cloud services.

The software giant has shared details in a series of blog posts about the challenges it has faced in recent weeks as it sought to bring additional datacentre capacity online while abiding by social distancing rules, as the number of users flocking to its services has skyrocketed.

Microsoft Azure, Office 365 and its collaboration and online meeting platform, Teams, experienced sudden spikes in usage and demand as countries across the world entered lockdown, forcing businesses to embrace remote working practices on a global scale.

According to Microsoft’s own data, the pandemic resulted in a new daily record of 2.7 billion meeting minutes on Teams. In April, that record was beaten when 4.1 billion minutes of meetings took place in a single day.

To ensure it had sufficient capacity to cope, the company established a remote team during the early weeks of the pandemic that was tasked with collecting and interpreting telemetry data on how usage patterns for Teams changed as the pandemic hit China and Italy.

“As more countries went into lockdown, dozens of Redmond-based Microsoft employees gathered remotely each Sunday night to watch telemetry, look for bottlenecks and troubleshoot as unprecedented numbers of remote European workers began logging in first thing Monday morning,” said the blog post.
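The blog post does not describe the tooling behind this telemetry watch, but the idea it outlines can be sketched in a few lines: compare per-region login rates against regional capacity and flag regions approaching a threshold. The region names, figures and threshold below are purely illustrative assumptions, not Microsoft's data.

```python
# Hypothetical sketch of a telemetry bottleneck check: flag any region
# whose login rate exceeds a fraction of its serving capacity.
# All names, rates and thresholds are illustrative.

def find_bottlenecks(logins_per_min, capacity_per_min, threshold=0.8):
    """Return regions whose login rate is at or above the given
    fraction of that region's capacity."""
    return [
        region
        for region, rate in logins_per_min.items()
        if rate >= threshold * capacity_per_min[region]
    ]

# Illustrative Monday-morning figures for three European regions
logins = {"europe-west": 92_000, "europe-north": 41_000, "uk-south": 78_000}
capacity = {"europe-west": 100_000, "europe-north": 90_000, "uk-south": 95_000}

print(find_bottlenecks(logins, capacity))  # ['europe-west', 'uk-south']
```

A real pipeline would stream these rates continuously and page engineers when a region crossed the line, but the comparison at its core is this simple.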

On the consumer side, Microsoft-owned gaming platform Xbox also experienced a 50% uplift in the number of players taking part in multiplayer games and a 30% increase in peak levels of concurrent usage. Such spikes in usage are typically seasonal or prompted by the release of new games.

Xbox runs on the Microsoft Azure public cloud platform too, and in response, the company began shifting gaming workloads out of its Azure datacentre regions in the UK and Asia, where demand for capacity was particularly high, to protect against any degradation in gaming experience.

“There’s no question that in those regions, the people who were on the front lines of the Covid-19 efforts really needed that capacity more than us,” said Casey Jacobs, who oversees reliability for Xbox operations, in one of the blog posts. “And our telemetry gave us confidence that we could make these trade-offs while protecting our customer experience.”

Although such steps would have gone some way to freeing up capacity in high-demand areas, Microsoft was dogged by reports during the early weeks of the pandemic of capacity shortfalls within its European datacentres.

In response, the company rolled out metering measures, whereby access to available cloud datacentre capacity was prioritised for mission-critical user groups and existing customers.
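Microsoft has not published how this metering worked, but the policy it describes can be sketched as a simple admission check: when free capacity falls below a reserve, only mission-critical tenants and existing customers get new resources. Every name and number below is an assumption for illustration, not Microsoft's implementation.

```python
# Minimal sketch of capacity metering: in a constrained region,
# admit new allocations only for prioritised tenants.
# Tenant fields, core counts and the reserve size are illustrative.

def grant_capacity(tenant, requested_cores, free_cores, reserve_cores=1000):
    """Grant a request unless it would eat into the reserve and the
    tenant is neither mission-critical nor an existing customer."""
    constrained = free_cores - requested_cores < reserve_cores
    if constrained and not (tenant["mission_critical"] or tenant["existing"]):
        return False
    return requested_cores <= free_cores

hospital = {"mission_critical": True, "existing": False}
new_trial = {"mission_critical": False, "existing": False}

print(grant_capacity(hospital, 500, free_cores=1200))   # True
print(grant_capacity(new_trial, 500, free_cores=1200))  # False
```

In practice such prioritisation would sit in the control plane behind quota APIs, but the effect on a constrained region is the same: the hospital's request lands, the speculative trial waits.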

It also confirmed in early April that it was working to “expedite” the creation of additional datacentre capacity in countries and geographic regions where demand for Azure, Office and Teams was acutely high.

The blog posts go into some detail about how Microsoft set about achieving that against a backdrop of social distancing, with the company revealing that this work has required adopting a round-the-clock deployment schedule for datacentre hardware within some of its sites.

“The company began adding new servers to the hardest-hit regions and installing new hardware racks 24 hours a day,” said the blog post, with workers required to remain at least six feet apart for social distancing reasons as the work was performed.


Workers were also issued with protective equipment, and Microsoft’s datacentres were subjected to a strict disinfectant-focused cleaning regime to keep employees safe.

“Microsoft product teams worked to find any further efficiencies to free up Azure resources for other customers, and the company doubled capacity on one of its own undersea cables carrying data across the Atlantic and negotiated with owners of another to open up additional capacity,” said the blog post.

“Network engineers installed new hardware and tripled the deployed capacity on the America Europe Connect cable in just two weeks.”

In a similar vein, the Microsoft Azure Wide Area Network team added 110 terabits of capacity in two months to the fibre optic network that is responsible for transporting Microsoft’s own data around the globe.

“Microsoft also moved internal Azure workloads to avoid demand peaks in different parts of the world and to divert traffic from datacentre regions experiencing high demand,” said the blog post.

This work served to ensure that critical services for Microsoft customers could scale as needed without affecting their stability, while allowing the firm’s cloud engineering teams to pinpoint areas within its infrastructure where efficiency improvements could be made, so computing resources could be freed up even further.

“In Teams, small tweaks that few customers would notice – lengthening the time it takes for the three dots to appear when someone else is typing in a Teams chat or disabling the feature that suggests a contact every time you type a new letter in the ‘to’ field – made the system run much more leanly,” said the blog post.
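One of the tweaks quoted above, lengthening the delay before the typing indicator appears, works because "user is typing" events are throttled on the client: the longer the interval between events, the fewer messages cross the wire. The sketch below illustrates that relationship with assumed class names and timings; it is not Teams code.

```python
# Illustrative sketch of typing-indicator throttling: a keystroke only
# triggers a broadcast if the throttle interval has elapsed, so
# lengthening the interval directly cuts signalling traffic.
# Class name, interval and timings are assumptions.

import time

class TypingIndicator:
    def __init__(self, min_interval_s=2.0):
        self.min_interval_s = min_interval_s  # longer interval -> fewer events
        self._last_sent = float("-inf")

    def on_keystroke(self, now=None):
        """Return True (and 'send' an event) only if the throttle
        interval has elapsed since the last event."""
        now = time.monotonic() if now is None else now
        if now - self._last_sent >= self.min_interval_s:
            self._last_sent = now
            return True   # event would be broadcast to the chat
        return False      # suppressed, saving a round trip

# 10 keystrokes spread over 5 seconds: a 2-second interval
# reduces 10 potential events to 3 actual broadcasts
ind = TypingIndicator(min_interval_s=2.0)
sent = sum(ind.on_keystroke(now=t * 0.5) for t in range(10))
print(sent)  # 3
```

Multiplied across hundreds of millions of daily chat sessions, shaving events like these is exactly the kind of small tweak that frees meaningful server and network capacity without most users noticing.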

“Engineers rewrote the code to make video stream processing 10 times more efficient, in a marathon push over a weekend.”

The pace at which some of these changes occurred is notable, with updates that would traditionally have taken months to introduce being delivered within a week, said Microsoft.

This includes changes to the way that reserved capacity for Teams is metered out, with surplus space now spread out across many more datacentre regions.

“We would have probably said [in the past], ‘Let’s hit this region first and maybe then this one’ and there would have been a lot of debate,” said Aarthi Natarajan, partner director of engineering for Teams. “In this case, we said, ‘No one is debating anything. We are doing this and we’re doing it globally right now’.”
