Gmail outage: CIOs must be prepared for the unexpected

Even class-leading cloud services can and will fail. How can IT departments build in resilience to cloud computing outages?

This article can also be found in the Premium Editorial Download: Computer Weekly: IT for talent management

The recent outage of Google's Gmail service shows that even class-leading cloud services can and will fail.

How can IT departments build in resilience to cloud computing failures?

Google has been winning business from Microsoft by targeting the huge costs associated with running on-premise Microsoft Exchange servers. Analyst Forrester's Forrsights Hardware Survey for North America and Europe showed that cloud adoption increased from 9% to 46% between 2009 and 2012.

Cloud adoption is set to grow further as CIOs move more of their traditional IT spending away from capital expenditure to software as a service in the cloud, paid per user as an annual or monthly subscription. As cloud adoption increases, more IT departments will need a battle plan in place for cloud service outages.

Resilient but fragile

Cloud-based services, especially those with the global scale of Gmail, are seen as almost 100% reliable. In fact, the streamlined operational efficiency required to run public cloud services means they can offer an enterprise extremely high levels of availability at a fraction of what it would cost a CIO to run a traditional datacentre.

Given the scale of Google, a small problem can quickly escalate. The company experienced a major problem causing its dual redundant network to fail. The impact of the failure was that some users were unable to download email attachments and experienced delays of up to two hours.

Google took almost 12 hours to completely resolve the issue. The Gmail network team restored some of the lost network capacity and worked to repurpose additional capacity to clear the accumulated message backlog.

The failure and its resolution highlight the fragility of relying on a single cloud service provider, especially if the service is business critical. In the Forrester report, The 15 most important questions to ask your cloud identity and access management provider, analyst Andras Cser recommends IT managers ask their cloud providers how they ensure the confidentiality, integrity, and availability of customer data. This question is not limited to cloud identity and access management. Every cloud provider should have a robust response.

Following the outage that caused Gmail attachments to be delayed by up to two hours, Google sees the need to change its operating procedures and is now revamping its network and disaster recovery processes. 

While a dual redundant network is unlikely to fail, the Gmail outage is a classic example of Murphy's Law - if it can fail, it will - and in Google's case the effect of even a minor network outage can be catastrophic.

The company's senior site reliability engineering manager, Sabrina Farmer, wrote in a blog post: “We're taking steps to ensure that there is sufficient network capacity, including backup capacity for Gmail, even in the event of a rare dual network failure. We also plan to make changes to make Gmail message delivery more resilient to a network capacity shortfall in the unlikely event that one occurs in the future. Finally, we’re updating our internal practices so that we can more quickly and effectively respond to network issues.”

Google's outage, and the recent Amazon datacentre outage in August, are timely reminders to IT directors that major cloud services do fail. Instagram, Netflix, Twitter's Vine video-sharing application and holiday site Airbnb were among the services that were slow or inaccessible due to the problem affecting Amazon Web Services.

Like Google's, Amazon's outage has been traced to network issues. The root cause of Amazon's outage was the partial failure of a network device in a datacentre in northern Virginia.

Questions to ask a cloud provider

a) Who is responsible for data security, backups, and BC/DR?
b) Who owns your datacentres?
c) What happens if you miss SLAs? Is there a refund?
d) How do you make sure that you provide enough capacity and performance during peak usage?
Source: The 15 most important questions to ask your cloud identity and access management provider by Andras Cser

Cloud business continuity

Although his company was not affected by the Gmail outage, Mark Ridley, director of technology at a recruitment site, believes handling a cloud outage is the same as handling one in a service running on-premise. The company uses Gmail and Google Enterprise instead of Microsoft Office.

“In situations like this our response is very similar to how we would react if we were self-hosting on premise or outsourcing managed hosting. Having managed both over the years, I'm familiar with the spectre of 'rare events' occurring, despite every step possible being taken to prevent them,” said Ridley.

“Clearly, it's never okay to lose a critical service, but at the same time I tend towards being pragmatic about events like this. It could equally have happened on our watch - whether in-house or hosted - and if that were the case, it would be my team responding to fix issues.”

Even if it were technically feasible to deploy, running dual-redundant cloud email services, such as using both Google Enterprise for Gmail and Office 365 (which provides cloud-based Microsoft Exchange), would be prohibitively expensive.

If Gmail had failed catastrophically, Ridley would have relied on the company's Jive internet portal, phone calls and, as he puts it, “good, old-fashioned yelling across the office to brief the teams”.

Ridley did brief the IT operations teams, in case users put in support calls about slow email. If Gmail had been down for longer than a day, he said, he would have looked for an alternative.

1 comment

Everyone needs a good disaster recovery plan. That includes even little things like this, which may cause problems that nobody thought of in the initial plan. Your own system may be safe and stable, but an outside service going down can be just as costly.