
Cloudflare confirms outage caused by datacentre network configuration update error

Website application security firm Cloudflare has published a review into the cause of the short-lived outage that rendered hundreds of websites inaccessible

Cloudflare has confirmed that the short-lived outage that knocked hundreds of websites offline on Tuesday 21 June was caused by a planned network configuration change within 19 of its datacentres and was not the result of malicious activity.

As previously reported by Computer Weekly, a wide range of consumer-facing and enterprise-focused websites and online services were temporarily knocked offline during the downtime incident, which took just over an hour for the web application security company to resolve.

In a blog post published on the day of the outage, Cloudflare said the disruption resulted from a network configuration change rolled out to 19 of its datacentres as part of a broader body of work designed to increase the resiliency of its services at its “busiest locations”.

These facilities include several datacentres in North and South America, Europe and Asia-Pacific, which helps to explain why so many high-profile web properties and online services were affected by the outage.

“Over the last 18 months, Cloudflare has been working to convert all of our busiest locations to a more flexible and resilient architecture,” said the blog post. “In this time, we’ve converted 19 of our datacentres to this architecture.

“A critical part of this new architecture… is an added layer of routing that creates a mesh of connections. This mesh allows us to easily disable and enable parts of the internal network in a datacentre for maintenance or to deal with a problem.”

Although the new architecture has made Cloudflare’s datacentre networking more robust, the fact that these 19 datacentres carry a significant amount of the company’s traffic is also why the outage had such far-reaching effects, the blog added.

“This new architecture has provided us with significant reliability improvements, as well as allowing us to run maintenance in these locations without disrupting customer traffic,” it said.

“As these locations also carry a significant proportion of the Cloudflare traffic, any problem here can have a very wide impact, and unfortunately, that’s what happened today.”

In the wake of the incident, the company has identified several areas ripe for improvement to prevent it happening again, and “will continue to work on uncovering any other gaps that could cause a recurrence”, the blog post added.

“We are deeply sorry for the disruption to our customers and to all the users who were unable to access internet properties during the outage. We have already started working on [making] changes and will continue our diligence to ensure this cannot happen again,” it concluded.

