sdecoret -

Cloudflare outage: Firm blames global website outage on routine software update gone rogue

Content delivery provider traces outage back to failings that occurred during a routine update to its web application firewall service

Cloudflare has confirmed an outage that briefly left web users unable to access millions of internet sites across the globe on 2 July was caused by a rogue software update, and not a distributed denial of service (DDoS) attack against its network.

The outage resulted in web users encountering 502 “bad gateway” errors when trying to access sites that rely on Cloudflare’s content delivery and network security to keep them up and running.

This in turn prompted speculation that the firm had suffered a sizeable DDoS attack against its network, despite early assurances from various members of its senior management team that it was an internal infrastructure failing behind it all.

In a blog post released in the wake of the outage, the content delivery service’s chief technology officer, John Graham-Cumming, said the cause was a “bad software deploy” creating a “massive spike” in CPU utilisation across its network.

“Once rolled back, the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels,” wrote Graham-Cumming.

Several hours later, the firm released a more comprehensive drilldown into the specifics of the incident, where the “bad software deploy” was revealed to be a single misconfigured rule within the wider Cloudflare Web Application Firewall during a routine update procedure.

These updates are regularly undertaken by the Cloudflare team to protect its clients’ websites from new and emerging internet security threats. The process is reliant on the automatic roll out of new managed rules on a regular basis, and it is here where the outage originated.

“These rules were being deployed in a simulated mode where issues are identified and logged by the new rule, but no customer traffic is actually blocked so that we can measure false positive rates and ensure that the new rules do not cause problems when they are deployed into full production,” the blog post continued.

“Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide. This 100% CPU spike caused the 502 errors that our customers saw. At its worst, traffic dropped by 82%.”

The incident underpinning the outage lasted about 30 minutes in total, the blog confirmed, and the reason why the routine update went rogue this time is due to how the deployment was approached at scale.

“We make software deployments constantly across the network and have automated systems to run test suites and a procedure for deploying progressively to prevent incidents. Unfortunately, these [web application firewall] rules were deployed globally in one go and caused today’s outage,” the post states.

“We recognise that an incident like this is very painful for our customers. Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future.”

Read more about cloud outages

Read more on Datacentre disaster recovery and security

Data Center
Data Management