Cloudflare contrite after worst outage since 2019
Cloudflare CEO Matthew Prince apologises for the firm’s worst outage in years and shares details of how a change to database system permissions caused a cascading effect that brought down some of the web’s biggest names
Cloudflare co-founder and CEO Matthew Prince has described the incident on Tuesday 18 November that disrupted global internet traffic for hours as the organisation’s worst outage since 2019, saying the traffic management giant had not experienced an issue that stopped the majority of core traffic flowing through its network in more than six years.
“An outage like today is unacceptable. We’ve architected our systems to be highly resilient to failure to ensure traffic will always continue to flow. When we’ve had outages in the past, it’s always led to us building new, more resilient systems,” said Prince. “On behalf of the entire team at Cloudflare, I would like to apologise for the pain we caused the internet today.”
The Cloudflare outage began at 11.20am UTC (6.20am EST) on Tuesday when its network started experiencing significant failures to deliver core traffic, which manifested to ordinary web users as an error page indicating a Cloudflare network failure when they tried to access a customer site. The issue was triggered not by a cyber attack or malicious activity, but by a minor change affecting a file used by Cloudflare’s Bot Management security system.
Cloudflare Bot Management includes a machine learning model that generates bot “scores” for every request crossing the network – scores that customers use to allow or disallow bots from accessing their sites. The model relies on a feature configuration file to predict whether a request is automated, and because the bot landscape is so dynamic, that file is refreshed and pushed live every few minutes so that Cloudflare can react to new bots and attacks.
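In practice, customers consume these scores through firewall-style rules. The sketch below shows roughly how such a rule might act on a score; the threshold, path check and function name are illustrative assumptions, not Cloudflare’s actual implementation.

```python
# Minimal sketch, not Cloudflare's actual code: how a customer-side rule
# might act on the bot score produced by the Bot Management model.

BLOCK_THRESHOLD = 30  # hypothetical cut-off: lower scores = more likely automated


def decide(path: str, bot_score: int) -> str:
    """Allow or block a request based on its bot score and the path it targets."""
    if bot_score < BLOCK_THRESHOLD and path.startswith("/api/"):
        return "block"   # likely bot hitting a sensitive endpoint
    return "allow"       # human traffic, or automation the customer tolerates


print(decide("/api/login", bot_score=12))   # -> block
print(decide("/blog/post", bot_score=87))   # -> allow
```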
The outage originated from a change to database system permissions that caused the database to output duplicate entries into the feature configuration file. The file rapidly grew in size and was propagated to all the machines comprising Cloudflare’s network. These machines – which route traffic across the network – read the file to update the Bot Management system, but because their software has a limit on the size of the feature file, it failed when the larger-than-expected file arrived, causing the machines to crash.
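The failure mode can be pictured with a short sketch: a loader that preallocates space for a fixed number of features breaks outright when a duplicated file blows past the cap. The limit, names and error handling below are assumptions for illustration, not Cloudflare’s code.

```python
# Minimal sketch (all limits and names are assumptions): a feature loader
# with a hard, preallocated cap that fails outright when an oversized,
# duplicated file arrives instead of rejecting it gracefully.

MAX_FEATURES = 200  # hypothetical per-machine limit


def load_feature_file(lines: list[str]) -> list[str]:
    features = []
    for line in lines:
        if len(features) >= MAX_FEATURES:
            # Models the proxy software hitting its size limit and failing.
            raise RuntimeError("feature file exceeds preallocated limit")
        features.append(line.strip())
    return features


normal = [f"feature_{i}" for i in range(150)]
duplicated = normal * 2  # duplicate entries double the file size

load_feature_file(normal)  # within the limit: loads fine
try:
    load_feature_file(duplicated)
except RuntimeError as err:
    print(f"simulated crash: {err}")
```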
DDoS confusion
Prince said Cloudflare’s tech teams at first suspected they were seeing a hyperscale distributed-denial-of-service (DDoS) attack because of two factors. First, Cloudflare’s own status page, which is hosted off its infrastructure with no dependencies, coincidentally went down. Second, at the beginning of the outage period, Cloudflare saw brief periods of apparent system recovery.
This was not, however, the result of threat actor activity – rather, it was happening because the feature file was being generated every five minutes by a query running on a ClickHouse database cluster, which was itself in the process of being updated to improve permissions management.
The dodgy file was therefore only generated if the query ran on an updated part of the cluster, so every five minutes there was a chance of either normal or abnormal feature files being generated and propagated.
“This fluctuation made it unclear what was happening as the entire system would recover and then fail again as sometimes good, sometimes bad configuration files were distributed to our network,” said Prince. “Initially, this led us to believe this might be caused by an attack. Eventually, every ClickHouse node was generating the bad configuration file and the fluctuation stabilised in the failing state.”
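The flapping behaviour can be illustrated with a toy simulation: each cycle, the regeneration query lands on a random cluster node, and only nodes already updated for the new permissions produce the bad file, so bad files become more frequent as the rollout progresses. The node counts and probabilities below are invented for illustration.

```python
# Toy simulation (invented numbers): each cycle the feature file is
# regenerated by a query that lands on a random cluster node; only nodes
# already updated for the new permissions produce the oversized "bad" file.

import random

TOTAL_NODES = 10


def regenerate(updated_nodes: int) -> str:
    node = random.randrange(TOTAL_NODES)
    return "bad" if node < updated_nodes else "good"


random.seed(7)
for cycle, updated in enumerate(range(0, TOTAL_NODES + 1, 2)):
    print(f"cycle {cycle}: {updated}/{TOTAL_NODES} nodes updated -> {regenerate(updated)} file")

# Early on, good and bad files alternate (apparent recovery and relapse);
# once every node is updated, only bad files are produced and the failure sticks.
```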
These errors continued until the tech team was able to identify the issue and resolve it by stopping the generation and propagation of the bad feature file, manually inserting a “known good” file into the distribution queue, and then restarting the core proxy. With this done, traffic began returning to normal from 2.30pm UTC onwards, and the number of baseline errors on Cloudflare’s network dropped back to pre-incident levels about two-and-a-half hours later.
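The shape of that fix can be sketched as a simple runbook; the class, function and file names below are hypothetical, and the restart step is represented only as a logged action.

```python
# Minimal sketch (hypothetical names): the remediation sequence described
# above - stop producing new feature files, queue a known-good copy, and
# restart the core proxy so it reloads a valid configuration.

from dataclasses import dataclass, field


@dataclass
class FeaturePipeline:
    generating: bool = True
    queue: list[str] = field(default_factory=list)


def remediate(pipeline: FeaturePipeline, known_good: str) -> list[str]:
    actions = []
    pipeline.generating = False             # 1. stop generation and propagation
    actions.append("stopped feature file generation")
    pipeline.queue = [known_good]           # 2. manually queue a known-good file
    actions.append("queued known-good feature file")
    actions.append("restarted core proxy")  # 3. restart so proxies reload it
    return actions


print(remediate(FeaturePipeline(), known_good="bot_features_known_good.conf"))
```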
Risk and resilience
Although Cloudflare was not itself attacked by a threat actor, the outage is still a serious cyber risk issue with lessons to be learned not just at Cloudflare, but among all organisations, whether or not they are customers. It has exposed a deeper, systemic risk in that too much of the internet’s infrastructure rests on only a few shoulders.
Ryan Polk, policy director at US-based non-profit the Internet Society, said that market concentration among content delivery networks (CDNs) had steadily increased since 2020: “CDNs offer clear advantages – they improve reliability, reduce latency and lower transit demand. However, when too much internet traffic is concentrated within a few providers, these networks can become single points of failure that disrupt access to large parts of the internet.
“Organisations should assess the resilience of the services they rely on and examine their supply chains. Which systems and providers are critical to their operations? Where do single points of failure exist? Companies should explore ways to diversify, such as using multiple cloud, CDN or authentication providers to reduce risk and improve overall resilience.”
Martin Greenfield, CEO at Quod Orbis, a continuous monitoring platform, added: “When a single auto-generated configuration file can take major parts of the web offline, that’s not purely a Cloudflare issue but a fragility problem that has become baked into how organisations build their security stacks.
“Automation makes security scalable, but when automated configuration propagates instantly across a global network, it also scales failure. What’s missing in most organisations, and was clearly missing here, is automated assurance that validates those configurations before they go live. Automation without assurance is fragility at scale and relying on one vendor can’t stand up for an effective resilience strategy.”
For its part, Prince said Cloudflare will be taking steps to lessen the chances of such an issue cropping up again in the future. These include hardening the ingestion of Cloudflare-generated configuration files in the same way it would do for user-generated inputs, enabling global kill-switches for features, working to eliminate the ability for core dumps or error reports to overwhelm system resources, and reviewing failure modes for error conditions across all of its core proxy modules.
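What hardening the ingestion of internally generated configuration files might look like in practice is sketched below, with assumed size and entry limits and a fall-back to the last known-good file. It illustrates the principle of treating internal files like untrusted input, not Cloudflare’s planned implementation.

```python
# Minimal sketch (limits, names and schema are assumptions): treat an
# internally generated configuration file with the same suspicion as user
# input - check size, entry count and duplicates before propagation, and
# fall back to the last known-good file if validation fails.

MAX_BYTES = 1_000_000
MAX_ENTRIES = 200


def validate_feature_file(raw: str) -> list[str]:
    if len(raw.encode("utf-8")) > MAX_BYTES:
        raise ValueError("feature file too large")
    entries = [line.strip() for line in raw.splitlines() if line.strip()]
    if len(entries) > MAX_ENTRIES:
        raise ValueError("too many feature entries")
    if len(entries) != len(set(entries)):
        raise ValueError("duplicate feature entries")
    return entries


def safe_ingest(raw: str, last_known_good: list[str]) -> list[str]:
    try:
        return validate_feature_file(raw)
    except ValueError:
        return last_known_good  # reject the bad file rather than crash


print(safe_ingest("f1\nf2\nf2\n", last_known_good=["f1", "f2"]))  # duplicates -> falls back
```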
Read more about Cloudflare
- Publishers and other providers of creative content now have the option to block AI crawlers from accessing and scraping their intellectual property with new tools from Cloudflare.
- Cloudflare’s new suite helps businesses, developers and content creators deploy AI technology at scale safely and securely.
- Compare the key features of Cloudflare vs Amazon CloudFront to determine which of these two popular CDN services best meets your organisation’s content delivery needs.
