Triggered dormant bug sees Fastly CDN cut to the quick
Cloud provider acts quickly to restore content delivery network after software bug disrupts service for major firms spanning media, e-commerce and government websites, including Amazon, The Guardian and Gov.uk
Content delivery network (CDN) provider Fastly boasts that some of the world’s leading companies count on its services, but those businesses were left reeling yesterday by an internet outage caused by a hitherto undiscovered software bug.
The incident has its origins on 12 May, when Fastly began a software deployment that introduced a bug that could be triggered by what it called “specific” customer configuration under “specific” circumstances. These circumstances came to pass at 10.58 on 8 June, when a customer configuration that Fastly described as valid activated the bug and the Fastly infrastructure experienced the initial onset of global disruption. This caused 85% of the Fastly network to return errors and left many clients’ websites inaccessible.
These included the sites of a number of the world’s leading firms, such as Amazon, Twitch, Reddit, The Guardian, Boots, payments provider Stripe, content provider A&E, The Financial Times, The New York Times and – somewhat unfortunately on a day when many were seeking advice on Covid vaccines as restrictions for younger age groups were lifted – the UK government Gov.uk website.
The territories affected included Australia, the United Arab Emirates, Japan, Singapore, Chile, Argentina, Peru, Brazil, the UK, Ireland, Denmark, the Netherlands, Germany, Finland, Spain, Norway, Italy, Sweden, the US, Canada, France, Austria, South Africa and India.
Fastly said it detected the disruption within one minute, identified and isolated the cause, and disabled the configuration. It said its engineering team identified the customer configuration at 11.27 and affected sites began to recover at 11.36. According to Fastly technical data, the incident was officially mitigated at 13.25, at which time the company warned that customers might experience increased origin load and a lower cache hit ratio (CHR) as global services returned.
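To see why a lower CHR translates into heavier origin load, consider a rough sketch. The function names and the traffic figures below are illustrative assumptions, not Fastly data or a Fastly API: the point is only that every request a cache cannot answer falls through to the customer's origin server.

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of requests served from cache rather than from the origin."""
    total = hits + misses
    if total == 0:
        return 0.0
    return hits / total


def origin_requests(total_requests: int, chr_value: float) -> int:
    """Requests that fall through to the origin at a given cache hit ratio."""
    return round(total_requests * (1.0 - chr_value))


# Hypothetical figures for illustration only, not taken from the incident.
total = 1_000_000
warm = origin_requests(total, 0.95)  # caches fully populated
cold = origin_requests(total, 0.60)  # caches refilling after recovery
print(f"origin requests with warm caches: {warm}")
print(f"origin requests while refilling:  {cold}")
print(f"origin load multiplier:           {cold / warm:.1f}x")
```

Under these assumed numbers, a drop in CHR from 95% to 60% multiplies origin traffic several times over, which is the kind of effect the company's warning describes.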
The company said 95% of its network was operating as normal within 49 minutes of the outage. Once the immediate effects were mitigated, Fastly turned its attention to fixing the bug and communicating with customers. It created a permanent fix for the bug and began deploying it at 18.25 on the same day.
Commenting on the incident, Fastly senior vice-president of engineering and infrastructure Nick Rockwell conceded that the outage was “broad and severe” and that Fastly was “truly sorry for the impact to our customers and everyone who relies on them”.
After deploying the bug fix across its network as quickly and safely as possible, Fastly began conducting a post-mortem of the processes and practices it followed during the incident. Top of the agenda in this process will be to discover why the company did not detect the bug during its software quality assurance and testing processes and to evaluate ways to improve the company’s remediation time.
“We have been and will continue to innovate and invest in fundamental changes to the safety of our underlying platforms. Broadly, this means fully leveraging the isolation capabilities of WebAssembly and Compute@Edge to build greater resiliency from the ground up. We’ll continue to update our community as we make progress toward this goal,” said Rockwell.
“Even though there were specific conditions that triggered this outage, we should have anticipated it. We provide mission-critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority. We apologise to our customers and those who rely on them for the outage and sincerely thank the community for its support,” he added.
Not surprisingly, industry leaders were quick off the mark to comment on the disruption even before the official cause was known.
Stephen Gilderdale, senior director and head of pre-sales architects UK at Dell Technologies, said the sudden nature of the outage showed that, with cloud services now underpinning so much of the network, even providers that guarantee 99.999% availability can see the remaining 0.001% bring swathes of the internet to a halt. Yet he observed that, far from being a cause for concern, the speed of the recovery actually demonstrated the network’s resilience.
“One of the great strengths of the big cloud providers is that they can and do guarantee the five nines – 99.999% – to their customers. Even when outages do occur, it’s usually only for a very short amount of time,” he added. “Cloud providers build in redundancies for such events to give their users secure access to replicated copies of data. In most cases, services are only affected for a short time, and data is easily retrievable.”
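For a sense of scale, the “five nines” figure can be turned into a downtime budget. The sketch below is simple arithmetic under stated assumptions (a 365-day year, availability measured over the whole period); the function name is hypothetical and this is not how any particular provider defines or measures its service-level agreement.

```python
def downtime_budget_minutes(availability_percent: float,
                            period_days: float = 365.0) -> float:
    """Minutes of downtime allowed in a period at a given availability."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    minutes_in_period = period_days * 24 * 60
    return unavailable_fraction * minutes_in_period


# Compare a few common availability targets over an assumed 365-day year.
for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% availability allows about {budget:.2f} minutes per year")
```

On these assumptions, 99.999% availability leaves a little over five minutes of downtime a year, which puts the 49 minutes Fastly reported for 95% of its network to return to normal into context.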
Adam Leon Smith, a software testing expert with BCS, The Chartered Institute for IT, concurred, but pointed out that network complexity was a growing issue.
“The affected sites will likely be restored quickly. Outages with content delivery networks highlight the growing ecosystem of complex and coupled components that are involved in delivering internet services. Because of this, outages are increasingly hitting multiple sites and services at the same time,” he said.