The financial fallout from datacentre outages is continuing to grow year on year (YoY), despite operators admitting that most downtime incidents could be avoided if they were to invest more in the resiliency of their facilities.
That’s according to the Uptime Institute’s 10th annual datacentre survey, which features responses compiled from 846 server farm owners and operators during March and April 2020.
The report confirms that datacentre downtime incidents continue to occur with “disturbing frequency”, with many operators neglecting to officially record the “worrying number” of smaller-scale outages they encounter.
“Outages in these categories signal bigger problems and are troubling more for their frequency than for their singular impact,” said the report.
At the same time, the report notes that larger outages are becoming increasingly damaging and expensive for operators to bounce back from, as the complexity of the systems housed inside the datacentre becomes ever greater.
“It is clear that outages occur with disturbing frequency, that bigger outages are becoming more damaging and expensive, and what has been gained in improved processes and engineering has been partially offset by the challenges of maintaining ever more complex systems,” the accompanying report states. “Avoiding downtime remains a top technical and management challenge for all owners and operators.”
A third of survey participants admitted to experiencing a major outage in the previous 12 months, with 31% stating this had led to “substantial financial and reputational” damage.
On this point, operators who reported suffering an outage were asked to estimate the total cost incurred by their most recent, significant downtime event, with one in six claiming their outage had cost them more than $1m. In the 2019 survey, that figure was one in 10.
Combined with the finding that 48% of outages now cost firms between $100,000 and $1m, up from 28% in 2019, the Uptime Institute said the data reinforces the view that outages are becoming increasingly expensive events for firms to overcome.
“This reflects a growing dependency on IT by all businesses and consumers, an increasing interdependency of a growing number of systems in real-time, and the immediacy of the impact of downtime on customers,” said the report.
The report also acknowledges that understanding the causes of outages is an important part of preventing them occurring, with three-quarters of respondents stating that their most recent outage was preventable.
That said, the report stated that getting to the bottom of why an outage occurred can be difficult, partly because of corporate secrecy, but also because operators fail to carry out thorough post-mortems into downtime events.
Even so, when respondents were asked about the primary cause of their most significant outage, on-site power problems remained the single biggest cause, followed by software and systems errors and network issues.
“In spite of many concerns to the contrary, problems at cloud or software-as-a-service (SaaS) providers cause only a small proportion of outages,” the report said.
“In recent years, our research has shown a growing proportion of outages are caused by software and network issues. While these incidents are more common and can be difficult to fix, many cause only minor problems. The impact of a power outage is wide and deep, and the knock-on effects can be long-lasting – even if the initial failure is quickly fixed.”
The report also states that most managers and operators hold themselves responsible for any outages that occur, with just a quarter claiming their most recent outage was down to reasons outside of their control.
“However, it is not clear if operators are openly learning from process problems or blaming their managers. It is also possible managers are blaming the operators – or all could be blaming executives for underinvestment,” the report added.
“Regardless, the findings point to the clear opportunity: with more investment in management, process and training, outage frequency could almost certainly fall significantly.”