Concerns about the cooling system setup in one of the datacentres used to host key healthcare systems for Guy’s and St Thomas’ NHS Foundation Trust were raised in 2018 and never fully acted on, a report into the heatwave-related server farm outage it suffered in the summer of 2022 has revealed.
As previously reported by Computer Weekly, the two datacentres the Trust relied on experienced cooling-related technical difficulties on Tuesday 19 July 2022, the day UK temperatures hit a record-breaking high of 40°C.
As confirmed by a 58-page review of the incident, published in late January 2023, the extreme temperatures the UK experienced that day led to two datacentres used to host the Trust’s 371 legacy IT systems overheating and malfunctioning.
The two sites, one located at Guy’s Hospital and the other at St Thomas’, were designed to act as backups for each other in the event of an IT failure, but – on 19 July 2022 – both sites suffered failures as a direct result of the UK heatwave.
The impact of the incident was felt for several months afterwards, with the report stating that the recovery was also hindered by an unrelated cyber attack in August 2022 on an external supplier the Trust relied on to host a medical records system.
“The Trust declared a critical site incident on 19 July and moved to implement a paper-based operating model to support clinical activity,” the report states. “The technical recovery of the IT systems took substantially longer than was anticipated at the outset, lasting several weeks before near complete restoration. The critical site incident was stood down on 21 September, having included management of the unrelated cyber attack on an external supplier from 4 August onwards.”
The document also confirmed the incident resulted in the Trust incurring £1.4m in unexpected IT costs, because it needed to enlist the help of a third-party data recovery service to extract information stored on servers damaged by the outage, and it also needed to create a new cloud-based data backup system.
A potentially preventable event
The report describes the datacentre outages as a “potentially preventable event” and says it is “self-evident” that the risk prediction, mitigations and reporting systems the Trust had in place were inadequate.
“This represents a failure of the Trust’s risk management processes to effectively mitigate the risk of datacentre failure,” the review states.
It also says that while the review found “no single, egregious failure” to point to as a root cause, its investigations suggest it was a combination of factors that led to the “catastrophic failure” of the Trust’s IT systems.
These factors include the age of its technology infrastructure, the “overly” complex nature of its datacentre estate, and the “sub-optimal cooling systems” it had in place.
A timeline of the incident, detailed in the report, reveals that concerns about the setup of cooling systems in the St Thomas’ datacentre were first flagged in August 2018 by a supplier, which identified that the site’s air conditioning condensers were not “optimally situated for air flow”.
A recommendation was made at the time for the condensers to be moved, but – while other mitigations were introduced – this change was not enacted.
The timeline also states that a review of the Guy’s datacentre by the same supplier initially suggested its air-handling units would be approaching end of life in “2021/2022”, although this assessment was later revised in February 2022 and extended by a further 12 months.
As a result of this assessment, a request for £195,000 in funding to install a replacement system was issued to the Trust in March 2022, but had not been approved by the time of the outage. This funding request has since been increased to £360,000 and approved, the review document confirmed.
“The Trust must never again allow itself to be in a situation where the recovery of its core IT systems, whether as a result of infrastructure failure, cyber attack or another cause, takes so long to complete,” the review states.
“As a result, the Trust must put in place a comprehensive strategic plan, backed by appropriate investment, to ensure future computer processing and data storage requirements are robust, able to meet growing demand and also resilient to foreseeable risks. These plans should include periodic and thorough testing of systems recovery.”
The Trust is on course to roll out a new electronic health record system in April 2023, according to the review, which will pave the way for a “rationalisation and consolidation” of its legacy IT systems and – it is hoped – bolster the resiliency of its IT systems.
On the point of resiliency, the report says the Trust “must prepare for the fact that climate change means extreme weather events are expected to become more frequent and challenging in future”. In response, it has confirmed it has commissioned expert advice on how to ensure its systems are better equipped to cope with such threats in future.