Companies are investing hundreds of thousands of pounds in high-availability systems for datacentres but are failing to follow best practice maintenance procedures to avoid having a single point of failure.
Even though the IT within datacentre sites can offer 99.99% availability and no single point of failure, IT directors are failing to assess the risk of human error in mechanical, electrical and IT systems, said Mick Dalton, chairman of the British Institute of Facilities Managers.
Dalton, who is also group operations director at Global Switch, has seen examples of downtime arising from someone simply plugging in a device not approved for the datacentre.
Datacentres have been brought down by incidents as mundane as a janitor plugging in a vacuum cleaner and IT staff plugging in radios with faulty power cables.
To prevent such problems occurring, Dalton recommended that IT directors ensure T-bar power sockets and plugs are used throughout the datacentre.
Problems can also arise due to poor maintenance practices. Keysource, an electrical engineering consultancy specialising in datacentres, said, "Essential, ongoing service and maintenance often falls short of the rigorous regime required to deliver high levels of availability."
IT directors should also check the reliability of the datacentre's water cooling system, said Mark Seymour, a director at Future Facilities, which provides thermodynamics modelling tools to identify hotspots in cooling systems. This is because an inability to cool hot equipment will cause servers to shut down due to "thermal shock".
All this shows that IT directors need to assess IT, electrical and mechanical systems, people and processes in order to have a thorough understanding of where points of failure can occur and how the risks can be minimised.
IDC analyst Claus Egge said, "The only way to ensure that fall-back plans work is to test them." He noted that many IT sites do not test. "Even if a site tests regularly, it may not have tested the exact events that cause downtime."
Over the summer there have been several high-profile datacentre glitches, resulting in downtime. Co-location provider Telehouse suffered an outage on 17 August at its London Docklands datacentre due to a phased supply that had burnt through.
CSC had a datacentre power failure at its Maidstone facility on 30 July, which led to the disruption of NHS IT services in the North West and West Midlands.
The glitch at CSC's datacentre began when maintenance work on the uninterruptible power supply (UPS) system caused a short circuit. The circuit breakers tripped, causing a total loss of power that lasted 45 minutes.
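To put the 99.99% availability figure in context, the short Python sketch below works out how much downtime that level allows in a year and how much of it a single 45-minute outage would consume. It is illustrative only: the function names are hypothetical, a 365-day year is assumed, and nothing here suggests the CSC facility was rated at 99.99%.

```python
# Assumption: availability is measured over a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year allowed at the given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def budget_used(outage_minutes: float, availability_pct: float) -> float:
    """Fraction of the yearly downtime allowance consumed by one outage."""
    return outage_minutes / downtime_budget_minutes(availability_pct)


budget = downtime_budget_minutes(99.99)
share = budget_used(45, 99.99)
print(f"99.99% availability allows about {budget:.1f} minutes of downtime a year")
print(f"A 45-minute outage uses about {share:.0%} of that allowance")
```

On these assumptions, 99.99% availability leaves roughly 52.6 minutes of downtime a year, so one 45-minute power loss would use about 86% of the annual allowance.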
Why datacentres fail
- Inadequate risk assessments and method statements that do not address and mitigate the real potential for downtime
- Incorrect circuit breaker settings
- Poor co-ordination between datacentre managers, facilities management and IT
- Insufficient live testing of real failure situations
- Not enough built-in redundancy
- Inexperienced, unsupervised engineers carrying out essential service and maintenance
- Limited UPS battery testing and performance monitoring to measure gradual failure of components
- Lack of planning for multiple concurrent events or cascade failure.