How to avoid datacentre downtime

A supposedly "bullet-proof" IT infrastructure will always be vulnerable to electrical and mechanical problems. So it is important for IT directors to start talking to building services people.

Part of the problem is that new IT systems put a big load on existing datacentres in terms of their power requirements and the necessary cooling. A failure in power or cooling can quickly bring down a datacentre unless dual redundancy has been built in, with cooling and power systems doubled up.

About two years ago, Trinity Mirror Group's IT team started regular meetings with building services to address the issue of business continuity.

Trinity Mirror is based at Canary Wharf in London's Docklands, where power comes up the riser and feeds MGE UPS (uninterruptible power supply) equipment, which in turn supplies the critical plant for producing newspapers.

Allen Palmer, group building services manager at Trinity Mirror, said, "In many ways we are reactive to IT's needs, because power and cooling are not infinite in any IT department.

"Sooner or later, someone plugs something in which causes a crash. Technology is changing. Earlier computer rooms supplied cool air from under the floor, and the cabinets took cool air up and through the top from that pressurised floor.

"These days, blade servers take their cooling front-to-back, so it would be wonderful if IT said that everything would run on new technology.

"However, we still have 60% of the computer room using cool air from under the floor, and 40% and growing using front-to-back cooling.

"If we were designing the computer room on a greenfield site, we would design it differently, but most companies are not in that position and can only change gradually."

The UPS provides a 20-minute time window in which to shut down services in the event of a power failure.
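The 20-minute window can be treated as a budget that an ordered shutdown has to fit inside. The following is a minimal sketch, assuming services are stopped one after another. The service names and durations are hypothetical and are not Trinity Mirror's actual procedure.

```python
UPS_WINDOW_MIN = 20  # UPS runtime quoted in the article


def plan_shutdown(steps, window_min=UPS_WINDOW_MIN):
    """Lay out a sequential shutdown and check it against the UPS window.

    steps: list of (name, minutes) in the order the services are stopped.
    Returns (schedule, fits), where schedule is a list of
    (name, start_minute, end_minute) and fits is True when the last
    step finishes within window_min.
    """
    schedule = []
    clock = 0
    for name, minutes in steps:
        if minutes < 0:
            raise ValueError(f"negative duration for {name}")
        schedule.append((name, clock, clock + minutes))
        clock += minutes
    return schedule, clock <= window_min


# Hypothetical services and durations, for illustration only.
steps = [("editorial systems", 5), ("databases", 8), ("storage arrays", 4)]
schedule, fits = plan_shutdown(steps)
for name, start, end in schedule:
    print(f"{name}: minute {start} to {end}")
print("fits in UPS window:", fits)
```

If the total exceeds the window, the choices are to shorten or parallelise steps, or to extend the UPS runtime.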

No standby generation is available at Canary Wharf, but it is available at other sites used by the company, which include the Birmingham Post, Liverpool Post and South London Press.

If there is a power cut at Canary Wharf, Palmer said, "We can switch to a secondary incoming mains supply fed into different parts of the building by totally separate substations at Westferry Road and at Simpson's Road."

According to Gartner's Data Center Facilities Cost Survey: 2006 Update, costs have increased substantially since the last survey in 2002. The need for higher levels of reliability, coupled with increased mechanical and electrical investment, is driving the rise.

Increases can be attributed to three basic cost drivers. First, there is more emphasis on reliability. This concerns the need for redundant electrical and mechanical systems to eliminate single points of failure.

The higher the reliability tier a datacentre is built to, the greater the level of redundancy and, therefore, the higher the investment costs for duplicate electrical and mechanical systems, including multiple UPS systems, generators, multiple air-conditioning units, and the duplication of electrical cabling and piping for dual water distribution.

The second cost driver is the increased initial power rating of high-density computer equipment such as blade servers, storage area network equipment and disc arrays.

Greater power is required not only for denser IT equipment, but also for increased air conditioning systems to cool the incremental heat load.

Finally, more investment is required to harden the datacentre facility's structure in response to geographic or terrorist risks.

Obviously a lot of investment goes into keeping datacentres running. Datacentre operator Globix, for instance, runs its own electrical substation backed up by five large APC UPS installations and two Cummins 2.5MW diesel generators.

Paul Court, UK operations director at Globix, said, "We have just under 100,000 litres of fuel in tanks in the basement."

For the next level of resilience, Globix uses APC units within the datacentres. Power is taken straight off the grid from electricity supplier EDF at 11kV and transformed down. This enables Globix to reduce its electricity bill.

But even with this level of resilience, problems can occur. Globix experienced a power failure in late July because of a fault on a high-voltage switchgear panel.

One of the switches at its Prospect House datacentre had failed, even though it was indicating that it was functioning OK. It was a mechanical failure. The cams and springs inside the switch had failed. The result was that the datacentre drained one UPS, then the other, then stopped.

"The problem is not the fact that the switch failed, but that it signalled it was OK," said Court. "After a while, we managed to find out what it was and managed to get the pumps to the generator going again, and then hand-cranked the switches."

The switches needed servicing - something the manufacturer had not told Globix. The switches are now six or seven years old, and it has now come to light that some of them may fail, but the manufacturer has apparently still not contacted the owners to warn them.

Telehouse, a co-location facility for major internet companies, suffered a power failure at its Docklands site in August. The problem was caused by a circuit overload on one of the phase supplies, which had burned through, causing a breaker to trip.

Phil Lydford, sales and marketing director at Telehouse, said, "We had a failure, which fortunately happens very rarely, and which affected a limited number of customers.

"We now allow a greater margin between the stated capability of the switch and the load that we put through it, which will ensure that the breaker does not trip so easily."

Some of Telehouse's customers were unaffected because of the way they configured their equipment. The whole point of resilience is to ensure you are not subject to a single point of failure.
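The wider margin Telehouse now allows between a breaker's stated capability and the load put through it can be expressed as a simple check. This is a rough sketch only: the 80% ceiling and the ampere figures are illustrative assumptions, not figures from Telehouse.

```python
def within_margin(rating_amps, load_amps, max_utilisation=0.8):
    """Return True if the load stays at or below the derated breaker capacity.

    max_utilisation is the fraction of the stated rating the operator is
    willing to use; 0.8 is an assumed example value.
    """
    if rating_amps <= 0 or load_amps < 0:
        raise ValueError("rating must be positive and load non-negative")
    if not 0 < max_utilisation <= 1:
        raise ValueError("max_utilisation must be in (0, 1]")
    return load_amps <= rating_amps * max_utilisation


# Hypothetical circuit: 100A breaker, checked at two load levels.
for load in (75, 90):
    status = "ok" if within_margin(100, load) else "too close to rating"
    print(f"{load}A on a 100A breaker: {status}")
```

Lowering max_utilisation widens the margin at the cost of leaving more of each circuit's capacity unused.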

If a datacentre draws a load of 8MW, for example, the operator can install four 2MW generators. N+1 resilience would be provided by five 2MW generators (four 2MW, giving 8MW, and one spare 2MW generator), so if one generator failed or was shut down for maintenance, you would still have power. This allows one failure. Telehouse's strategy is to support up to two failures.
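The generator arithmetic above can be checked with a few lines of code. This is a minimal sketch using the article's 8MW load and 2MW units; the function names are made up for illustration.

```python
import math


def generators_needed(load_mw, unit_mw, spares=0):
    """Number of units for N+spares resilience.

    N is the smallest count of unit_mw generators that covers load_mw;
    spares is the number of extra units (1 for N+1, 2 for N+2).
    """
    if load_mw <= 0 or unit_mw <= 0 or spares < 0:
        raise ValueError("load and unit size must be positive, spares >= 0")
    return math.ceil(load_mw / unit_mw) + spares


def survives(load_mw, unit_mw, installed, failed):
    """True if the remaining generators can still carry the full load."""
    return (installed - failed) * unit_mw >= load_mw


n_plus_1 = generators_needed(8, 2, spares=1)  # four to carry the load, one spare
n_plus_2 = generators_needed(8, 2, spares=2)  # tolerates two failures
print("N+1 units:", n_plus_1, "| N+2 units:", n_plus_2)
print("N+1 with one unit down:", survives(8, 2, n_plus_1, failed=1))
print("N+1 with two units down:", survives(8, 2, n_plus_1, failed=2))
print("N+2 with two units down:", survives(8, 2, n_plus_2, failed=2))
```

With N+1, a second simultaneous failure leaves only 6MW available against an 8MW load, which is why a strategy that supports two failures needs a sixth unit.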

In some cases it may be necessary to provide what is called N+N resilience, where the datacentre provides twice the capacity it needs.

Lydford believes this level of resilience is not normally necessary for generators because they are extraordinarily reliable and run for years, so N+2 is accepted good practice for highly available datacentres.

"However, when you get to UPS equipment, then N+N is our design norm, so we cover all potential failures completely," he said. "We would run the same N+N configuration for things like static transfer switches, normally on dual power supplies."

It is clear that to achieve high availability in the datacentre, IT directors need to look not only at the applications and server infrastructure and service level agreements associated with the IT, but also at the non-IT infrastructure - the mechanical, electrical and plumbing systems that keep the datacentre operational.

IT directors need to pay particular attention to the health of their UPS systems. These control and condition the electrical power within the datacentre. Any failure here will put the datacentre offline, unless adequate redundancy is built in.

Sometimes the cause of a failure is human error, for example when staff have to work after hours and are tired. According to UPS supplier APC, a common problem is maintenance staff not following procedures step by step. "Skipping steps, which happens especially with well-versed personnel, can lead to downtime," it said.

Another common problem arises when system components are replaced even though they show no signs of wear or failure, which creates an opportunity to introduce new faults. Likewise, invasive checks that require the removal of other components can introduce problems.

So while technology and multiple levels of redundancy can limit the effect of failure, much of what keeps a datacentre going is down to the people. Many problems can be avoided simply by operating a two-person maintenance team.

Case study: Allen & Overy designs continuity into new systems

Allen & Overy is one of the world’s biggest law firms, with more than 4,900 employees across offices in 25 countries. In the summer of 2001, a faulty circuit breaker caused a major power failure from which the company took a month to completely recover.

In later years, it became apparent that Allen & Overy’s datacentre cooling systems were inadequate for the ever-faster servers it wanted to install. The company’s planned move to a purpose-built facility at Bishops Square in the City of London gave it the opportunity to rethink how it managed its datacentres.

Andrew Brammer, head of global IT operations at Allen & Overy, said, “We designed a datacentre in conjunction with APC and our business services department to allow us to grow in terms of the density of the equipment.

“Our day-one configuration is 3kW per rack, day two takes us up to 5kW, and with little modifications to the room, we can accommodate higher densities through blade servers to 7kW and higher by enclosing the aisles, and so on. We made that investment because, at the end of the day, we cannot move the walls out – we can only grow within the original confines of the room. We also outsourced our datacentre with Savvis to the Winnersh Triangle outside the M25.”

Under the terms of the five-year, £5.5m deal with Savvis, Allen & Overy was able to consolidate and centralise its UK-based IT infrastructure.
