A supposedly "bullet-proof" IT infrastructure will
always be vulnerable to electrical and mechanical problems. So it
is important for IT directors to start talking to building services
people.
Part of the problem is that new IT systems put a big load on
existing datacentres in terms of their power requirements and the
necessary cooling. A failure in power or cooling can quickly bring
down a datacentre unless dual redundancy has been built in, with
cooling and power systems doubled up.
About two years ago, Trinity Mirror Group's IT team started
regular meetings with building services to address the issue of
business continuity.
Trinity Mirror is based at Canary Wharf in London's Docklands. Its
power comes up the riser and feeds MGE UPS (uninterruptible power
supply) equipment, which in turn supplies the critical plant for
producing newspapers.
Allen Palmer, group building services manager at Trinity Mirror,
said, "In many ways we are reactive to IT's needs, because power
and cooling are not infinite in any IT department.
"Sooner or later, someone plugs something in which causes a
crash. Technology is changing. Earlier computer rooms supplied cool
air from under the floor, and the cabinets took cool air up and
through the top from that pressurised floor.
"These days, blade servers take their cooling front-to-back, so
it would be wonderful if IT said that everything would run on new
technology.
"However, we still have 60% of the computer room using cool air
from under the floor, and 40% and growing using front-to-back
cooling.
"If we were designing the computer room on a greenfield site, we
would design it differently, but most companies are not in that
position and can only change gradually."
The UPS provides a 20-minute time window in which to shut down
services in the event of a power failure.
No standby generation is available at Canary Wharf, but it is
available at other sites used by the company, which include the
Birmingham Post, Liverpool Post and South London Press.
If there is a power cut at Canary Wharf, Palmer said, "We can
switch to a secondary incoming mains supply fed into different
parts of the building by totally separate substations at Westferry
Road and at Simpson's Road."
According to Gartner's Data Center Facilities Cost Survey: 2006
Update, costs have increased substantially since the last survey in
2002. The need for higher levels of reliability, coupled with
increased mechanical and electrical investment, is driving the
rise.
Increases can be attributed to three basic cost drivers. First,
there is more emphasis on reliability. This concerns the need for
redundant electrical and mechanical systems to eliminate single
points of failure.
The higher the datacentre's tier, the greater the level of
redundancy and, therefore, the higher the investment costs for
duplicate electrical and mechanical systems, including multiple UPS
systems, generators, multiple air-conditioning units, and the
duplication of electrical cabling and piping for dual water
distribution.
The second cost driver is the increased initial power rating of
high-density computer equipment such as blade servers, storage area
network equipment and disc arrays.
Greater power is required not only for denser IT equipment, but
also for the additional air conditioning needed to cool the extra
heat load.
Finally, more investment is required to harden the datacentre
facility's structure in response to geographic or terrorist
risks.
Obviously a lot of investment goes into keeping datacentres
running. Datacentre operator Globix, for instance, runs its own
electrical substation backed up by five large APC UPS installations
and two Cummins 2.5MW diesel generators.
Paul Court, UK operations director at Globix, said, "We have
just under 100,000 litres of fuel in tanks in the basement."
For the next level of resilience, Globix uses APC units within
the datacentres. Power is taken straight off the grid from
electricity supplier EDF at 11kV and transformed down. This enables
Globix to reduce its electricity bill.
But even with this level of resilience, problems can occur.
Globix experienced a power failure in late July because of a fault
on a high-voltage switchgear panel.
One of the switches at its Prospect House datacentre had failed,
even though it was indicating that it was functioning OK. It was a
mechanical failure. The cams and springs inside the switch had
failed. The result was that the datacentre drained one UPS, then
the other, then stopped.
"The problem is not the fact that the switch failed, but that it
signalled it was OK," said Court. "After a while, we managed to
find out what it was and managed to get the pumps to the generator
going again, and then hand-cranked the switches."
The switches needed servicing - something the manufacturer had
not told Globix. The switches are now six or seven years old, and
it has now come to light that some of them may fail, but the
manufacturer has apparently still not contacted the owners to warn
them.
Telehouse, a co-location facility for major internet companies,
suffered a power failure at its Docklands site in August. The
problem was caused by a circuit overload on one of the phase
supplies, which had burned through, causing a breaker to trip.
Phil Lydford, sales and marketing director at Telehouse, said,
"We had a failure, which fortunately happens very rarely, and which
affected a limited number of customers.
"We now allow a greater margin between the stated capability of
the switch and the load that we put through it, which will ensure
that the breaker does not trip so easily."
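The margin Lydford describes can be pictured as a simple headroom
check on each circuit. The following is only a minimal sketch: the
function name, the 80% ceiling and the ampere figures are assumptions
chosen for illustration and are not Telehouse's actual limits.

```python
def within_margin(load_amps: float, rated_amps: float,
                  max_utilisation: float = 0.8) -> bool:
    """Return True if the load leaves the required headroom on a breaker.

    max_utilisation is an assumed policy value (80% here), not a figure
    taken from the article.
    """
    if rated_amps <= 0:
        raise ValueError("rated_amps must be positive")
    if not 0 < max_utilisation <= 1:
        raise ValueError("max_utilisation must be in the range (0, 1]")
    # The permitted load is the stated rating scaled down by the margin.
    return load_amps <= rated_amps * max_utilisation


# Hypothetical circuits on a breaker with a stated rating of 100A.
print(within_margin(70, 100))   # load well inside the margin
print(within_margin(90, 100))   # load eats into the margin
```

Lowering max_utilisation widens the gap between the stated rating and
the permitted load, which is the change Telehouse describes.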
Some of Telehouse's customers were unaffected because of the way
they configured their equipment. The whole point of resilience is
to ensure you are not subject to a single point of failure.
If a datacentre draws a load of 8MW, for example, the operator
can install four 2MW generators. N+1 resilience would be provided
by five 2MW generators (four 2MW, giving 8MW, and one spare 2MW
generator), so if one generator failed or was shut down for
maintenance, you would still have power. This allows one failure.
Telehouse's strategy is to support up to two failures.
In some cases it may be necessary to provide what is called N+N
resilience, where the datacentre installs twice the capacity it
needs.
Lydford believes this level of resilience is not normally
necessary for generators because they are extraordinarily reliable
and run for years, so N+2 is accepted good practice for highly
available datacentres.
"However, when you get to UPS equipment, then N+N is our design
norm, so we cover all potential failures completely," he said. "We
would run the same N+N configuration for things like static
transfer switches, normally on dual power supplies."
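The arithmetic behind the N+1, N+2 and N+N schemes described above can
be sketched in a few lines. This is an illustrative sketch only: the
function names and the scheme strings are assumptions made for the
example, and the 8MW load and 2MW unit size are the figures used in
the text.

```python
import math


def units_required(load_mw: float, unit_mw: float) -> int:
    """N: the minimum number of units needed to carry the full load."""
    return math.ceil(load_mw / unit_mw)


def units_installed(load_mw: float, unit_mw: float, scheme: str) -> int:
    """Number of units to install for a scheme such as 'N+1' or 'N+N'."""
    n = units_required(load_mw, unit_mw)
    if scheme == "N":
        return n
    if scheme == "N+N":
        return 2 * n              # twice the capacity actually needed
    if scheme.startswith("N+"):
        return n + int(scheme[2:])  # N plus a fixed number of spares
    raise ValueError(f"unknown scheme: {scheme}")


def survives(load_mw: float, unit_mw: float,
             installed: int, failures: int) -> bool:
    """True if the remaining units can still carry the load."""
    return (installed - failures) * unit_mw >= load_mw


for scheme in ("N+1", "N+2", "N+N"):
    count = units_installed(8, 2, scheme)
    print(scheme, count, survives(8, 2, count, failures=2))
```

For an 8MW load and 2MW generators this gives five units for N+1, six
for N+2 and eight for N+N. Only the last two still carry the load
after two simultaneous failures.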
It is clear that to achieve high availability in the datacentre,
IT directors need to look not only at the applications and server
infrastructure and service level agreements associated with the IT,
but also at the non-IT infrastructure - the mechanical, electrical
and plumbing systems that keep the datacentre operational.
IT directors need to pay particular attention to the health of
their UPS systems. These control and condition the electrical power
within the datacentre. Any failure here will put the datacentre
offline, unless adequate redundancy is built in.
Sometimes the cause of a failure is human error, for instance when
staff have to work after hours and are tired. According to UPS supplier APC, a
common problem is when maintenance staff do not follow procedures
step by step. "Skipping steps, which happens especially with
well-versed personnel, can lead to downtime," it said.
Another common problem arises when system components are replaced
even though they show no signs of wear or failure. This creates an
opportunity to introduce new faults. Likewise, invasive checks
that require the removal of other components can introduce
problems.
So while technology and multiple levels of redundancy can limit
the effect of failure, much of what keeps a datacentre going is
down to the people. Many problems can be avoided simply by
operating a two-person maintenance team.
Case study: Allen & Overy designs continuity into new systems
Allen & Overy is one of the world’s biggest law firms, with
more than 4,900 employees across offices in 25 countries. In the
summer of 2001, a faulty circuit breaker caused a major power
failure from which the company took a month to completely
recover.
In later years, it became apparent that Allen & Overy’s
datacentre cooling systems were inadequate for the ever-faster
servers it wanted to install. The company’s planned move to a
purpose-built facility at Bishops Square in the City of London gave
it the opportunity to rethink how it managed its datacentres.
Andrew Brammer, head of global IT operations at Allen &
Overy, said, “We designed a datacentre in conjunction with APC and
our business services department to allow us to grow in terms of
the density of the equipment.
“Our day-one configuration is 3kW per rack, day two takes us up
to 5kW, and with little modifications to the room, we can
accommodate higher densities through blade servers to 7kW and
higher by enclosing the aisles, and so on. We made that investment
because, at the end of the day, we cannot move the walls out – we
can only grow within the original confines of the room. We also
outsourced our datacentre with Savvis to the Winnersh Triangle
outside the M25.”
Under the terms of the five-year, £5.5m deal with Savvis, Allen
& Overy was able to consolidate and centralise its UK-based IT
infrastructure.