Meet the men and women who keep AT&T's network running

Somewhere in the UK, a crack team sits waiting to spring into action to support, protect and repair AT&T’s global network

The first thing visitors to AT&T’s global network disaster recovery (NDR) facility need to know is that you can’t talk about AT&T’s global network disaster recovery facility. Indeed, to all intents and purposes, AT&T’s global network disaster recovery facility doesn’t exist.

All we are officially allowed to say is that it is “somewhere in the UK”. For carefully screened visitors prepared to sign a non-disclosure agreement, this means a breakneck cab ride through the countryside to an anonymous, out-of-the-way warehouse which, like Doctor Who’s Tardis, is much bigger on the inside.

The secrecy is necessary, says Justin Williams, director of AT&T disaster recovery, because the work his team carries out all over the world, at the drop of a hat – and for hat read point-of-presence (POP) site – can and does literally keep the internet running.

Williams’ team is responsible for every AT&T geography outside its domestic US market, including Canada. So how come it wound up in Britain?

“The UK has a very low risk profile for us in terms of its location,” says Williams. “It’s good value for money, and it’s close to our key engineers.”

AT&T used to run all its NDR operations out of the US, but decided to consolidate its foreign activity into one site to be able to quickly deploy from a major airport.

It picked the UK as much for its location as anything. Williams explains: “The vast majority of locations are reachable here as quickly as anywhere else.

“If we had put our equipment in Australia, we would still have to put it in an aircraft to move it around. If we were in Singapore and wanted to recover a node in Australia we’d still have to get to an airport.

“Our timescales on recovery aren’t measured so much in minutes as in multiples of hours, so you’re not losing that much time, and what we gain in maintenance and looking after the equipment far outweighs the disadvantages of time,” he says.

Risk assessment

The NDR team’s most fundamental mission is to keep AT&T’s global network available, all the time, and that mantra shapes how Williams and his crew approach their processes and procedures.

The AT&T network operates across hundreds of sites in 60 countries, and carries very large amounts of customer data. It is truly the business’ most critical asset, and so the NDR team is constantly probing the network, looking for faults and problems. These can range from something as mundane as a farmer ploughing up a cable to political turmoil or natural disaster.

“We have a risk assessment process that analyses every possible problem, not only from a technical perspective but from a geopolitical perspective, looking at natural and man-made threats and risks,” explains Williams.

“We’ll look across the whole network to get a risk profile of where we’re at, and then we’ll start planning and investing in various parts of the network.

“But to take a view that the whole network is at high risk is committing you to such a high level of investment that it almost makes it impossible, so what we do then is target the investment in high-risk countries. All the time we’re getting a view of the risk landscape to understand where we need to put our investment,” he says.

At the time of Computer Weekly’s visit to the AT&T facility, the firm was in close contact with staff in Hong Kong, where pro-democracy student protests were shutting down government functions and drawing the ire of Beijing, and the eyes of the world. Had the protests spilled over to affect the network, there was a good chance the NDR team would have been mobilised.

“Political and social events are frequent and we have to plan for those,” says Williams. “That is more challenging outside the US because of the diverse nature of the geopolitical environment, not only the logistical challenge of moving equipment across borders, import and export licences and so on, but also dealing with licensing issues.

“So in the same way we probe the network, we probe around the world as well, and we do that around events, such as the World Cup, where an enormous amount of activity went into securing the network.”

A study in network availability

  • Olympic Games, London, 2012: AT&T liaised with event organisers and local authorities in London and conducted extensive testing to ensure it would be able to cope with the demands made on the network by spectators, athletes and officials, and also to make sure it would have access to one of its POP sites, which was entirely surrounded by the marathon route.
  • Earthquake, Chile, 2010: After a magnitude 8.8 earthquake shook Chile, AT&T recreated and connected its Santiago site in a car park outside the city as a precautionary measure to ensure its network stayed up and running, even though the facility was not lost. The NDR team could not fly into Santiago itself, so it flew to another location and trucked the network for two days overland to get there.
  • Hurricane Katrina, New Orleans, 2005: AT&T teams deployed over 3,000 generators into a major disaster zone to maintain its network, but also to help provide emergency communications for the city government. Among its successes were keeping a local airfield operating, and supplying communications for a temporary jail facility.

An always-on backup network

So what does an emergency backup network look like? Apart from the fact that it’s all strapped to airline freight pallets or the back of a fleet of nondescript vans and lorries, it’s not all that different to a regular one.

In fact, says Williams, the two are identical. The UK NDR facility contains an always-on POP, one of the larger ones in AT&T’s global network, running as normal, 24 hours a day, 365 days a year. It just happens to have a couple of hundred wheels attached to it.

Its current UK facility holds 15 semi-trailers and a host of smaller support vehicles, which are able to deploy anywhere in mainland Europe and as far away as the Middle East, with a minimum of notice.

Once on site, the highly trained team – made up of managers, technicians and volunteers drawn from all over AT&T’s organisation – can deploy a high-capacity core router network, supporting data traversing the network at rates of up to 100Gbps.

At full whack, AT&T’s IP trailers can scale up to 15Tbps, enough to handle most eventualities, from a highly connected city that has been knocked offline, to large-volume data transfers between corporate client sites.

Before rolling out, the team will try to do as much work in the UK as possible to mirror exactly what has been lost, to be sure that it takes only as many trailers, with the right equipment on them, as is needed.

It doesn’t stop at network hardware, though. AT&T cannot rely on local power infrastructure in the event of a major disaster, so it carries its own generators, too.

“I don’t want to arrive in a country and be a burden to an already over-burdened infrastructure. We try to be as self-sufficient as possible,” says Williams.

“We’ll turn up with multiple generators, multiple ways of sourcing and finding fuels. We have tents, we have ration packs. We don’t need a hotel. We set up a site and we will live there for as long as necessary.”

Beyond ration packs, the auxiliary infrastructure carried by AT&T includes everything you would need for a successful camping trip.

This includes solar showers and portable toilet tents, medical equipment to cope with anything from a dodgy stomach to a full-blown cardiac arrest, and even emergency dental supplies. There will always be trained first-aiders and often a nurse on site to make sure the team on the ground is well cared for.

As a vital necessity, the command and control van also includes a very large tea urn.

Theoretically, the network AT&T rebuilds could be in place for a long time, particularly where the facility has been completely destroyed, as after the 9/11 terror attacks, and a whole new site needs to be deployed.

This could take anything up to a year and, although the NDR team will usually hand off to local personnel in a very lengthy deployment, there is no average timescale.

Exercising the volunteers

For Williams, a key part of building the team is to recruit the best volunteers he can. They come from all over AT&T’s organisation.

New volunteers joining the team are first exercised a few times on one of the NDR’s regular test deployments, which take place three or four times a year all over the world. This gives the permanent squad a chance to decide who has what it takes to join their ranks, and it can be a tough job. Volunteers may be called on to work in very hazardous situations, in extremes of heat or cold.

Successful applicants are trained up as needed across the equipment and processes, before being made available for 10 to 20 days a year. Places on the programme are in high demand.

“From a personal development perspective, to have a day job in HR, finance or marketing, and to be able to go away for a few days a year and do something different is very rewarding,” says Williams.

“I spent a long time as a tech product manager developing documents. Then this opportunity came along, and when you get something that’s tangible, the opportunity to build something you can touch and feel, it’s a very rewarding job.”
