Blackrock: measuring risk in a datacentre

How the investment giant handles the risk of managing $13 trillion of other people’s money

Blackrock is the world’s largest investment house. The company, which controls an investment pile of more than $4tn on behalf of its clients, operates in 90 locations across 30 countries. It also provides the technology platform for a further $9tn of funds managed by other asset houses. 

Blackrock's job is to measure risk on behalf of its clients and ensure it delivers the best possible return on their investments. For this it relies on technology – more specifically, on the 25 million lines of code within its Aladdin software platform.

That platform, which serves 3,000 analysts and 19,000 users, resides in a series of datacentres around the US and the world. The firm’s dealers and analysts measure risk in the markets. The company itself measures risk in its datacentres.

The global financial markets are characterised by speed: instant information, faster data feeds, lower latency and, ultimately, immediate returns. But Blackrock plays down any obsession with speed – “we’re not a latency play” – although it acknowledges that the technology underpinning its operations, all the way down to its uninterruptible power supplies (UPSs), is changing rapidly. Herb Tracy, the firm’s global head of critical engineering, says: “Today I wouldn’t design a datacentre the way I would have even three months ago.”

For an industry that until five years ago hadn’t changed much in decades, that’s quite a statement. For an engineer whose job is to guarantee uptime for a company in a sector where risk is everything, it is an acknowledgement of the challenges he faces in criticality, sustainability and efficiency. 

People, says Tracy, are products of their experiences. In the past, if an engineer did something 20 years ago and it worked, they would be reluctant to change it. This is no longer the case.

The cost of energy is the big line item. Blackrock tries to keep the mechanical chillers switched off for as long as possible. Last year it used mechanical cooling for fewer than 40 hours at its West Coast datacentre.

Managing risk

“We don’t just go out there and measure risk," Tracy says. "We manage risk from the beginning stages. In datacentre terms, that starts with site selection. The due diligence begins with the routine checks for gas lines, railroad tracks, highways and flight paths. 

"The geography is checked for stability and flood risk. We do a lot of diligence to reduce risk. We don't automatically build a tier three or tier four datacentre. We evaluate and build the required resiliency for the applications that will be hosted in that particular facility,” Tracy says.

Another obvious requirement is security of power supply, but what may not be as obvious is a shift to ‘green’ energy. “Recently we’ve tried to stay away from nuclear and coal-powered datacentres. Sustainability is vital to our business and the communities in which we work and live. 

"We manage a lot of money on behalf of our clients and we host many large money owners as third-party clients on our Aladdin software platform. In addition to being an investment manager Blackrock Solutions is a business division which delivers technology services to other large money owners, so we get a lot of questions about sustainability and we take this responsibility very seriously.”

Engineered risk

In engineering risk terms, at first glance Blackrock appears to deliver fairly standard resilience. As a minimum standard Tracy says it offers "n+1" resiliency. “We generally have two independent power trains which have their own generators and UPS. Our independent emergency generator and UPS can handle half of the load out on the A and B cord. We could lose an entire generator line and still have ample cover to run the full operation,” he says.

After the company bought Barclays Capital in 2010 it had 28 datacentres around the world. Then it began a migration strategy to move to a single platform. Today Blackrock has 11 datacentres and the plan is to get that down to six or eight.

As this consolidation project continues, the fleet comprises a mixture of owned and operated facilities and wholesale colocation space (such as at the Sabey campus in Wenatchee, Washington State).

“On our US team, we have very strong engineering and IT skills which allows most of our datacentres to be owner-operated. In EMEA and Asia, we generally set up datacentres so that production is colocated and disaster recovery is in owned and operated sites. Our relationship with Sabey is a wholesale lease,” Tracy says. Anything that touches the business is controlled by Blackrock. It maintains a full-time staff on the site.

Being in a multi-tenant facility, Blackrock built out its presence at Sabey in pods. To give an idea of the scale of the operation, in 2010 Sabey and Blackrock both sought permission for three 2.5MW diesel-fired generators for a portion of the site. Sabey operates a datacentre adjacent to the Blackrock facility. 

Sabey wanted three 2.5MW diesel-fired generators, while VMware, which is inside the same shell, already had permits for 10 diesel-fired generators (2MW each), with just three installed at the time. This is alongside the existing T-Mobile datacentre in an adjacent building on a neighbouring parcel. The T-Mobile datacentre gained permission to install and operate up to 20 diesel-fired generators (2MW each).

“Using indirect evaporative cooling and UPSs with highest efficiencies we have been able to run Wenatchee at 1.18 PUE (power usage effectiveness). We are better than industry average now and we think we can beat 1.1,” Tracy says.
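For readers unfamiliar with the metric, PUE is conventionally defined as total facility power divided by the power delivered to IT equipment, so a PUE of 1.18 implies roughly 0.18W of cooling, power-conversion and other overhead for every watt of IT load. The short Python sketch below illustrates that arithmetic. The 1,000kW IT load is a hypothetical figure chosen for illustration, not a number from Blackrock, and the function names are our own.

```python
def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw


def overhead_kw(it_load_kw: float, pue_value: float) -> float:
    """Non-IT power (cooling, UPS losses, lighting) implied by a given PUE."""
    if pue_value < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_load_kw * (pue_value - 1.0)


# Hypothetical IT load, for illustration only (not a figure from the article).
IT_LOAD_KW = 1000.0

for label, value in [("reported", 1.18), ("target", 1.10)]:
    print(f"{label} PUE {value:.2f}: ~{overhead_kw(IT_LOAD_KW, value):.0f} kW overhead")
```

Under that assumed load, moving from 1.18 to 1.1 would trim the non-IT overhead from about 180kW to about 100kW.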

Blackrock is at pains to point out that compared with other financial institutions, where the tendency is to run thousands of applications, it tries to move everything to a single platform. “We have one main application platform (Aladdin) and a couple of other supporting ones,” Tracy says.

Generally production is in one location and disaster recovery runs concurrently in another. East Coast production might flip to the West Coast. The traders and portfolio managers never notice where the workload is running.

In terms of risks associated with the public cloud, nothing is discounted without reason. But Blackrock has a rule that there will be no client information put on to a public cloud. The public cloud might, however, be used for some development work.

How the team is structured

Tracy’s team is built on a global functional matrix. Regional managers are appointed around the world and the team also uses third-party engineering companies.

“We elected to combine MEP [mechanical, electrical and plumbing] and IT teams so they are part of one group in our East Wenatchee datacentre. In a traditional datacentre the MEP team is responsible for the datacentre envelope and then hands off to the technology teams. Now we all report to the same business head, the COO. Because of that we avoid any facilities/IT conflicts that can be typical in other organisations,” Tracy says.

“The relationship with IT is very collaborative and works very well. How and where we put IT kit in the facility is discussed well in advance of any decisions being made. People with mechanical expertise not only learn the core functions of their own responsibilities but are tasked with being multi-disciplinary. Everyone is required to learn all the skills. For example, we have a switch gear specialist who is required to run fibre and patch servers.”

Build risk

On construction projects, the close control Blackrock maintains also points to its understanding of where the risk should lie.

“We’re not the type of firm that issues specifications, seeks bids, or hires a design engineering firm to build our datacentres. We build the preliminary design, we select the equipment. We retain influence, both direct and indirect, over the choice of equipment and oversee the entire process,” Tracy says.

“Now we want to know the total maintenance cost for the 15-year life span of the equipment. Lower upfront costs are not the key investment driver. The boxes themselves are more resilient. In the old days it was about screwdriver-tweaking the system. Now boxes are digital and remotely controlled.”

Mechanical engineering is no more or less critical, but the technology is changing rapidly at both the mechanical/electrical level and the IT architecture level.

Tracy offers the example of efficiency at various loads on a UPS system. Five years ago the best results were 92% efficiency at 100% load. Now “we have 97% efficiency at 100% double conversion and it doesn’t drop off until we go below 20% load,” he says.
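To put those percentages in perspective, UPS efficiency is output power divided by input power, so the loss is the difference between what the unit draws and what it delivers. The Python sketch below works through the two full-load figures quoted above. The 1,000kW load is a hypothetical value for illustration, and the function name is ours rather than anything from Blackrock.

```python
def ups_loss_kw(output_kw: float, efficiency: float) -> float:
    """Power dissipated inside a UPS delivering output_kw at the given efficiency.

    Efficiency is output power / input power, expressed as a fraction in (0, 1].
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    input_kw = output_kw / efficiency
    return input_kw - output_kw


LOAD_KW = 1000.0  # hypothetical load, for illustration only

old_loss = ups_loss_kw(LOAD_KW, 0.92)  # best result five years ago, per the article
new_loss = ups_loss_kw(LOAD_KW, 0.97)  # double-conversion figure Tracy quotes

print(f"loss at 92%: {old_loss:.1f} kW")
print(f"loss at 97%: {new_loss:.1f} kW")
print(f"reduction:   {100 * (1 - new_loss / old_loss):.0f}%")
```

On these assumptions, the five-point efficiency gain cuts conversion losses from about 87kW to about 31kW, a reduction of close to two-thirds.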

Personal risk

As at any other financial services organisation, Blackrock’s datacentre operations are subject to external auditing. Maintenance records, repair records and alarm systems are checked regularly. The audits cover topics such as how individual alarms are communicated, the initial response and how incidents get closed out.

The company has an in-house designed datacentre infrastructure management (DCIM) system and Tracy is the man on the frontline. His phone is the one that rings if a problem should arise. “Every alarm in the world eventually goes to me after the local teams are engaged,” he says. “If there is something that could potentially impact our availability I will be informed immediately.”

A longer version of this article appears in the March/April 2014 edition of DatacenterDynamics Focus
