How HPC and AI are driving physical change in datacentres

In this follow-up guest post Paul Finch, CEO of Harlow-based colocation provider Kao Data, sets out how datacentre designs are having to change to accommodate evolving chip densities and increasingly data-heavy workloads.

As a datacentre operator, we view two concurrent pathways leading us into the future. The first is the building and infrastructure, while the second is the compute power and connectivity contained within. These paths do not run parallel, but touch, overlap and interweave, especially in recent times as we see the interesting transition between the ‘edge’ and ‘core’ of compute.

Today the datacentre sector must provide expert understanding of the most advanced computing system requirements needed to host and power AI applications. Externally, this is driven by low-latency connectivity and high-capacity dark fibre routes that distribute the data exactly as and when it is required.

Internally, high-throughput data networks must provide node-to-node interconnectivity using technologies such as InfiniBand from Mellanox/NVIDIA for direct support of High Performance Computing (HPC), linking the clusters required for parallel processing workloads.
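
To make the idea of node-to-node working concrete, below is a minimal sketch of processes in a cluster exchanging data over MPI using the mpi4py library – the kind of traffic an InfiniBand fabric typically carries. It is illustrative only: the job layout and payload are assumptions, and nothing here is specific to any particular facility or vendor.

```python
# Minimal sketch of node-to-node communication in an HPC cluster via MPI
# (mpi4py). Assumes an MPI installation and a high-speed interconnect such
# as InfiniBand beneath it; the payload and job size are illustrative.
from mpi4py import MPI

comm = MPI.COMM_WORLD      # communicator spanning every process in the job
rank = comm.Get_rank()     # this process's ID within the job
size = comm.Get_size()     # total number of processes

if rank == 0:
    # Rank 0 hands a chunk of work to each peer process.
    for peer in range(1, size):
        comm.send({"chunk": peer}, dest=peer, tag=11)
else:
    # Every other rank receives its chunk over the interconnect.
    work = comm.recv(source=0, tag=11)
    print(f"rank {rank} received {work}")
```

Run with, for example, `mpirun -n 4 python cluster_sketch.py`; in a real parallel workload the payloads would be model shards or simulation domains rather than a small dictionary.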

What has become clearly apparent is that within HPC, one size does not fit all. Customers with HPC-based applications need customisable architectures that are future-proofed to flex and scale as hardware and server densities change.

GPUs are further supplemented to boost performance, and storage requirements evolve. Power provision has to meet the demand of the most intensive forms of AI, such as deep neural networks, machine vision and natural language translation. This surge in energy also requires cooling technology able to cope with significant increases in heat from the latest generation of processors and their associated electronics.

Liquid cooling direct to the chip will increasingly become the norm, and datacentres will need to be ‘plumbed in’ to cater for this transition.

The Uptime Institute recently confirmed that the average Power Usage Effectiveness (PUE) ratio for a datacentre in 2020 is 1.58, only marginally better than it was seven years ago.

At a time when average industry PUE ratings appear to have plateaued, clearly we must pay closer attention to the changing needs of our customers, ensuring a keen eye is kept on potentially escalating energy use and carbon emissions.
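
For readers unfamiliar with the metric, PUE is simply total facility energy divided by the energy delivered to IT equipment, so a value of 1.58 means roughly 0.58kW of cooling, power-distribution and other overhead for every 1kW of compute. The short sketch below works that through; the 1MW IT load is an illustrative assumption, not a figure from the Uptime Institute survey.

```python
# Worked example of what a PUE of 1.58 implies for facility overhead.
# PUE = total facility energy / IT equipment energy. The 1 MW IT load
# below is an illustrative assumption.
pue = 1.58
it_load_kw = 1_000                     # assumed 1 MW of IT load
total_facility_kw = it_load_kw * pue   # total draw, including cooling and losses
overhead_kw = total_facility_kw - it_load_kw

hours_per_year = 8_760
overhead_gwh = overhead_kw * hours_per_year / 1e6

print(f"Facility draw: {total_facility_kw:.0f} kW")
print(f"Non-IT overhead: {overhead_kw:.0f} kW (~{overhead_gwh:.1f} GWh per year)")
```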

What enterprises want from a colo datacentre

Situated in the heart of the London-Stansted-Cambridge Innovation Corridor – one of the UK’s hotbeds for HPC and AI use – many of our customer conversations revolve around their specific requirements, which we group into three main narratives:

  1. HPC and GPU-powered AI require specialist compute capabilities, which are exceptionally power-hungry and reliant on additional infrastructure technologies. They require specialist interconnect, dedicated server or chip cooling, and tiered storage such as Hierarchical Storage Management (HSM) to support the high throughput needs of applications whilst optimising the cost of the overall system.
  2. Legacy datacentres, which make up 95% of currently available UK facilities, were not designed to support HPC compute and its infrastructure. Most were designed for low-density enterprise servers drawing less than 10kW per rack – ‘plug and play’ workloads compared with HPC’s ‘bespoke’ needs of 50-80kW per rack (see the sketch after this list). Many traditional datacentres rely on mechanically chilled, air-cooled strategies, which are expensive to run and inefficient at cooling HPC environments.
  3. Specialist datacentres, inspired by hyperscale and Open Compute Project (OCP)-Ready infrastructure, provide slab floors for heavier compute-dense servers, and wide, column-free data halls with step-free access to optimise room layout. Overhead power and connectivity infrastructure enables customisation, whilst efficient, industrial-scale direct-to-chip liquid and hybrid air cooling maximises heat extraction and energy efficiency, thus reducing Operational Expenditure (OpEx).
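
The density gap in point two is easiest to see with some back-of-the-envelope arithmetic. The sketch below uses the 10kW and 50-80kW per-rack figures quoted above; the 100-rack hall is an illustrative assumption.

```python
# Back-of-the-envelope comparison of hall-level IT load for legacy versus
# HPC rack densities. The per-rack figures come from the text; the
# 100-rack hall size is an illustrative assumption.
racks = 100

legacy_kw_per_rack = 10       # low-density enterprise racks (<10kW)
hpc_kw_per_rack = (50, 80)    # HPC/AI racks at 50-80kW

legacy_mw = racks * legacy_kw_per_rack / 1000
hpc_low_mw, hpc_high_mw = (racks * d / 1000 for d in hpc_kw_per_rack)

print(f"Legacy hall: ~{legacy_mw:.1f} MW of IT load")
print(f"HPC hall:    ~{hpc_low_mw:.1f}-{hpc_high_mw:.1f} MW of IT load")
# Every kilowatt of that IT load reappears as heat the cooling system
# must remove, which is why air-only legacy designs struggle.
```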

AI applications require bigger, hotter chips that can change the form factor of servers, which impacts the design of racks, chassis and enclosures. These systems can consume up to 50kW per server, and GPU performance on this scale generates heat that air cooling cannot extract efficiently.
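
A rough heat-balance calculation shows why. Using the standard relationship Q = ρ·V·cp·ΔT for air, removing a 50kW heat load at a typical 10°C supply-to-return temperature rise demands enormous airflow; the temperature rise and air properties below are common rule-of-thumb assumptions rather than figures from any specific deployment.

```python
# Rough sketch: airflow needed to remove a given heat load with air alone,
# using Q = rho * V * cp * dT. The 50kW load comes from the text; the 10 degC
# air temperature rise and standard air properties are assumptions.
heat_load_w = 50_000       # 50kW of heat to remove
delta_t_k = 10.0           # assumed supply-to-return air temperature rise (K)
rho_air = 1.2              # kg/m^3, air density at ~20 degC
cp_air = 1005.0            # J/(kg*K), specific heat capacity of air

flow_m3_s = heat_load_w / (rho_air * cp_air * delta_t_k)
flow_cfm = flow_m3_s * 2118.88   # convert m^3/s to cubic feet per minute

print(f"~{flow_m3_s:.1f} m^3/s of air (~{flow_cfm:,.0f} CFM) for this one load")
```

Liquid coolants carry far more heat per unit volume than air, which is why direct-to-chip liquid cooling becomes attractive at these densities.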

This change in processor power and the increase in energy usage demonstrate the need for datacentre operators to collaborate with key industry organisations, such as ASHRAE TC9.9, the OCP and the Infrastructure Masons.

Chip and GPU manufacturers such as AMD, Intel and NVIDIA are also members of these organisations and contribute to the development of guidelines that form the basis of much of the best practice in our industry.

Involvement in these committees provides detailed insight into future roadmaps, including the capabilities required from datacentres to drive optimisation in the most effective and efficient environments.

In my opinion, being involved in key industry committees ensures that Kao Data remains on the crest of the wave: first to gain insight into where the technology is heading, so our campus design can continually be aligned to ensure technical excellence.
