Compute strategies for AI applications

Devising compute strategies for AI applications can be challenging. Find out about the hardware, network and software frameworks available

Chirag Dekate and Arun Chandrasekaran

Published: 04 Feb 2019

Machine learning compute infrastructures are aimed primarily at organisations looking to build infrastructure stacks on-premise. Six core capabilities are needed in machine learning compute infrastructures to enable high-productivity artificial intelligence (AI) pipelines involving compute-intensive machine learning and deep neural network (DNN) models.

Compute acceleration technologies such as graphics processing units (GPUs) and application-specific integrated circuits (ASICs) can dramatically reduce training and inference time in AI workloads involving compute-intensive machine learning techniques and DNNs. Accelerators should be picked to match application needs, and frameworks must be configured for those specific accelerators to use their capabilities.

While there are diverse accelerator technologies in this market, including NEC Aurora Vector Engine, AMD GPUs and Nvidia GPUs, only a few of them are widely supported by machine learning and DNN frameworks. Currently, the DNN training ecosystem is dominated by Nvidia GPUs, because their high-performance hardware offers unique capabilities such as tensor cores and NVLink. There is also a high degree of software integration all the way from libraries to frameworks.

Accelerator density

Compute-intensive machine learning and DNN frameworks are scale-up-oriented. A higher number of accelerators in each compute node can dramatically reduce training times for large DNNs. Compute platforms addressing this market vary widely in accelerator density. Most suppliers support four accelerators per compute node, while performance-oriented configurations feature eight accelerators per compute node. In GPU-accelerated compute systems, some vendors offer 16-GPU compute nodes.
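As a rough illustration of what accelerator density means for system sizing, the sketch below counts how many compute nodes are needed to host a given number of accelerators at the densities mentioned above. The target of 64 accelerators is a hypothetical figure chosen for the example, not a recommendation.

```python
import math


def nodes_needed(total_accelerators: int, accelerators_per_node: int) -> int:
    """Return the number of compute nodes needed to host the accelerators."""
    if total_accelerators < 0 or accelerators_per_node <= 0:
        raise ValueError("counts must be non-negative and density positive")
    return math.ceil(total_accelerators / accelerators_per_node)


# Hypothetical target: 64 accelerators in total (an assumption for illustration).
TARGET_ACCELERATORS = 64

for density in (4, 8, 16):
    nodes = nodes_needed(TARGET_ACCELERATORS, density)
    print(f"{density} accelerators per node: {nodes} nodes")
```

Fewer, denser nodes keep more of the accelerator-to-accelerator traffic inside a node, which is one reason scale-up configurations are attractive for training.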

While the most common approach to scaling in compute-intensive machine learning and DNN frameworks tends to be scale-up-oriented, early adopters are also curating scale-out strategies. Uber’s Horovod enables distributed deep learning for DNN frameworks such as TensorFlow and PyTorch. IBM’s Distributed Deep Learning and Elastic Distributed Training are also designed to deliver scale-out capability when model size and complexity grow.

The Nvidia Collective Communications Library (NCCL) also provides multi-GPU and multi-node communication primitives on which DNN frameworks can build. When selecting scale-out strategies, it is best to choose solutions that are pre-optimised, easy to deploy and minimise total cost of ownership.
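Scale-out training of this kind generally relies on a collective operation that averages gradients across workers. One well-known way to implement it is a ring all-reduce. The sketch below is a pure-Python simulation of that idea, written for illustration only: it is not code from Horovod, NCCL or IBM's libraries, and the function name `ring_allreduce_mean` is hypothetical.

```python
from typing import List


def ring_allreduce_mean(worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate a ring all-reduce that leaves every worker with the mean gradient."""
    n = len(worker_grads)
    size = len(worker_grads[0])
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [(i * size) // n for i in range(n + 1)]
    data = [list(g) for g in worker_grads]

    # Phase 1, reduce-scatter: at each step every worker passes one chunk
    # to its right-hand neighbour, which adds it to its own copy.
    for step in range(n - 1):
        outgoing = []
        for rank in range(n):
            chunk = (rank - step) % n
            lo, hi = bounds[chunk], bounds[chunk + 1]
            outgoing.append((lo, data[rank][lo:hi]))
        for rank in range(n):
            lo, payload = outgoing[rank]
            dst = (rank + 1) % n
            for k, value in enumerate(payload):
                data[dst][lo + k] += value

    # Worker r now holds the complete sum for chunk (r + 1) % n.
    # Phase 2, all-gather: completed chunks are passed around the ring
    # and overwrite the receiver's partial values.
    for step in range(n - 1):
        outgoing = []
        for rank in range(n):
            chunk = (rank + 1 - step) % n
            lo, hi = bounds[chunk], bounds[chunk + 1]
            outgoing.append((lo, data[rank][lo:hi]))
        for rank in range(n):
            lo, payload = outgoing[rank]
            dst = (rank + 1) % n
            data[dst][lo:lo + len(payload)] = payload

    # Divide the summed gradients by the worker count to get the mean.
    return [[value / n for value in vec] for vec in data]


# Three simulated workers, each with its own local gradient vector.
grads = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],
    [3.0, 6.0, 9.0, 12.0, 15.0],
]
for rank, averaged in enumerate(ring_allreduce_mean(grads)):
    print(f"worker {rank}: {averaged}")
```

In this scheme each worker sends and receives only one chunk per step, so the data sent per worker stays roughly constant as workers are added, which is why the pattern suits multi-GPU and multi-node scaling.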

Because of the high density of accelerators, the manner in which the accelerators are connected to the compute node and how the compute node components interplay with accelerators can dramatically affect performance in compute-intensive machine learning- and DNN-based workloads.

Data ingestion and data exchange are the two types of data movement operation that commonly occur. Data ingest and copy operations to load input data are data movement-intensive and usually require direct involvement of the CPU. As a result, high-bandwidth bus architectures between the CPU and the accelerators are crucial to prevent data bottlenecks. x86-based compute systems use PCIe Gen3 (PCIe 3.0) connectivity between CPUs and GPUs. The IBM Power processor natively supports Nvidia NVLink, which enables higher-bandwidth connectivity than PCIe 3.0 interconnects. As a result, systems featuring CPUs with native NVLink support can deliver high-bandwidth connectivity between the base CPU and Nvidia GPUs.
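To put the CPU-to-accelerator bandwidth point in numbers, the sketch below estimates how long it takes to copy one input batch to an accelerator. The PCIe 3.0 figure is derived from the specification's 8 GT/s per lane and 128b/130b encoding, assuming a x16 slot. The batch size and the 50 GB/s "faster link" are hypothetical placeholders, not measured or vendor-quoted values.

```python
def pcie3_x16_bytes_per_second() -> float:
    """Theoretical one-direction bandwidth of a PCIe 3.0 x16 link."""
    transfers_per_second = 8e9      # 8 GT/s per lane (PCIe 3.0)
    encoding_efficiency = 128 / 130  # 128b/130b line encoding
    lanes = 16                       # assumption: the accelerator uses a x16 slot
    return transfers_per_second * encoding_efficiency * lanes / 8


def transfer_seconds(num_bytes: float, bytes_per_second: float) -> float:
    """Idealised copy time, ignoring latency and protocol overhead."""
    if bytes_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / bytes_per_second


# Hypothetical batch: 256 RGB images of 224 x 224 pixels stored as float32.
BATCH_BYTES = 256 * 3 * 224 * 224 * 4

# Hypothetical higher-bandwidth CPU-to-GPU link (placeholder value).
FASTER_LINK_BYTES_PER_SECOND = 50e9

for label, bandwidth in (
    ("PCIe 3.0 x16 (theoretical)", pcie3_x16_bytes_per_second()),
    ("faster link (assumed)", FASTER_LINK_BYTES_PER_SECOND),
):
    millis = transfer_seconds(BATCH_BYTES, bandwidth) * 1e3
    print(f"{label}: {millis:.2f} ms per batch")
```

Real transfers achieve less than these theoretical rates, so the estimate is best read as a lower bound on copy time.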

Network connectivity

Large-scale, compute-intensive machine learning and DNN techniques also require fast movement of large amounts of data across compute nodes. High-bandwidth, low-latency networking technologies that interconnect compute nodes can accelerate data movement and can enable some DNN models to scale. From a networking perspective, DNN processing compute environments rely on high-bandwidth and low-latency pooling of GPU resources, together with GPUDirect remote direct memory access (RDMA) capabilities.

RDMA-compatible networking stacks enable accelerators to bypass the CPU complex and, as a result, enable high-performance data exchange between accelerator components. Today, InfiniBand (Mellanox), Ethernet (with RoCE v1 or v2), Intel Omni-Path or proprietary network technologies are used for networking.
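The influence of bandwidth and latency on multi-node training can be approximated with a simple cost model. The sketch below estimates the time for one ring all-reduce of a gradient buffer, using the standard result that a ring over N workers takes 2(N - 1) steps and moves 1/N of the buffer per step. The gradient size, node count, link speeds and latencies are all hypothetical values chosen for illustration, not benchmarks of any named interconnect.

```python
def ring_allreduce_seconds(
    message_bytes: float,
    workers: int,
    link_bytes_per_second: float,
    latency_seconds: float,
) -> float:
    """Estimate ring all-reduce time with a latency-plus-bandwidth model."""
    if workers < 2:
        return 0.0  # a single worker has nothing to exchange
    steps = 2 * (workers - 1)               # reduce-scatter plus all-gather
    bytes_per_step = message_bytes / workers
    return steps * (latency_seconds + bytes_per_step / link_bytes_per_second)


# Hypothetical model: 25 million float32 parameters, i.e. 100 MB of gradients.
GRADIENT_BYTES = 25_000_000 * 4
NODES = 8  # hypothetical cluster size

# Hypothetical links: (label, bytes per second, per-message latency in seconds).
links = (
    ("10 Gbit/s, 50 us latency", 1.25e9, 50e-6),
    ("100 Gbit/s, 2 us latency", 12.5e9, 2e-6),
)

for label, bandwidth, latency in links:
    millis = ring_allreduce_seconds(GRADIENT_BYTES, NODES, bandwidth, latency) * 1e3
    print(f"{label}: {millis:.1f} ms per gradient exchange")
```

Because this exchange happens at every training step, the per-step difference between the two assumed links compounds over a full training run.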

Machine learning and DNN frameworks deployed on accelerated compute platforms need to be reconfigured with the right set of libraries and supporting middleware technologies to enable utilisation of accelerators. Integrating these technologies from scratch can be incredibly complex and resource-intensive.

Most system suppliers provide pre-optimised containers for DNN and machine learning frameworks (such as TensorFlow, Caffe, PyTorch, Spark and H2O.ai) to minimise deployment and integration time. Some of these include:

Nvidia GPU Cloud (NGC): Free and compatible with Nvidia GPU-accelerated platforms. NGC can be deployed on most Nvidia GPU-accelerated systems, but only Nvidia’s own compute systems (DGX-1, DGX-2) and some public cloud ecosystems powered by Nvidia GPUs are certified and extensively supported. NGC features Horovod, a distributed training framework for TensorFlow. NGC containers also run in Kubernetes-orchestrated environments.
Bright Cluster Manager for Data Science (BCMDS): Compatible with Nvidia GPU platforms and widely offered by most system suppliers, BCMDS also supports Horovod for distributed deep learning. From an operating expenditure (opex) perspective, this capability is mostly offered as an add-on, and IT leaders should evaluate any associated licensing costs over the lifetime of the systems.
IBM PowerAI Enterprise: Currently only available for IBM Power Systems, PowerAI Enterprise offers pre-optimised stacks of open source AI frameworks, integrated support for distributed deep learning and data scientist productivity tools. These tools span the entire model development process, from data ingest to inference deployment. While some features are free, enterprise-scale usage and support may require additional licences, and consequently IT leaders should evaluate any associated licensing costs required for their ecosystem.
Lenovo Intelligent Computing Orchestration (LiCO): Exclusive to Lenovo, LiCO is software designed to provide simple cluster management and to improve use of infrastructure for AI model development at scale on both Nvidia and Intel processors. While LiCO is free to use, support for it is sold through a per-CPU and per-GPU subscription model, so IT leaders should evaluate any additional licensing costs.

When devising machine learning and DNN infrastructure strategies, ensure that the supplier-provisioned container ecosystem supports the core subset of machine learning and DNN frameworks used in your organisation. Select ecosystems that enable you to test the latest GitHub versions alongside stable versions for iterative A/B testing. DNN and machine learning frameworks are continuously improving, and the latest GitHub versions can address key challenges that stable versions might not yet address. Finally, take into account the opex associated with middleware management and optimise the total cost of ownership.
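A simple way to compare the licensing and support models described above is to total the costs over the expected lifetime of the system. The sketch below does this for two made-up options. Every name and price in it is hypothetical and serves only to show the arithmetic; real quotes from suppliers should replace them.

```python
from dataclasses import dataclass


@dataclass
class StackOption:
    """A hypothetical infrastructure option with its cost components."""
    name: str
    hardware_cost: float
    licence_per_gpu_per_year: float
    support_per_year: float


def total_cost_of_ownership(option: StackOption, gpus: int, years: int) -> float:
    """Hardware cost plus recurring licence and support costs over the lifetime."""
    recurring_per_year = option.licence_per_gpu_per_year * gpus + option.support_per_year
    return option.hardware_cost + recurring_per_year * years


# Hypothetical options and prices, for illustration only.
options = [
    StackOption("Option A (free stack, paid support)", 400_000.0, 0.0, 30_000.0),
    StackOption("Option B (per-GPU subscription)", 380_000.0, 1_500.0, 10_000.0),
]

GPUS = 16   # hypothetical system size
YEARS = 4   # hypothetical system lifetime

for option in options:
    cost = total_cost_of_ownership(option, GPUS, YEARS)
    print(f"{option.name}: {cost:,.0f} over {YEARS} years")
```

The point of the exercise is that a lower purchase price can be outweighed by recurring per-GPU charges once the number of accelerators and the system lifetime are taken into account.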

This article is based on an excerpt from Gartner’s Market guide for machine learning compute infrastructures report by Chirag Dekate and Arun Chandrasekaran.
