The US’s Oak Ridge National Laboratory has deployed the ultimate in high performance computing (HPC) storage, with more than 40PB of DataDirect Networks (DDN) SAN capacity providing 1TB per second (TBps) of throughput to support the world’s fastest supercomputer, Titan.
Titan is operated by the National Center for Computational Sciences (NCCS) and is used by a number of programmes that analyse data in astrophysics, materials research and climate.
Titan has 18,688 compute nodes, each comprising a 16-core AMD Opteron 6274 processor linked to an Nvidia K20X graphics processing unit (GPU), for a total of 299,008 CPU cores. The nodes are housed in the NCCS’s Cray XK7.
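The core count follows directly from the node count. A minimal sketch of that arithmetic, using only the figures quoted above (the function and constant names are illustrative, not taken from any NCCS tooling):

```python
# Figures taken from the article text.
COMPUTE_NODES = 18_688
CORES_PER_NODE = 16  # one 16-core Opteron 6274 per compute node


def total_cpu_cores(nodes: int, cores_per_node: int) -> int:
    """Return the aggregate CPU core count across all compute nodes."""
    return nodes * cores_per_node


total = total_cpu_cores(COMPUTE_NODES, CORES_PER_NODE)
print(f"{total:,} CPU cores")
```

The result matches the 299,008 cores quoted for Titan; GPU cores are not counted here.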
Titan is capable of 27Pflops (27 quadrillion calculations per second) and is the hardware host for the Spider II file system. This is built on Lustre, the open source parallel file system that scales to billions of files.
It was the shift to better-performing hardware, and from Spider I to Spider II, that necessitated the upgrade to Titan’s storage.
Spider II moved from Lustre version 1.8 to version 2.4 and enabled the file system to handle 40PB of capacity, up from 10PB in Spider I.
The move to the beefed-up Titan and new version of Spider with its ability to handle greater capacity meant an upgrade to the storage was required, said Jim Rogers, director of operations at the NCCS.
“We try to maintain a ratio of 50:1 between capacity and memory and the new Titan had pushed us to need 15PB as a minimum. In the end, in negotiation with DDN, we got 100:1,” he said.
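The capacity-to-memory ratio Rogers describes can be sketched as simple arithmetic. The memory figure below is only what the quote implies if the 15PB minimum corresponds exactly to the 50:1 ratio; the article does not state Titan’s memory size, and decimal units (1PB = 1,000TB) are an assumption.

```python
TB_PER_PB = 1000  # ASSUMPTION: decimal units


def implied_memory_tb(capacity_tb: float, ratio: float) -> float:
    """Memory size implied by a storage capacity and a capacity:memory ratio."""
    return capacity_tb / ratio


def capacity_for_ratio(memory_tb: float, ratio: float) -> float:
    """Storage capacity (TB) needed to reach a given capacity:memory ratio."""
    return memory_tb * ratio


# 15PB minimum at 50:1, as quoted.
min_capacity_tb = 15 * TB_PER_PB
memory_tb = implied_memory_tb(min_capacity_tb, 50)

# Capacity that a 100:1 ratio would represent for the same memory.
capacity_100_pb = capacity_for_ratio(memory_tb, 100) / TB_PER_PB

print(f"Implied memory: {memory_tb:.0f}TB")
print(f"Capacity at 100:1: {capacity_100_pb:.0f}PB")
```

Under these assumptions, doubling the ratio from 50:1 to 100:1 doubles the capacity target from 15PB to 30PB.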
The DDN storage comprises 36 DDN SFA12K-40 systems, each with 1.12PB of raw capacity, for a total of around 40PB across 20,150 nearline-SAS drive spindles; that is the equivalent of enough stacked books to reach the moon, apparently. The drives are configured in RAID 6, and the storage connects to hosts via 16,000 ports using FDR InfiniBand.
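As a rough sanity check, the quoted figures can be combined in a few lines. This is a sketch, not the NCCS’s own sizing method. Two things are assumptions and do not come from the article: the RAID 6 group layout of 8 data plus 2 parity drives, and the even split of the 1TBps aggregate throughput across all 36 systems.

```python
# Figures quoted in the article.
SYSTEMS = 36               # DDN SFA12K-40 systems
RAW_PB_PER_SYSTEM = 1.12   # raw capacity per system, in PB
SPINDLES = 20_150          # nearline-SAS drives
AGGREGATE_GBPS = 1000      # 1TBps throughput, in decimal GBps
TB_PER_PB = 1000           # ASSUMPTION: decimal units

# ASSUMPTION: hypothetical RAID 6 group of 8 data + 2 parity drives.
DATA_DRIVES = 8
PARITY_DRIVES = 2


def raid_usable(raw: float, data: int, parity: int) -> float:
    """Usable capacity once parity overhead is removed from raw capacity."""
    return raw * data / (data + parity)


total_raw_pb = SYSTEMS * RAW_PB_PER_SYSTEM
tb_per_spindle = total_raw_pb * TB_PER_PB / SPINDLES
usable_pb = raid_usable(total_raw_pb, DATA_DRIVES, PARITY_DRIVES)
gbps_per_system = AGGREGATE_GBPS / SYSTEMS  # assumes an even split

print(f"Total raw capacity: {total_raw_pb:.2f}PB")
print(f"Average per spindle: {tb_per_spindle:.2f}TB")
print(f"Usable with assumed 8+2 RAID 6: {usable_pb:.2f}PB")
print(f"Throughput share per system: {gbps_per_system:.1f}GBps")
```

The raw total comes to just over 40PB, or roughly 2TB per spindle. The usable figure depends entirely on the assumed group width, so it should be read as illustrative only.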
The NCCS tendered for storage for Titan with a budget of around $10m and performance metrics of 15PB of storage capacity with throughput of 1TBps. It looked at three types of solution: clustered NAS; a SAN back end to which it would provide the front-end file access; and purpose-built appliances using the Lustre file system.
Jim Rogers said: “We found that if we wanted the capacity and the throughput, it drove things past our cost threshold. We had to choose performance or capacity and the best value for our $10m was to get block storage [SAN] where we completed the other elements. There are lots of Lustre appliances in the market but we wanted to see that market mature for four or five years.”
Pre-configured hardware built around the Lustre file system includes EMC’s VNX HPC appliance, the Dell Terascala HPC Storage Solution, DDN’s ExaScaler, and Xyratex’s ClusterStor.
These products are essentially NAS hardware in that they combine storage capacity with file system support and access in one hardware bundle. Clustered NAS products are similar but usually run the supplier’s own parallel file system and are not purpose-built for the highest end of HPC workloads.
The NCCS chose a different route. It has deployed the DDN systems as block-access (SAN) storage, to which it has added a front end of its own: a Lustre-based file system and the processing power that requires.
Titan replaced Jaguar, which had been the world’s fastest supercomputer in 2009/2010 with 2Pflops of operating performance.