Genomics England opted to buy out-of-the-box storage from a mainstream supplier for its high-performance computing (HPC) project, rather than build its own from commodity hardware and an open-source file system.
The 100,000 Genomes Project was set up to explore how NHS patients could benefit from genomics – the study of the structure, function and meaning of DNA – including looking at how diseases are inherited through genes passed down from one generation to the next.
The project will sequence 100,000 genomes from around 70,000 people, including NHS patients with rare diseases, plus their families, and patients with cancer.
At its core, the compute element of the project consists of pattern matching between very large files.
This dictated a requirement for a massively scalable parallel file system to which capacity and performance could be added while it remained a single file system.
This could be achieved with an off-the-shelf clustered NAS system or by building an open-source file system on commodity x86 server hardware.
Often in high-performance computing projects, file sizes are quite small and data doesn’t need to be kept close to compute nodes. But in this case it made sense to keep storage very close to the CPUs, so that files of up to 240GB each did not have to be transferred and data could remain in place.
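To give a rough sense of why moving files of this size is worth avoiding, the short sketch below estimates the idealised time to transfer a single 240GB file over a network link. The link speeds are illustrative assumptions, not figures from the project, and the calculation ignores protocol overhead and contention.

```python
def transfer_seconds(size_gb: float, link_gbit_per_s: float) -> float:
    """Idealised time to move size_gb gigabytes over a link.

    Ignores protocol overhead, contention and disk speed, so real
    transfers would take longer.
    """
    size_gbit = size_gb * 8  # 8 bits per byte
    return size_gbit / link_gbit_per_s


FILE_SIZE_GB = 240  # upper file size quoted in the article

# Assumed link speeds in Gbit/s (hypothetical, not from the article).
for link in (1, 10, 40):
    seconds = transfer_seconds(FILE_SIZE_GB, link)
    print(f"{link} Gbit/s: {seconds:.0f} s")
```

Under these assumptions a single file takes 1,920 seconds (32 minutes) at 1 Gbit/s and 192 seconds at 10 Gbit/s, before any processing starts.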
Reviewing the options
So, with these requirements in mind, Genomics England and Capita S3 canvassed suppliers for products and spoke to similar projects to see what had worked for them.
Mark Smith, managing director at Capita S3, said: “Head of informatics infrastructure at Genomics England, David Brown, did a beauty parade of storage suppliers and also went to peers to see what they had to say. He talked to the European Bioinformatics Institute (EBI) about how they got started with an open-source parallel file system.
“This had provided a low-cost initial investment but was high on management overhead and, as the system scaled, the management overhead had become overly burdensome. EBI suggested a turnkey product might be suited to an organisation with few IT resources like Genomics England.”
Following this evaluation process, Capita S3 deployed around 7PB of EMC Isilon clustered NAS across two mirrored sites, at Corsham in Wiltshire and at Farnborough. The Isilon installation is split roughly equally between nearline (NL) nodes with nearline-SAS drives and general purpose (GP) nodes with a mix of flash storage and nearline-SAS drives.
Smith said: “What we’ve built is an infrastructure where strings of data can be processed in place. It’s a relatively straightforward processing task. Data never leaves the storage and this contributes to a good degree of security also, which is important.”
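As a loose illustration of processing strings of data in place, the sketch below counts occurrences of a byte pattern by reading through a file in fixed-size chunks, rather than copying the whole file elsewhere or loading it into memory. It is a simplified stand-in, not the project's actual matching method; the function name `count_pattern`, the chunk size and the sample sequence are all hypothetical.

```python
import io


def count_pattern(stream, pattern: bytes, chunk_size: int = 1 << 20) -> int:
    """Count (possibly overlapping) occurrences of pattern in a binary stream.

    The stream is read chunk by chunk. The last len(pattern) - 1 bytes of
    each window are carried over so that matches spanning a chunk boundary
    are still found, and no match is counted twice.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    overlap = len(pattern) - 1
    tail = b""
    count = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        window = tail + chunk
        pos = window.find(pattern)
        while pos != -1:
            count += 1
            pos = window.find(pattern, pos + 1)
        # Keep only the bytes that could begin a match in the next window.
        tail = window[-overlap:] if overlap else b""
    return count


# Hypothetical sample data; a tiny chunk size forces boundary-spanning matches.
sample = io.BytesIO(b"GATTACAGATTACA")
print(count_pattern(sample, b"GATTACA", chunk_size=4))
```

In practice the stream would be a file opened in binary mode on the shared file system, so memory use stays bounded by the chunk size regardless of file size.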
Patient confidentiality and protection are a cornerstone of the project, with data stored and accessed under strict research governance and ethical frameworks. Data is held in a secure, monitored environment, and Genomics England will grant access to it only for specific, approved purposes.