UK bioinformatics team builds research-sharing cloud on OpenStack

Genomics researchers must replicate the hardware and software environment of bioinformatic data analysis to reproduce results – which is where cloud can help

Four UK universities have joined forces to tackle the problem of sharing genomics research effectively.

While researchers can collaborate and share genomics data, analysing that data often requires bespoke bioinformatics software running on a custom hardware configuration. Unless researchers can replicate the exact hardware and software environment used in the original experiment, reproducing its results may be impossible.

"One of the issues increasingly faced by bioinformaticians is the ability to share bespoke software applications. What may work on one high-performance computer (HPC) system won't necessarily work on another – it's often quicker to write a new application from scratch," says Cardiff University senior lecturer Tom Connor, who has designed cloud-based infrastructure to overcome this challenge.

Working in conjunction with high-performance data-modelling systems integrator OCF, Connor – along with researchers from the University of Birmingham, the University of Warwick, Cardiff University, and Swansea University – developed a cloud-based infrastructure for microbial bioinformatics research.

The Cloud Infrastructure for Microbial Bioinformatics (Climb) project aims to create the world’s largest single system dedicated to microbial bioinformatics research.

Custom coding is widely used in bioinformatics research. According to Connor, this coding enables the researchers to integrate existing software applications, allowing them to develop new ways of thinking and working. He says: "You subdivide a bacterial population. If you have two bacterial strains – one causes an invasive disease; the other is so serious – you may want to place the genomic sequencing data into one of two buckets."
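As a rough illustration of the kind of small custom script Connor describes, the Python sketch below sorts sample identifiers into buckets by strain label. The sample IDs, the labels and the function name `bucket_by_strain` are all hypothetical, invented for this example, and are not taken from Climb or any real dataset.

```python
from collections import defaultdict

# Hypothetical sample records: (sample_id, strain_label).
# These are illustrative placeholders, not real sequencing data.
SAMPLES = [
    ("sample-001", "invasive"),
    ("sample-002", "other"),
    ("sample-003", "invasive"),
]


def bucket_by_strain(samples):
    """Group sample IDs by their strain label, preserving input order."""
    buckets = defaultdict(list)
    for sample_id, label in samples:
        buckets[label].append(sample_id)
    return dict(buckets)


if __name__ == "__main__":
    for label, ids in bucket_by_strain(SAMPLES).items():
        print(label, ids)
```

In practice the label would come from an analysis step rather than being supplied by hand, which is where the bespoke code tends to grow.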

The code – often written in Perl or Python – is shared by researchers, who upload the programs they develop to the code-hosting site GitHub. He says: "If you do search for bioinformatics on GitHub, you will get 4,000 hits."

Sharing software for experiments

But while bioinformatics researchers are very good at sharing software and data, Connor says it is very hard to make the code re-usable. Cloud computing enables researchers to share both the data or application and the IT environment it requires. "We think that using cloud as a large shared infrastructure provides us with a way to share software. You can share a snapshot of the environment and use a VM or containerise to modularise the database or the software, which then becomes an object that can be re-used," Connor explains.
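To make the idea of an environment snapshot concrete, here is a minimal Python sketch that records the interpreter, platform and installed package versions in a manifest, then fingerprints and compares manifests. It is an assumption-laden illustration only: Climb shares whole VMs or containers, which capture far more than this, and the function names (`capture_environment`, `fingerprint`, `diff_packages`) are hypothetical.

```python
import hashlib
import json
import platform
from importlib import metadata


def capture_environment():
    """Collect interpreter, platform and installed-package versions."""
    pins = {
        f"{dist.metadata.get('Name')}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata.get("Name")
    }
    return {
        "python": platform.python_version(),
        "system": platform.system(),
        "machine": platform.machine(),
        "packages": sorted(pins),
    }


def fingerprint(manifest):
    """Hash a manifest so two environments can be compared quickly."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def diff_packages(ours, theirs):
    """Return package pins present in one manifest but not the other."""
    return sorted(set(ours["packages"]) ^ set(theirs["packages"]))


if __name__ == "__main__":
    manifest = capture_environment()
    print(fingerprint(manifest))
```

Two researchers could exchange manifests and run `diff_packages` to see why a script behaves differently on each system. A shared VM or container image avoids the problem altogether by shipping the environment itself.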

Connor believes that, by sharing their software and data, bioinformaticians can spend more time doing research and less time installing software and downloading data from multiple, disparate repositories.

He says: "There is a skill gap in biological sciences. Researchers don’t have the time to upskill. A lot of colleagues do not want to use the command line. They want a GUI."

In Connor's experience, researchers often run individual servers and do not worry too much about data protection, server failure or disaster recovery. He says: "Climb provides all this out of the box. The system takes care of all admin tasks."

The system is powered by OpenStack cloud computing software and supplied by OCF, a provider of HPC, big data and predictive analytics. One site’s system, at the University of Birmingham, is already in production running OpenStack Juno and will soon be linked directly to the system at Cardiff.

The system comprises 7,500 virtual CPUs. The other two sites are currently undergoing final testing by OCF before entering production.

The hardware powering Climb

In terms of hardware, Climb comprises HPC clusters across the four universities, built on Lenovo System x servers and IBM Spectrum Scale storage, with 500TB of local storage at each site connected through 56Gbps InfiniBand. Red Hat will connect these systems using OpenStack to create the "cloud" part of the system.
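For a sense of scale, the figures quoted above can be run through some back-of-envelope arithmetic. The Python sketch below assumes decimal units (1TB = 10^12 bytes) and treats the 56Gbps figure as fully usable bandwidth, which real links never quite deliver, so the transfer time is a best-case estimate rather than a measured one. The function names are hypothetical.

```python
SITES = 4                # universities in the Climb project
LOCAL_STORAGE_TB = 500   # local storage per site, as quoted
LINK_GBPS = 56           # quoted InfiniBand rate, assumed fully usable


def total_storage_tb(sites=SITES, per_site_tb=LOCAL_STORAGE_TB):
    """Combined local storage across all sites, in TB."""
    return sites * per_site_tb


def transfer_hours(size_tb, link_gbps=LINK_GBPS):
    """Best-case hours to move size_tb over a link at full line rate."""
    bits = size_tb * 1e12 * 8            # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9)   # gigabits per second to bits per second
    return seconds / 3600


if __name__ == "__main__":
    print(f"Total local storage: {total_storage_tb()} TB")
    print(f"Moving 500 TB at line rate: {transfer_hours(500):.1f} hours")
```

On these assumptions the four sites hold 2,000TB of local storage between them, and reading one site's full 500TB at line rate would take a little under 20 hours.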

Connor says the system is already helping to track viral and bacterial pathogens, develop new diagnostics and increase medical understanding of bacterial resistance to antibiotics.

One of the early adopters of Climb is the Medical Research Council system. Connor adds: "We are in an early adopter phase, testing and tuning the system and working to present the service using an AWS-like dashboard."

Public Health England is also looking to put its routine genome sequence data for bacterial diseases into a Climb VM.