Virginia Tech powers DNA analysis with PC parallel computing and Azure

Virginia Tech has used the Microsoft Azure cloud, together with parallel laptop computing, to run DNA sequence analysis

Virginia Polytechnic Institute and State University, more often known as Virginia Tech, is using high-performance computing (HPC) in the Microsoft Azure cloud, together with parallel laptop computing, to support a cancer research programme.

Wu Feng, a professor of computer science at Virginia Tech, said: “We looked at trying to empower cancer biologists to tackle problems they would not be able to tackle by unleashing the power of parallel computing to the masses, enabling discoveries to be made faster.”

Feng said the computer industry is unable to keep up with the data processing requirements of DNA sequencing. “DNA sequencing has accelerated faster than we can compute. The amount of data coming out of DNA sequencing doubles every six months, while computing power only doubles every 24 months.”
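To make the gap concrete, the two doubling periods Feng quotes can be converted into growth factors over the same time span. The short Python sketch below is illustrative arithmetic only: the function name and the 24-month horizon are assumptions chosen for the example, not anything from Virginia Tech's software.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Growth multiple after `months` for a quantity that doubles
    every `doubling_period_months` (simple exponential model)."""
    return 2.0 ** (months / doubling_period_months)


# Horizon chosen for illustration: one compute doubling period.
HORIZON_MONTHS = 24

# Doubling periods as quoted by Feng.
data_growth = growth_factor(HORIZON_MONTHS, 6)      # sequencing data
compute_growth = growth_factor(HORIZON_MONTHS, 24)  # computing power

# How far data volume pulls ahead of compute over the horizon.
gap = data_growth / compute_growth

print(f"Data growth over {HORIZON_MONTHS} months:    {data_growth:.0f}x")
print(f"Compute growth over {HORIZON_MONTHS} months: {compute_growth:.0f}x")
print(f"Shortfall factor:                 {gap:.0f}x")
```

Under this simple model, the shortfall itself keeps growing for as long as both doubling periods hold.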

The university has used Microsoft Azure to enable Feng and the team of researchers to keep up with DNA sequencing data growth.

“When I first started this research, we doubled DNA data every 12 months, then it went to nine months, then to six,” said Feng.

From a computational perspective, quadrupling IT capacity to keep pace is cost-prohibitive, because the DNA researchers’ requirements outpace the economic model based on Moore’s Law, which the IT industry generally follows.

While it is possible to buy twice as much computing for the same financial outlay every 18 months to two years, as stipulated by Moore’s Law, the researchers needed four times as much IT every two years, according to Feng.

“We needed to look at innovative tools to avoid having to quadruple IT resources every six months,” he said.

An exhaustive search on the genome involves a medium-sized dataset, but Feng said processing involves “big compute and big output data”. He said next-generation DNA sequence analysis will involve big input data, big compute, and the application could output anything from small to very large datasets.

The university tried building a proof-of-concept application on its supercomputer, HokieSpeed, a 400-processor system with 2,400 CPU cores and 400 GPU cards, each with 400 cores. The machine made the Top500 supercomputer list a few years ago but, according to Feng, it timed out when the researchers tried to run the proof-of-concept DNA analytics application.

Instead, Virginia Tech has deployed laptop-based parallel computing plus analytics on the Azure cloud to increase its HPC capability by 50%. Rather than use a supercomputer, Feng’s team developed a hybrid cloud prototype application, in which the laptop’s parallel CPU cores are used for pre-processing, while the application concurrently transfers the data to Azure and processes it using HDInsight, Microsoft’s cloud-based Hadoop distribution.
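The general pattern described here, pre-processing on local cores while finished data is transferred at the same time, can be sketched in a few lines of Python. This is a minimal illustration and not Feng’s team’s code: `preprocess`, `run_pipeline` and the `upload` callback are hypothetical names, compression stands in for whatever pre-processing the real application performs, and the upload is a stub rather than a call to any Azure API.

```python
import queue
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor


def preprocess(chunk: bytes) -> bytes:
    """Hypothetical pre-processing step: compress one chunk of data."""
    return zlib.compress(chunk)


def run_pipeline(chunks, upload, workers=4):
    """Pre-process chunks in a worker pool while a separate thread
    hands finished chunks to `upload` (a stand-in for a cloud transfer)."""
    ready = queue.Queue()
    done = object()  # sentinel that tells the uploader to stop

    def uploader():
        while True:
            item = ready.get()
            if item is done:
                break
            upload(item)

    transfer = threading.Thread(target=uploader)
    transfer.start()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order as they become available,
        # so uploading starts before all chunks have been pre-processed.
        for result in pool.map(preprocess, chunks):
            ready.put(result)
    ready.put(done)
    transfer.join()


# Usage with made-up data and a list standing in for the remote store.
uploaded = []
run_pipeline([b"ACGT" * 1000, b"TTGA" * 1000], uploaded.append)
print(len(uploaded), "chunks transferred")
```

Threads keep the sketch short. CPU-bound pre-processing written in pure Python would more likely use a process pool to occupy several cores.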

“We took next-generation sequencing software then ran that on a laptop or PC using the Azure cloud and Hadoop tools,” he said.

Data from a single genome can be compressed to 2.4GB. Given the university’s 1Gbps link and 40Gbps network trunk, Feng estimated that it would be possible to process 100 genomes a day, representing 240GB of DNA data being uploaded to Azure.
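A back-of-the-envelope check shows how that daily volume compares with the 1Gbps link. The sketch below assumes decimal units (1GB = 8 gigabits), treats the 1Gbps link as the bottleneck, and uses a hypothetical `efficiency` parameter to model throughput lost to protocol overhead and contention; none of these assumptions comes from Feng’s estimate.

```python
GENOME_SIZE_GB = 2.4   # compressed size of one genome, from the article
GENOMES_PER_DAY = 100  # Feng's estimated daily throughput
LINK_GBPS = 1.0        # university link speed, from the article


def upload_seconds(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Seconds needed to send `size_gb` gigabytes over a `link_gbps` link.

    `efficiency` is an assumed fraction of line rate actually achieved.
    """
    gigabits = size_gb * 8
    return gigabits / (link_gbps * efficiency)


daily_gb = GENOME_SIZE_GB * GENOMES_PER_DAY
ideal = upload_seconds(daily_gb, LINK_GBPS)
halved = upload_seconds(daily_gb, LINK_GBPS, efficiency=0.5)

print(f"Daily upload volume: {daily_gb:.0f}GB")
print(f"At full line rate:   {ideal / 60:.0f} minutes")
print(f"At 50% efficiency:   {halved / 60:.0f} minutes")
```

On these assumptions the link has ample headroom for the daily upload, which suggests the transfer to Azure is unlikely to be the limiting step.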
