
Big data platform to speed up DNA analysis

Belgium’s Interuniversity Microelectronics Centre has developed a big data platform that can analyse DNA data up to 16 times faster than current techniques

Researchers at Belgium’s Interuniversity Microelectronics Centre (Imec) have developed a big data platform that can analyse DNA data up to 16 times faster than current methods.

Called elPrep5, the platform is aimed at users in the pharmaceutical industry, scientific research, medical laboratories, sequencing service providers, sequencing suppliers and hospitals.

Imec said elPrep5 can perform DNA analyses, from data preparation to variant calling, eight to 16 times faster than the widely used Genome Analysis Toolkit (GATK) on similar hardware.

Much of the speed improvement comes from variant calling, the process in which DNA fragments reconstructed from the original sample are analysed to detect genetic variants.

This analysis is computationally heavy. Despite substantial reductions in the cost of DNA analysis over the past decade, processing a whole genome can still take two to three days. Imec claimed that elPrep5 can perform a whole genome analysis within a few hours without compromising the quality of the output.

Roel Wuyts, a principal scientist at Imec, said elPrep5’s performance advantage came from a rewrite of the sequencing pipeline. “elPrep’s software architecture internally fuses the execution of the user-selected steps, highly parallelises the algorithms and implementations of these steps, and fully takes advantage of large memory capacities when they are available. It is especially the integration of these three techniques that results in this fast execution speed.

“This integration also implies that a whole sequencing pipeline is formulated by the end user as a single command-line invocation. This makes elPrep5 significantly easier to use than the more common approach of having to script multiple different tools to implement the various steps of a pipeline,” he added.
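The idea Wuyts describes, fusing the selected steps so each record is visited once and then spreading that work across cores, can be illustrated with a small Go sketch. This is not elPrep's actual code: the `Read` type, the `Step` signature and the two toy steps are hypothetical stand-ins chosen only to show the general technique, and real pipeline steps such as duplicate marking or variant calling are far more involved.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Read is a hypothetical, heavily simplified stand-in for a sequencing read.
type Read struct {
	Name      string
	Bases     string
	Duplicate bool
}

// Step transforms one read in place and reports whether it should be kept.
type Step func(r *Read) bool

// Fuse composes the selected steps into a single function, so each read is
// visited once instead of once per step with intermediate results in between.
func Fuse(steps ...Step) Step {
	return func(r *Read) bool {
		for _, s := range steps {
			if !s(r) {
				return false
			}
		}
		return true
	}
}

// RunParallel applies step to reads (modifying them in place) using up to
// `workers` goroutines, each handling one contiguous chunk held in memory.
// The surviving reads are returned in their original order.
func RunParallel(reads []Read, step Step, workers int) []Read {
	if workers < 1 {
		workers = 1
	}
	keep := make([]bool, len(reads))
	chunk := (len(reads) + workers - 1) / workers
	var wg sync.WaitGroup
	for start := 0; start < len(reads); start += chunk {
		end := start + chunk
		if end > len(reads) {
			end = len(reads)
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				keep[i] = step(&reads[i]) // each goroutine touches only its own indices
			}
		}(start, end)
	}
	wg.Wait()

	out := make([]Read, 0, len(reads))
	for i, k := range keep {
		if k {
			out = append(out, reads[i])
		}
	}
	return out
}

// Two toy steps, assumed purely for illustration.
func upperCase(r *Read) bool      { r.Bases = strings.ToUpper(r.Bases); return true }
func dropDuplicates(r *Read) bool { return !r.Duplicate }

func main() {
	reads := []Read{
		{Name: "r1", Bases: "acgt"},
		{Name: "r2", Bases: "ttga", Duplicate: true},
		{Name: "r3", Bases: "ggca"},
	}
	pipeline := Fuse(upperCase, dropDuplicates)
	for _, r := range RunParallel(reads, pipeline, 2) {
		fmt.Println(r.Name, r.Bases)
	}
}
```

In a scripted multi-tool pipeline, each step would typically read and write its own intermediate file. The sketch instead keeps everything in memory and makes a single pass, which is the general shape of the design choice described in the quote.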


elPrep5 is written in Go, an open-source programming language developed by Google, and can be run on standard servers that most hospitals have locally or in the cloud. The choice of Go arose out of a study which found that it has the best balance between runtime performance and memory use compared to C++ and Java.

“This is the breakthrough we have been anticipating for years. Finally, we can run the entire DNA analysis pipeline with a single software platform solution, and faster than ever,” said Imec researcher Charlotte Herzeel.

“Because variant calling is the most complex step, gathering results up to 16 times faster than the previous method has resulted in a four- to nine-fold reduction in time, all while retaining GATK-identical results.

“For the medical sector, this allows massive efficiency gains because the time between sampling and diagnosis dramatically decreases and doctors can run analyses overnight. Moreover, since many hospitals run their analyses via rented cloud solutions, the reduced throughput times can immediately result in a cost reduction per analysis,” she added.

As a research organisation, Imec delivers new technologies to partners through intellectual property licensing. elPrep is available under an open-source licence, as well as a premium licence that lifts the open-source restrictions and offers priority support.

“This means users can support continued development by giving feedback or contributing code for the open-source version or supporting us financially. As such we can realise further elPrep improvements and further expand its functionalities,” said Wuyts.

According to technology research firm IDC, APAC revenues for big data and analytics solutions are tipped to hit $22.6bn in 2020, with three in four enterprises planning to keep or increase their big data analytics investments this year.
