Cracking genomic codes with the cloud

Australian researchers are using Amazon’s Lambda serverless computing service to solve pressing health problems

With the genomes of half the world’s population expected to be sequenced by 2025, scientists are bracing themselves for the staggering volume of data that they will need to deal with.

All that genomic data, predicted to grow at a rate of 30 exabytes per year, is already being used by researchers such as Denis Bauer to unlock the genomic codes behind diseases such as amyotrophic lateral sclerosis (ALS), which affected renowned British theoretical physicist Stephen Hawking.

Crunching genomic data can be a tedious process. With the human genome comprising three billion DNA “letters”, singling out genes that cause diseases such as ALS across a large sample size is akin to searching for a needle in a haystack.

According to Bauer, an internationally accredited bioinformatics researcher and team leader at Australia’s Commonwealth Scientific and Industrial Research Organisation (CSIRO), previous machine learning techniques could not cope with the sheer volume of genomic data.

“Google’s Planet algorithm, for example, is good at solving machine learning tasks involving hundreds of thousands of samples with up to, say, 1,000 data points per sample,” Bauer said. “But we have three billion data points per sample.”

To overcome that limitation, Bauer and her team at CSIRO created VariantSpark, a machine learning library that can be used to analyse genomic data in real time using the Apache Spark engine for big data processing. VariantSpark can also be used to crunch data in other applications such as transcription.

With the disease-causing genes identified and analysed, the next step is to test the use of a genome engineering technology called CRISPR to edit the genes that cause certain diseases in humans. This delicate task has to be performed with a high level of precision, with no room for mistakes.

To improve success rates, Bauer said it is necessary to speed up the process of identifying where gene-editing can be performed.

“Doing that for one gene is easy and can be done in seconds through parallelisation. But it’s hard to do that for all the genes in the human genome using a web service,” she said, noting that with the Amazon Web Services (AWS) Lambda serverless computing service, it is now possible to “trigger many functions in parallel easily and cheaply enough”.

That said, Bauer, who will be speaking at the YOW! 2017 conference in Sydney, admitted that there are limitations to the Lambda service. “There are only that much data and requests that you can process with Lambda functions, so we had to come up with workarounds to parallelise our workloads,” she said.
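
The article does not describe the team's actual workarounds, but the general idea of splitting a genome-wide job into small pieces and running them side by side can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `score_batch` is a hypothetical local stand-in for a single serverless function invocation, the gene identifiers are placeholders, and the batch size is arbitrary rather than an actual Lambda limit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def score_batch(genes: List[str]) -> Dict[str, int]:
    """Hypothetical stand-in for one serverless function invocation.

    A real deployment would hand the batch to a deployed function;
    here each gene just gets a toy value (the length of its name).
    """
    return {gene: len(gene) for gene in genes}


def fan_out(
    genes: Sequence[str],
    batch_size: int,
    worker: Callable[[List[str]], Dict[str, int]] = score_batch,
    max_workers: int = 8,
) -> Dict[str, int]:
    """Run `worker` on every batch in parallel and merge the partial results."""
    results: Dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each batch stays small so that no single call exceeds
        # whatever per-request limit the platform imposes.
        for partial in pool.map(worker, batches(genes, batch_size)):
            results.update(partial)
    return results


if __name__ == "__main__":
    placeholder_genes = ["GENE_%d" % i for i in range(5)]
    print(fan_out(placeholder_genes, batch_size=2))
```

Keeping each batch small is what lets a large job fit within per-request limits, while the thread pool plays the role of many functions being triggered at once. In a cloud setting, the worker would be replaced by a call to the provider's invocation API.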

Bauer’s team has had to find ingenious ways to adapt their complex research work to what AWS and others are offering, because cloud providers mostly cater to generic use cases.

Besides crunching genomic data, Bauer is also using AWS to share data with other researchers around the world in a secure manner.

“The data that we upload to S3 storage is encrypted and stays encrypted, and it only gets decrypted on the compute node. We can also have a log file and log audit report to prove that no one else has access to the processing pipeline,” Bauer said.

Moving forward, Bauer does not reckon the likes of AWS will start to offer specialised cloud services targeted at researchers, who are often at the forefront of knowledge and technology.

“Even if a cloud provider were to satisfy our needs right now, we would have vastly different requirements tomorrow. That’s also why we were one of the first to adopt cloud and renting the latest technology is what we do,” she said.
