Australian genome researchers solving big data problems

Researchers at the Commonwealth Scientific and Industrial Research Organisation have developed a cloud-based tool to look for the needle in a haystack in what they have described as the "datafication of everything".

This article can also be found in the Premium Editorial Download: CW ANZ: Prepare for EU data law

Australian genome researchers have developed solutions to tackle big data challenges orders of magnitude more complex than anything previously possible.

Speaking at the YOW! 2017 software developers conference in Sydney, Denis Bauer, senior research scientist and research team leader in bioinformatics at Australia’s Commonwealth Scientific and Industrial Research Organisation (CSIRO), said the focus on human health had led to “experiments not previously possible using big data”.

To explain the sheer scale of the data problem, Bauer said that although YouTube was estimated to hold two exabytes of data and Twitter one exabyte by 2025, there would be 20 exabytes of genomic data by that time, generated as people sought better health outcomes.

CSIRO has been working on a project that looks at genomic markers of amyotrophic lateral sclerosis (ALS), also known as motor neuron disease. To do that, it has had to comb through 22,000 DNA profiles, or 1.7 trillion data points, to look for the “needle in a haystack” that Bauer said would identify someone as predisposed to ALS.
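To put those figures in perspective, the short calculation below works out how many data points each DNA profile contributes on average. It is a back-of-the-envelope sketch only: it assumes the data points are spread evenly across the profiles, and the one-byte-per-data-point storage figure is a hypothetical chosen for illustration, not something CSIRO has stated.

```python
# Figures quoted in the article.
profiles = 22_000          # DNA profiles in the ALS project
data_points = 1.7e12       # 1.7 trillion data points in total

# Assumption: data points are spread evenly across profiles.
points_per_profile = data_points / profiles
print(f"Data points per profile: about {points_per_profile / 1e6:.0f} million")

# Hypothetical storage cost: 1 byte per data point (illustrative only).
bytes_per_point = 1
total_terabytes = data_points * bytes_per_point / 1e12
print(f"Dense storage at {bytes_per_point} byte per point: about {total_terabytes:.1f} TB")
```

On these assumptions, each profile accounts for roughly 77 million data points, which gives a sense of why the search is described as looking for a needle in a haystack.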

Working with big data and cloud architect Lynn Langit, also a speaker at YOW!, CSIRO has developed and deployed VariantSpark to search through the genomic data. Essentially a machine learning platform for genomic variants that reduces the risk of false positives, VariantSpark would also be useful for other massive data analysis, Langit and Bauer said.

The advent of the internet of things (IoT) and what Langit described as the “datafication of everything” would require the use of a tool such as VariantSpark, which has been deployed on Amazon Web Services (AWS) and could slash the time to perform exploratory analytic work from hours to minutes.

Bauer and her team plan to make VariantSpark available as open source code on Amazon Marketplace in 2018. In the longer term, CSIRO plans to develop a range of commercial tools and services around it, she said.

Meanwhile, CSIRO is also developing computational tools to screen embryos for particular disease markers. “Think of this as a search engine for the genome,” Bauer said.

Bauer and Langit delivered the opening keynote for YOW!, which has carved out a strong reputation for delivering cutting-edge software development insights from well-credentialed global speakers.

Other high-profile presenters at the Sydney event this year included Woody Zuill, a pioneer of mob programming, in which developers write code together on a single PC.

Zuill has also championed the #noestimates movement on Twitter, believing that estimating the cost, time, risk and value of development work is not very helpful to successful software development.

Home-grown speakers at the event included Data61’s Tony Morris, who addressed functional programming in aviation and tackling data management problems for the sector, as well as REA Group’s Ken Scambler, who detailed the company’s four-year initiative in functional programming.

Greer Lucas, digital technology capability manager at NAB, which sponsored the conference, said the event was timely as investment in software was now at a scale not seen since Y2K and companies were “going into cool things and not just maintenance”.