Genomics big data compels IT to rediscover efficiency techniques

Genomics is generating big data at such a scale that research institutes such as the Sanger in Cambridge are having to rediscover storage techniques from the past

This article can also be found in the Premium Editorial Download: Computer Weekly: The most influential people in UK IT 2015

In the sequencing centre at the Wellcome Trust Sanger Institute near Cambridge sit seven capillary DNA sequencing machines. They were among those used in the late 1990s and early 2000s to create the first map of the human genome, a scientific milestone jointly announced by prime minister Tony Blair and US president Bill Clinton. But the machines – which appear to be running on Windows XP – will be decommissioned in the autumn of 2015.

This isn’t because the Sanger is getting out of the business of sequencing DNA, but because the world-leading lab is moving to faster equipment made by Illumina. The lab already has ranks of HiSeq 2500s and HiSeq Xs, the latter costing $1m each. The new machines run Firefox on Windows 7 and feature a horizontal line of flashing blue lights reminiscent of KITT, the intelligent car in the TV series Knight Rider. They can sequence thousands of human genomes every year, with each machine churning out more than half a terabyte of data every day.

Genomics is shaping up to be the biggest generator of big data. A paper published in PLoS Biology in July 2015 predicted that, by 2025, astronomy research, in which telescopes have been producing vast amounts of data for more than two decades, will require an exabyte of storage each year. But genomics will need between two and 40 exabytes annually, as DNA sequencing becomes a standard part of medicine for hundreds of millions of patients every year.

The extraordinary computing requirements of genomics are leading those who run its IT to rediscover efficiency techniques that had become largely irrelevant as computing power increased under Moore’s Law. A recent conference, hosted by the Sanger Institute in Cambridge and sponsored by storage supplier DDN, discussed some of these efficiency techniques – as well as regulatory threats.

The German Cancer Research Centre is, like the Sanger Institute, investing in Illumina HiSeq Xs. Referring to the PLoS Biology paper, head of data management for theoretical bioinformatics Jürgen Eils told the event that the centre will soon generate 11TB a day, nearly as much as the 12TB produced by Twitter worldwide. He described IT as “a major bottleneck”. The centre is speeding up its in-house network to 40 Gbps, but is finding problems in input/output processes. “We are more waiting than running sometimes,” Eils says.

Recoding pros and cons

“Informatics is the bottleneck of genomics,” agrees Shane Corder, high performance computing systems engineer at the Children’s Mercy Hospital in Kansas City in Missouri. He says one in six of the children in Kansas City hospitals is there because of a genetic disease, and that speeding up genetic tests – some of which have taken months without the use of next-generation sequencing carried out with equipment such as HiSeqs – can provide faster diagnoses at a much lower cost. These can be life-changing: Corder says one new-born child in the intensive care unit, who looked likely to die, had a 48-hour genome test which identified a treatment that let him go home and live a relatively normal life. But such rapid tests rely on fast computing.

Efficiency techniques look likely to provide part of the answer. Robert Esnouf, head of the Research Computing Core at the Wellcome Trust Centre for Human Genetics at the University of Oxford, talks about the challenges of scaling up “academic codes” to thousands of genomes. Routines that were developed by researchers on workstations for use on tens of genomes are now used at industrial scales for which they were never designed.

But rewriting is not always the answer, he warns: “The problem with recoding is that it’s slow and risky. If you change code you may not be sure you get the same results unless you rerun all your old research. Changing the statistical guts of these codes is often not feasible.” Instead, Esnouf says tuning other things can make big differences quickly, such as adjusting the sizes of caches to cope efficiently with the maximum number of open files: “A lot of memory is often the best thing you can have,” he says.

Older code often tries to write tiny amounts to disk on a regular basis, which Esnouf says is a reasonable thing to do on a MacBook but very inefficient on a high-performance system. The answer may be to divert such requests to RAM instead, or force buffering behind the scenes. Even file format choices become important. Zip files always have to unzip from the start, but there are alternatives that can be unzipped from a specific point.
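The article does not name the alternative formats, but the general idea can be sketched as follows: compress the data as independent blocks and keep an index of where each block starts, so a reader only decompresses the blocks it needs. BGZF, the blocked gzip variant used for BAM files in genomics, works on a similar principle. The function names and the block size below are illustrative assumptions, not any particular tool's API.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block; an assumed, tunable value


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress data as independent blocks and return (blob, index)."""
    blob = bytearray()
    index = []  # index[i] = (offset, length) of compressed block i inside blob
    for start in range(0, len(data), block_size):
        packed = zlib.compress(data[start:start + block_size])
        index.append((len(blob), len(packed)))
        blob += packed
    return bytes(blob), index


def read_range(blob: bytes, index, offset: int, length: int,
               block_size: int = BLOCK_SIZE) -> bytes:
    """Return data[offset:offset + length], decompressing only overlapping blocks."""
    if length <= 0:
        return b""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    pieces = []
    for i in range(first, min(last, len(index) - 1) + 1):
        pos, size = index[i]
        pieces.append(zlib.decompress(blob[pos:pos + size]))
    chunk = b"".join(pieces)
    skip = offset - first * block_size  # bytes to drop from the first block
    return chunk[skip:skip + length]
```

The trade-off is a slightly worse compression ratio, because each block is compressed without reference to its neighbours, in exchange for reads that touch only a small part of a very large file.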

Repacking job requests can make a huge difference, Esnouf says. One user wanted to run 1.8m jobs, taking more than 2,000 core years. The staff at the centre cut this to 27 core years through rearranging the scripts, searching out optimised libraries and distributing data across a pool of RAM disks. Optimising the run time of jobs can also help. It takes the centre’s systems about 15 seconds to start each one, so running half a million one-second jobs is fantastically inefficient. “Batching the jobs with simple scripting changes can recover all this lost performance,” he says.
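The arithmetic behind that claim can be sketched in a few lines. The 15-second startup cost and the half a million one-second jobs come from Esnouf's example; the batch size of 1,000 and the function names are assumptions made for illustration, and the model ignores queueing and scheduling effects.

```python
import math


def total_core_seconds(n_jobs: int, work_seconds: float,
                       startup_seconds: float, batch_size: int = 1) -> float:
    """Core-seconds consumed when every scheduled batch pays one startup cost."""
    n_batches = math.ceil(n_jobs / batch_size)
    return n_batches * startup_seconds + n_jobs * work_seconds


def efficiency(n_jobs: int, work_seconds: float,
               startup_seconds: float, batch_size: int = 1) -> float:
    """Fraction of consumed core time that is spent on useful work."""
    useful = n_jobs * work_seconds
    return useful / total_core_seconds(n_jobs, work_seconds,
                                       startup_seconds, batch_size)


# Half a million one-second jobs, 15 seconds of startup each (figures from the text).
unbatched = efficiency(500_000, 1.0, 15.0)
# Hypothetical batching: 1,000 jobs share a single startup.
batched = efficiency(500_000, 1.0, 15.0, batch_size=1_000)
```

Under these assumptions, the unbatched run spends only one-sixteenth of its core time on useful work, while batches of 1,000 bring that above 98%.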

Data analysis vs source storage

Esnouf is currently preparing the centre’s systems to work with the university’s new Big Data Institute (BDI), due to open in November 2016. But this is subject to a different kind of constraint: “There’s not enough electricity in Oxford.” The BDI is looking for new datacentre space – probably outside Oxford – and will announce its plans over the next few months. Anything should be an advance over the current location for the centre’s hardware, however: “We don’t have a datacentre, we just converted a small freezer morgue.”

Tim Cutts, the Sanger Institute’s head of scientific computing, says it has been making progress on efficiency. Its datacentre, which has around 25-30PB of total usable storage, operates with some 17,000 CPU cores, and the latter figure has not changed for some time: “We’ve come to the realisation that code optimisation is kind of important,” he says. “We’ve managed to convince some groups to be more efficient.”

Another option is to cut the volumes of data by storing analysis of medical records, rather than data-heavy source material. “You don’t need the PACS image,” says Alf Wachsmann, head of the IT department at the Max Delbrueck Centre for Molecular Medicine in Berlin, of medical images from picture archiving and communication systems. “You need to know what’s on the PACS image.” This requires data mining, which he would like to use to analyse doctors’ letters for meaning, rather than merely their words: “Ideally, you need to do some natural language processes, because if you search for ‘cancer’ you find letters saying, ‘X has no cancer.’”
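Wachsmann's point about negation can be illustrated with a deliberately crude sketch. The cue list, the three-word window and the function name are all assumptions made for this example; real clinical text mining relies on far richer rules or trained models.

```python
import re

# Hypothetical, very small list of negation cues, for illustration only.
NEGATION_CUES = {"no", "not", "without", "denies", "negative"}


def affirmed_mention(text: str, term: str, window: int = 3) -> bool:
    """True if `term` occurs with no negation cue in the `window` words before it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    term = term.lower()
    for i, token in enumerate(tokens):
        if token == term:
            preceding = tokens[max(0, i - window):i]
            if not NEGATION_CUES.intersection(preceding):
                return True
    return False


letters = [
    "X has no cancer.",
    "Biopsy confirmed cancer in the left lung.",
]

# A plain keyword search matches both letters.
keyword_hits = [letter for letter in letters if "cancer" in letter.lower()]
# The negation-aware check keeps only the affirmed diagnosis.
affirmed_hits = [letter for letter in letters if affirmed_mention(letter, "cancer")]
```

A window-based rule like this misses longer constructions such as “no evidence at this stage of any cancer”, which is one reason Wachsmann talks about analysing meaning rather than words.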

As well as causing problems for the performance of systems, the sheer volume of data in genomics militates against using techniques such as cloud computing. “You can’t get it over the internet,” says Wachsmann. This is problematic even when transferring between two institutions in Germany, he adds. It may make more sense for his centre to move computation to the German Cancer Research Centre in Heidelberg. More than one speaker referred to sending storage devices by courier as an alternative to online transfers.
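A rough calculation shows why couriers remain attractive. The sketch below assumes decimal units (1TB = 10^12 bytes) and a link that sustains a fixed fraction of its nominal rate; the petabyte and one-day delivery in the usage lines are hypothetical figures, and the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(terabytes: float, link_gbps: float,
                  utilisation: float = 1.0) -> float:
    """Days needed to move `terabytes` over a link of `link_gbps` gigabits per second."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * utilisation)
    return seconds / SECONDS_PER_DAY


def courier_equivalent_gbps(terabytes: float, delivery_days: float) -> float:
    """Sustained network rate that would match a courier delivering `terabytes`."""
    bits = terabytes * 1e12 * 8
    return bits / (delivery_days * SECONDS_PER_DAY) / 1e9


# Hypothetical case: one petabyte over a fully used 1Gbps link takes about three months.
online = transfer_days(1_000, 1.0)
# The same petabyte delivered by courier in one day matches a link of roughly 93Gbps.
courier = courier_equivalent_gbps(1_000, 1.0)
```

By the same arithmetic, the 11TB a day cited for the German Cancer Research Centre would cross a fully used 40Gbps network in about 37 minutes, which is consistent with Eils locating the bottleneck in input/output rather than in the network itself.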

EU could scupper genomics research

Wachsmann highlights what he believes to be a major danger to genomics research in Europe, from European Parliament plans for the next data-protection directive that would require specific consent for each use of genomic data. “The US may laugh at this a little bit,” he says, adding: “If the law goes through as now, we may as well forget about this.”

“Privacy activists are very paranoid about sharing data, then have a Fitbit and are on Facebook anyway,” he says. Such activists can have a powerful influence on politicians, but Wachsmann hopes a sensible solution can be found. “I’m optimistic. I come from high-energy physics. The particles don’t mind.” One option may be to allow researchers to query other people’s data without seeing the actual information, he adds.

The Wellcome Trust Sanger Institute says it shares concerns over the European moves. Research policy adviser Sarion Bowers says in a statement that the proposals would “pose a serious barrier to future genomic research and will hinder its use as a valuable tool for healthcare”.

Bowers adds: “The European Parliament’s proposals would require us to get consent from every single individual whose genomic sequence is held in a database every single time a clinician or researcher wished to access their data. We feel the European Parliament’s proposals undermine the public’s rights to choose how they share their data and engage with research, and will be highly damaging to research across Europe and impact on our ability to collaborate around the world. We strongly favour the commission and council’s positions which would allow individuals to give broad consent to cover a range of research activities.”

Regulatory issues in general are causing problems for genomics, the institute’s Tim Cutts says. “People are very frightened about what could happen,” he says, referring to a regular gathering of UK government CIOs known as ‘the Daily Mail meeting’, held to discuss things they were doing that might upset the tabloid newspaper. “The fear of what might go horribly wrong is what’s paralysing people.

“People are understandably concerned about what could happen to their personal data. However, to maximise the impact of healthcare data it is essential that we have access to, and share, this data, so the public perception of the security of genomic information is critical.”
