Completed in 2003 after years of research and development, the $3.8bn Human Genome Project was the largest collaboration ever undertaken in biology. Advances in technology mean it is now possible for a single researcher to sequence a human genome in a month. Cliff Saran finds out how IT is at the core of Cancer Research UK's efforts to find a cure for cancer.
A breakthrough in genome technology, a decade after the human genome was sequenced, is changing the way researchers fight diseases. Peter Macallum, head of IT and scientific computing at Cancer Research UK, said high-performance computing is core to the research organisation's fight to cure cancer.
"The human genome project took 10 years and billions of pounds to complete. It can now be done in this building [at Cancer Research in Cambridge] using commodity factory-produced sequencers that can generate a human genome sequence in less than a month," he said. Today, the price of sequencing a complete human genome is around $4,000.
This means more and more laboratories now have access to sequencer technology. "Cancer is a genetic disease and is characterised by genetic changes in an individual's tumour make-up," said Macallum. "We can take samples from a patient to better understand the genetic changes that underlie the disease."
For Macallum, biological research now has less to do with staring down a microscope, and more to do with technology. "Biological research is becoming more IT-driven," he said. "It is all about number crunching, data analysis and efficient retrieval of archived analyses."
The sequencer works a bit like a camera, with a charge-coupled device (CCD) sensor that effectively produces image data. This image data is turned into sequence data, which has to be aligned against a reference human genome. The four sequencers at Cancer Research generate up to 5TB of data per week, and running various algorithms on this data reduces it to about 250GB of aligned sequence data.
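The alignment step described above can be illustrated with a minimal sketch. This is not Cancer Research's actual pipeline (production tools such as BWA or Bowtie tolerate mismatches and scale to billions of reads); it simply shows the core idea of locating short sequencing "reads" within a reference sequence. The reference string and reads here are invented for the example.

```python
# Illustrative sketch: aligning short sequencing "reads" against a
# reference sequence by exact substring matching. Real aligners handle
# mismatches, quality scores and genome-scale references.

REFERENCE = "ACGTTGACCTGAAGTCACGTAGGCT"  # toy stand-in for a reference genome

def align_read(read: str, reference: str) -> int:
    """Return the 0-based position where the read aligns, or -1 if it doesn't."""
    return reference.find(read)

reads = ["TGACCT", "CACGTA", "GGGGGG"]
alignments = {r: align_read(r, REFERENCE) for r in reads}
# "GGGGGG" does not occur in the reference, so it fails to align (-1).
```

In a real pipeline this per-read placement is what turns terabytes of raw sequencer output into a far smaller set of aligned coordinates against the reference genome.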
High-performance computing cluster
This is why Cancer Research needs a high-performance computing cluster. It runs a 1,280-core computational cluster built using HP C7000 blade enclosures and DL460 blades with 16-48GB of memory per node to process the sequencer information.
One issue the organisation faced was that it did not want to run dedicated clusters for the genome sequencers, because other groups at Cancer Research also need to run experiments and simulations.
To share access to the computing cluster, Macallum needed scheduler technology with management modules to allocate and prioritise jobs, allowing people to request the resources they need from the cluster to run their computing workloads.
Using LSF scheduling software from Platform Computing, which understands memory configurations for different blades in the cluster, Cancer Research can offer different levels of service to different groups within the organisation.
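The scheduling behaviour described above, jobs prioritised and matched to blades with the right memory configuration, can be sketched in miniature. This does not reproduce LSF's actual algorithms; the blade sizes, job names and priorities are invented for the illustration.

```python
import heapq

# Illustrative sketch of memory-aware, priority-based job dispatch, loosely
# modelling what a scheduler such as Platform LSF does. Blade sizes and job
# details are invented for the example.

blades = [{"name": "blade-16g", "mem_gb": 16},
          {"name": "blade-48g", "mem_gb": 48}]

jobs = []  # min-heap: a lower number means a higher priority
heapq.heappush(jobs, (1, "genome-align", 40))  # (priority, job name, memory needed in GB)
heapq.heappush(jobs, (2, "simulation", 8))

def dispatch(jobs, blades):
    """Pop jobs in priority order; place each on the first blade with enough free memory."""
    free = {b["name"]: b["mem_gb"] for b in blades}
    placements = {}
    while jobs:
        _priority, name, mem_gb = heapq.heappop(jobs)
        for blade, avail in free.items():
            if avail >= mem_gb:
                placements[name] = blade
                free[blade] -= mem_gb
                break
    return placements

placements = dispatch(jobs, blades)
# The high-priority 40GB alignment job lands on the 48GB blade; the
# lower-priority simulation fits on the 16GB blade.
```

The point of a memory-aware scheduler is exactly this matching: a job that needs a large-memory blade is not starved by small jobs, and small jobs do not waste large-memory hardware.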
Substantial storage requirements
To run a cluster you need a storage system that can cope with hundreds of processors reading and writing concurrently, so efficient storage is a significant requirement at Cancer Research.
It uses the Lustre open source parallel file system running on HP's Fibre Channel storage infrastructure, with 96TB connected directly to the cluster. The configuration is optimised to keep data throughput high.
Cancer Research also uses network attached storage (NAS) based on the Linux XFS file system, which is used to store data that is being analysed by users.
In terms of archiving, Macallum said researchers repeatedly go back to stored data, the value of which increases the longer it is kept. "We keep 100TB per year, and that number is going up and up."
To support this requirement, Cancer Research uses HP Ibrix (the HP X9000), which provides half a petabyte of disk-based back-up and allows the organisation to change the back-end storage without affecting users.
Software needs to be multi-threaded
While the cluster is achieving 90-95% utilisation, Macallum said software is its weak point. "In terms of programming, central processing unit (CPU) usage is only 60% because the software is not written as a multi-threaded application." This means it cannot make the best use of the multi-core architecture on which the Cancer Research computing cluster is built. "Some of the key software needs to be multi-threaded."
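The structural change Macallum describes, splitting a CPU-bound analysis across workers rather than running it on one core, can be sketched as below. The workload (summing squares over chunks) is an invented stand-in for real sequence-analysis code, and note that in CPython the global interpreter lock limits pure-Python CPU-bound threads, so real pipelines would use native threads, multiple processes or C extensions to occupy every core.

```python
from concurrent.futures import ThreadPoolExecutor

def analyse(chunk):
    # Stand-in for one chunk of a real sequence analysis.
    return sum(x * x for x in chunk)

def run_serial(chunks):
    # Single-threaded: one core does all the work, leaving the rest idle.
    return sum(analyse(c) for c in chunks)

def run_threaded(chunks, workers=4):
    # Multi-threaded: chunks are handed out to a pool of workers. (In CPython
    # the GIL limits pure-Python CPU-bound threads; native code or processes
    # are needed for true multi-core speed-up.)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(analyse, chunks))

chunks = [range(i * 1000, (i + 1) * 1000) for i in range(8)]
```

Both paths produce the same result; the difference is that the threaded version structures the work so that a multi-core node can, in principle, be kept fully busy, which is what the cluster's 60% CPU-usage figure is losing.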
Cancer Research is also interested in graphics processing unit (GPU) technology, which can run massively parallel applications on relatively low-cost graphics hardware. However, Macallum said that to support this it would have to rewrite its algorithms at a fundamental level, and it would be difficult to map Cancer Research's data analysis-heavy workloads onto a GPU architecture.