IT at biology’s frontiers

In less than 10 years IT has gone from being a valuable tool for life sciences to being at the heart of some of the most important research projects ever undertaken.

The mapping of the human genome, along with work in other areas of the life sciences, has created frighteningly big and fast-expanding repositories of biological data. Scientists expect to find within these oceans of information the answers to many of our most pressing problems, from cancer to mental illness and ageing. But with the pace of discovery far outpacing Moore's Law, IT has its work cut out.

The mapping of the human genome in 2003 produced so much data that, printed as books, it would make a stack 17 times higher than Everest.

It was a feat on par with Newton's Principia, the discovery of the double helix, and even the moon landing for what it meant to mankind. And it would not have been possible without the most sophisticated hardware and software in existence.

Gene database

Professor Janet Thornton - head of the European Bioinformatics Institute (EBI), a part of the European Molecular Biology Laboratory - presides over some of the world's most extensive databases of genes, genomes, proteins, nucleotides and other matter. EBI is one of the world's leading institutes for the application of information technology to biology.

Its European Nucleotide Archive, a repository for public nucleotide sequence data, holds 3.47 terabases (3.47 x 10^12 bases) of sequence, translating to 106.9TB (terabytes) of storage. The next release of UniProt, a collaborative protein information database, will contain information on nearly 8 million proteins. Currently the EBI's stores of biological data amount to 4 petabytes; one petabyte is roughly one quadrillion bytes, or 1,024TB.
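As a rough check on the archive figures quoted above, the short Python sketch below works out how many bytes of storage each stored base accounts for. It assumes decimal terabytes (1TB = 10^12 bytes), which the article does not specify, and the function name is purely illustrative.

```python
# Figures quoted in the article for the European Nucleotide Archive.
BASES = 3.47e12        # 3.47 terabases of sequence
STORAGE_TB = 106.9     # storage footprint in terabytes

# Assumption: decimal terabytes (1TB = 10**12 bytes); the article does not say.
BYTES_PER_TB = 10**12


def bytes_per_base(bases: float, storage_tb: float) -> float:
    """Return the average number of stored bytes per sequenced base."""
    return storage_tb * BYTES_PER_TB / bases


if __name__ == "__main__":
    print(f"{bytes_per_base(BASES, STORAGE_TB):.1f} bytes per base")
```

Under that assumption the answer is about 31 bytes per base. A bare base (A, C, G or T) can be encoded in two bits, so the figure suggests, though the article does not say so, that most of the space goes on material stored alongside the raw sequence.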

It is reasonable to hope that within the bulging repositories of the EBI and others lie the answers to many of our most pressing questions, such as the causes of Alzheimer's, cancer, mental illness and ageing, and the basis of intelligence, to name but a few.

But the task of finding them now depends more than ever on the quality of innovations emerging from the IT sector.

Leading IT companies including IBM, Microsoft, EMC, Sun Microsystems, Oracle and others have been increasing their investment in life sciences in anticipation of strong market growth in the next few years.

Oracle, for instance, offers a suite of applications for pharmaceutical and medical research groups and boasts the top 20 organisations in both areas as its customers.

Similarly, IBM has seen its life sciences business expand significantly over the past ten years.

Driving the market is the fact that as more and more biological information is collected, more computing power is needed to go through it all and check for possible applications in disease treatment and health.

Take the Human Genome Project (HGP), for instance. Mapping human DNA is one thing, but it is quite another to test the reactions of genes to drugs and to explore a virtually infinite sea of biological possibilities, any of which might represent a cure for a given disease.

"There are fewer than 25,000 human genes - but try to do combinatorial studies between them and it starts to get quite mindboggling," says Thornton.

"To say that it is a mountain to climb is an understatement; the current flood of data easily outpaces Moore's law."
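To give a sense of the scale Thornton describes, the Python sketch below counts how many distinct groups of genes a combinatorial study would have to consider. It uses her upper bound of 25,000 genes and treats every unordered group of k genes as one candidate study, which is a simplifying assumption made here rather than anything the EBI has stated.

```python
from math import comb

GENES = 25_000  # upper bound on human genes quoted by Thornton


def gene_combinations(k: int, genes: int = GENES) -> int:
    """Return the number of unordered groups of k genes drawn from the total."""
    return comb(genes, k)


if __name__ == "__main__":
    for k in (2, 3, 4):
        print(f"groups of {k} genes: {gene_combinations(k):,}")
```

Pairs alone come to 312,487,500, and triples to more than 2.6 trillion, before any drug or dosage is added as a further variable.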

UK-based Titian services several of the world's largest pharmaceutical companies with systems to help in the management and rapid retrieval of biological samples, in most cases running into the millions.

"There are the chemicals which could be small molecules, synthesis compounds, or natural products and antibodies - all items the company potentially expects to be their drug compounds," explains CEO Richard Fry.

"It is a frozen treasure trove that could be the next billion-dollar drug."

High-throughput screening

When Darwin was looking for links between the species, he was able to use only what he could see in front of him. Now, thanks to high-throughput screening (HTS), researchers can see what is related to what in incredible detail.

"It just does not work anymore to have large numbers of people doing things manually," Fry says.

HTS allows a researcher to quickly conduct millions of biochemical, genetic or pharmacological tests. Through this process one can rapidly identify active compounds, antibodies or genes which modulate a particular biomolecular pathway. The results of these experiments provide starting points for drug design and for understanding the interaction or role of a particular biochemical process in biology.

A key tool for HTS is the microarray, a kind of biological computer chip made of glass designed to enable high-speed assaying of compounds and their reactions. But storing and managing the information being generated by HTS is, of course, a major challenge.

Furthermore, EBI's Thornton says that the development of new sequencing chains has increased the rate of gene sequencing by one, if not two, orders of magnitude.

"Now, realistically, we can generate 10 to the power of 9 (the size of one genome) in about two days," she says.

Just two years ago the expected time frame for accomplishing this would have been several years.
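The figures above can be put side by side with Moore's Law in a short Python sketch. It takes Thornton's quoted output of 10^9 bases in two days at face value, and it assumes a doubling period of two years for Moore's Law, a common rule of thumb that the article itself does not state.

```python
from math import log2

BASES_PER_RUN = 10**9      # output quoted by Thornton
RUN_DAYS = 2               # "about two days"
SECONDS_PER_DAY = 86_400


def bases_per_second(bases: int, days: float) -> float:
    """Return the average sequencing rate in bases per second."""
    return bases / (days * SECONDS_PER_DAY)


def years_for_speedup(factor: float, doubling_years: float = 2.0) -> float:
    """Return the years needed to gain `factor` if capacity doubles every `doubling_years`.

    The two-year default is an assumption, not a figure from the article.
    """
    return doubling_years * log2(factor)


if __name__ == "__main__":
    print(f"{bases_per_second(BASES_PER_RUN, RUN_DAYS):,.0f} bases per second")
    for factor in (10, 100):
        print(f"{factor}x at Moore's Law pace: {years_for_speedup(factor):.1f} years")
```

On those assumptions a tenfold gain would take computing hardware between six and seven years, and a hundredfold gain about 13, which is the gap Thornton points to when she says the flood of data outpaces Moore's Law.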

EBI is one of a handful of organisations involved in the 1000 Genomes Project, an international research consortium hoping to create a detailed picture of human genetic variation. The project involves sequencing the genomes of approximately 1,200 people from around the world.

Thornton says that a few years ago when the first tranche of data arrived at Cambridge from the project, it alone was greater than all of the genetic data then held by EBI. It was also the first time that EBI had taken information on specific individuals, an event that highlighted important issues of security and privacy in biological research.

The EBI's data is housed within a 160 square-metre section of the Wellcome Trust datacentre. Recently, Thornton and her team calculated that they would soon need ten times that space to adequately house data from the institute's many fast-growing projects.

Globally, further sources of new and yet-to-be-understood data are being discovered. Seen as the chemical equivalent of the HGP, the Human Metabolome Project, led by Canada's University of Alberta, has so far listed close to 3,000 chemicals found in or made by the human body - triple what was expected, with double the number of substances stemming from drugs and food. The chemicals, known as metabolites, represent the ingredients of life, just as the human genome represents the blueprint for life, and they open up exciting potential for new breakthroughs.

Microscopic images

Another emerging area is that of high-throughput, or large-scale, analysis and processing of microscopic images.

"In the next 10 years we will have pictures of cells and organs that can be analysed by computers," predicts Thornton, adding, "This has not even started yet."

IBM's Healthcare Information and Imaging Grid (HIIG), launched last December, aims to address some of these challenges. The company also announced new software features for the IBM Grid Medical Archive Solution (GMAS), a high performance, grid-based storage solution.

Its new software component, GAM 2.1, will now support applications in digital pathology, high-throughput screening and mass spectrometry (MS). MS is an analytical technique for understanding the composition of a sample or molecule, and involves ionising chemical compounds and measuring properties such as their mass-to-charge ratios.

In-silico testing is another area expected to see huge growth in the next few years as computers get better at simulating clinical trials that would normally depend on data taken from animals and humans. The cost savings to pharmaceutical companies and research institutes would be enormous.

"Any rational scientist will say any method that can improve our ability to accurately predict the effects of drugs or chemicals on human beings has to be beneficial and would reduce the need for animal experiments," notes Thornton. "But again it will take time."

It has also been suggested that bioinformatics may in future allow scientists to reconstruct the genomes of extinct animals and possibly bring them back to life.

The effective utilisation of all of this information will depend largely on technologies capable of managing and interpreting it all within a central repository, so different types of data can be effectively cross-referenced.

Open source

Further, as more and more information is accumulated around the world, it is crucial that scientists are better able to share data and collaborate. A compound reaction discovered in Japan may, for instance, have implications for a clinical trial in the UK.

Previously, such events would be the result of coincidence. Now the EBI and other groups are attempting to reach agreement on the development of open access platforms for biological data. Systems employing or modelled on the concept of open source software are expected to play a fundamental role. It is also hoped that internet-based "browsers" for cancer, genomes and other areas of research will aid in the sharing of information.

The success of these and other attempts to foster global collaboration could have major implications for drugs and other areas of discovery. But there is a long way to go.

"Of course, what we all want to do is convert that data into knowledge and understanding, and translate that into improved health, ecology and biodiversity," Thornton says.

"But there is much work in developing new algorithms and approaches to interpreting and finding patterns in the data; we still do not really understand the molecular basis for ageing."
