Peter Murray-Rust and the data-mining robots

Peter Murray-Rust, a reader in molecular informatics at the University of Cambridge, has a vision. In his vision, software robots roam the network collecting scientific information, which they aggregate and process to arrive at new insights. Sometimes they make scientific discoveries.

Before this can happen, however, an enabling infrastructure will need to be built - a task Murray-Rust has been dedicated to for 30 years. During that time he has also become a passionate supporter of the Open Data movement, which advocates for non-textual material such as chemical compounds, genomes, mathematical and scientific formulae, and bioscience data to be made freely available on the web.

Murray-Rust's epiphany came while on sabbatical in Zurich in the late 1970s, where he spent most of his time in a library, poring over the thousands of molecular structures published in chemical journals. "I spent six months going through the literature and came home with several hundred data points," he says. "Each data point was the product of a visit to the library to find a single piece of information in a journal."

For every molecule he wanted to research, he had to extract all the published data and then do a complex calculation. "A paper might give you the coordinates of the atoms of a molecule, for instance, but not the distances. So you had to do a little sum - and until you had done the sum you did not know whether the answer made sense," Murray-Rust says.
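The "little sum" is easy to picture in code. What follows is only a minimal sketch: it assumes the paper gives Cartesian coordinates in angstroms (crystallography papers often give fractional coordinates instead, which need an extra conversion step), and the atom labels and values are invented for illustration.

```python
import math

# Hypothetical Cartesian coordinates in angstroms; the labels and values
# are illustrative and not taken from any published structure.
atoms = {
    "C1": (0.000, 0.000, 0.000),
    "O1": (1.200, 0.000, 0.000),
    "H1": (-0.500, 0.900, 0.000),
}


def distance(a, b):
    """Straight-line (Euclidean) distance between two 3D points."""
    return math.dist(a, b)


def all_distances(coords):
    """Map every unordered pair of atom labels to the distance between them."""
    labels = sorted(coords)
    return {
        (p, q): distance(coords[p], coords[q])
        for i, p in enumerate(labels)
        for q in labels[i + 1:]
    }


for pair, d in all_distances(atoms).items():
    print(pair, round(d, 3))
```

Done by hand for thousands of structures, this is months of work; done by a program, it is a loop.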

Knowing the structure and coordinates of molecules is important, not least when developing new drugs. It is essential, for instance, to know as precisely as possible how the molecules in a drug are likely to interact with the molecules in a patient's body, and consequently what side-effects are likely.

Leafing through the journals in Zurich, Murray-Rust was struck by how much valuable data was hidden within them, but was convinced that there had to be a better way of extracting them. "A great many discoveries in science rely on using the information in the literature," he says. "And occasionally a very talented individual is able to digest a mass of seemingly unrelated data - the boiling point of this, the colour of that, the mass of something else - and bring it all together in a self-consistent way. A good example of this is the periodic table."

But such insights are infrequent, even for gifted individuals. Would it not be better, Murray-Rust thought, to let computers do the heavy lifting? "I had this vision that if we could extract this data we could have robots go round collecting it and coming up with new chemical hypotheses."

On his return home, Murray-Rust visited Cambridge, where researchers had pioneered the use of X-ray crystallography for creating 3D pictures of the atoms in molecules. In the process, researchers had created the Cambridge Crystallographic Data Centre (CCDC) where crystal structures were deposited.

After writing software to mine the CCDC, Murray-Rust was soon able to achieve in the blink of an eye what had taken six months of laborious work in Zurich. "Each crystal structure was the result of an isolated experiment. But once we were able to put them all together we could see trends and clusters that were totally invisible if you only looked at each paper in which their details had been published."

But with just 20,000 crystal structures, the CCDC was only scratching the surface. Convinced that a lot more would be needed, in 1982 Murray-Rust took a job with Glaxo (now part of GlaxoSmithKline), working on computational chemistry, and later protein structure determination.

When the web exploded into life, Murray-Rust saw its potential immediately - particularly after Tim Berners-Lee outlined his vision of the Semantic Web, promising the advent of complex machine-to-machine interaction. Murray-Rust vowed to create a Semantic Web of chemicals.

First, however, the treasures he had glimpsed in the chemical literature would need to be extracted, and made freely available on the web. They would also need to be in a machine-readable format.

Returning to academia in 1996 - as professor of pharmacy at Nottingham University - Murray-Rust began collaborating with Henry Rzepa. Together they created the XML-DEV mailing list, where list members quickly hammered out the Simple API for XML (SAX), an event-based interface for XML parsers. SAX was to prove a vital building block for the Semantic Web, and a de facto standard.
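To give a flavour of how a SAX-style parser works, here is a minimal sketch using the SAX module in Python's standard library. The parser reads the document from start to finish and calls a handler each time an element opens, rather than loading the whole file into memory. The sample document and the `ElementCounter` class are invented for illustration.

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Counts how many times each element name opens in a document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called by the parser once for every opening tag it encounters.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(xml_text):
    """Parse an XML string and return a dict of element-name counts."""
    handler = ElementCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.counts


# Hypothetical sample document, not a real journal format.
sample = (
    "<journal>"
    "<article><title>A</title></article>"
    "<article><title>B</title></article>"
    "</journal>"
)
print(count_elements(sample))
```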

Murray-Rust and Rzepa then developed Chemical Markup Language (CML), a domain-specific implementation based strictly on XML. CML allows researchers to formally describe chemical compounds (molecules and substances), chemical reactions, spectroscopy, crystallography, and the output of chemical computation. "I want researchers to use CML as the primary tool for putting chemistry into scientific publications," Murray-Rust says.
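As a rough illustration of what this looks like in practice, the sketch below reads a small CML-style description of a water molecule. The fragment is simplified and its coordinates are made up; the element and attribute names follow CML conventions as commonly published, but the schema itself should be treated as the authority, and the `read_molecule` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"

# Simplified, illustrative fragment; coordinates are invented.
cml_text = f"""<molecule xmlns="{CML_NS}" id="m1">
  <atomArray>
    <atom id="a1" elementType="O" x3="0.000" y3="0.000" z3="0.000"/>
    <atom id="a2" elementType="H" x3="0.960" y3="0.000" z3="0.000"/>
    <atom id="a3" elementType="H" x3="-0.240" y3="0.930" z3="0.000"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>"""


def read_molecule(text):
    """Return ({atom id: element symbol}, [(atom id, atom id), ...])."""
    ns = {"cml": CML_NS}
    root = ET.fromstring(text)
    atoms = {
        atom.get("id"): atom.get("elementType")
        for atom in root.findall("cml:atomArray/cml:atom", ns)
    }
    bonds = [
        tuple(bond.get("atomRefs2").split())
        for bond in root.findall("cml:bondArray/cml:bond", ns)
    ]
    return atoms, bonds


atoms, bonds = read_molecule(cml_text)
print(atoms)
print(bonds)
```

Because every atom and bond is labelled explicitly, a program can read the molecule without guessing at the layout of a table or a picture.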

After moving to Cambridge, Murray-Rust played a key role in the creation of a toolkit to make this possible, including an authoring tool (Word for chemicals), a rendering system for displaying chemical information (a chemical browser), and online repositories for housing chemical data.

In typical open source fashion, much of this work has been undertaken by volunteers, working under the umbrella of the Blue Obelisk group. "We are not yet in a position to walk into a typical chemical information organisation and say, 'Here is a complete set of tools which will do everything you want'," says Murray-Rust. "But we have got all the components to proof-of-concept stage, and some to a higher level."

And to make chemical data machine-intelligible, Blue Obelisk members are developing Resource Description Framework tools and ontologies. "Ontologies are vital," says Murray-Rust. "If, for instance, a paper says that benzene melts at 6 degrees centigrade, there will need to be an ontology that can define what benzene is, an ontology that defines what it means to melt, and an ontology that defines what centigrade means."
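The benzene example can be sketched as a handful of subject-predicate-object statements, the basic shape of RDF data. This is a toy in plain Python, not real RDF tooling: every "ex:" name below is a hypothetical identifier rather than a term from any published ontology.

```python
# Each fact is a (subject, predicate, object) triple. Every "ex:" name is a
# hypothetical identifier, not a term from a real ontology.
triples = [
    # Definitions: what kind of thing each term is.
    ("ex:benzene", "ex:isA", "ex:ChemicalCompound"),
    ("ex:melting", "ex:isA", "ex:PhaseTransition"),
    ("ex:degreeCelsius", "ex:isA", "ex:TemperatureUnit"),
    # The claim from the paper: benzene melts at 6 degrees centigrade.
    ("ex:claim1", "ex:substance", "ex:benzene"),
    ("ex:claim1", "ex:process", "ex:melting"),
    ("ex:claim1", "ex:value", 6.0),
    ("ex:claim1", "ex:unit", "ex:degreeCelsius"),
]


def objects(subject, predicate):
    """All objects stated for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def is_defined(term):
    """A term counts as defined here if some triple says what kind of thing it is."""
    return bool(objects(term, "ex:isA"))


def undefined_terms(claim):
    """Terms a claim relies on that have no definition in the triple list."""
    used = [o for s, p, o in triples if s == claim and isinstance(o, str)]
    return [term for term in used if not is_defined(term)]


print(undefined_terms("ex:claim1"))
```

A robot can act on the claim only if the list of undefined terms comes back empty, which is the point Murray-Rust is making about ontologies.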

But building the technical infrastructure has proved to be the easy part. Although most chemical journals are now electronic, extracting the data, and making them freely available on the web, is proving very difficult.

While text-mining tools are still not perfect, they can do a passable job of extracting "embedded data" from electronic journals - eg, graphs, tables, charts, molecular structures, and spectral and crystallography data. The bigger problem is that in doing so Murray-Rust finds himself in uncharted legal territory, since many publishers regard text mining as illicit. "There is a general view amongst publishers that the full-text is sacrosanct," says Murray-Rust. "They would say, for instance, 'We own everything in the PDF file containing the article, including the embedded data'."

This is odd, he adds, since the only data he wants to extract are of a factual nature - and as facts are not subject to copyright they belong to no one. Nevertheless, the problem was demonstrated last year when Michigan-based graduate student Shelley Batts copied a graph from a Wiley journal, and put it on her website. Wiley threatened to sue her. Although Batts was able to evade the lawyers by retyping the data, this is not an option for Murray-Rust, as he wants to extract embedded data on an industrial scale.

Last year, for instance, one of his postgraduate students created a web spider called Crystaleye. This "listens" for when new journal issues are published and then scans them for any crystal structures. When it finds one it downloads the file, translates the data into CML, and puts it in a CCDC-like repository. To date 110,000 structures have been extracted. "Crystaleye transforms the scholarly publication of crystallography into a giant knowledgebase of much greater power than the isolated articles," says Murray-Rust.
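The workflow described above (spot crystal structures in a new issue, convert them, store them) can be sketched in a few lines. This is not Crystaleye's actual code. It assumes structures arrive as files with a ".cif" extension and that the atoms have already been parsed out, and it uses a plain dictionary as a stand-in for the repository. All function names and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_crystal_structure(filename):
    """Assumption: crystal structures are the files ending in .cif."""
    return filename.lower().endswith(".cif")


def to_cml(entry_id, atoms):
    """Build a simplified CML-style string from (element, x, y, z) tuples."""
    molecule = ET.Element("molecule", {"id": entry_id})
    atom_array = ET.SubElement(molecule, "atomArray")
    for i, (element, x, y, z) in enumerate(atoms, start=1):
        ET.SubElement(
            atom_array,
            "atom",
            {"id": f"a{i}", "elementType": element,
             "x3": str(x), "y3": str(y), "z3": str(z)},
        )
    return ET.tostring(molecule, encoding="unicode")


def harvest(new_files, repository):
    """Convert and store unseen crystal structures; return the names added."""
    added = []
    for item in new_files:
        name = item["filename"]
        if not is_crystal_structure(name) or name in repository:
            continue
        entry_id = name.rsplit(".", 1)[0]
        repository[name] = to_cml(entry_id, item["atoms"])
        added.append(name)
    return added


# Hypothetical contents of a newly published journal issue.
issue = [
    {"filename": "paper1.cif", "atoms": [("C", 0.0, 0.0, 0.0)]},
    {"filename": "paper1.pdf", "atoms": []},
]
repository = {}
print(harvest(issue, repository))
```

Running the harvest a second time on the same issue adds nothing, since the repository already holds the structure.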

Murray-Rust also wants to mine the "supplemental information" files attached to science papers. Again, these are just collections of facts. "In almost all cases we are simply talking about a record of the experiment, which will likely include temperatures, materials, and analytic results etc. Or it might be just a copy of the computer output of a simulation."

But even though facts cannot be copyrighted, some publishers require researchers to sign over ownership of these files as a condition of publication. "I suspect that most authors simply do not realise that when they sign the transfer of copyright form they are also signing over ownership of the supplemental information," says Murray-Rust. "And they do not realise that doing so is in any case a meaningless act legally."

While some publishers pursue a proprietary approach as part of a deliberate policy, says Murray-Rust, others simply have not thought it through. "I am sure there are publishers out there that do not understand they are creating a problem, and if told about it would change."

In the meantime, however, the lack of clarity puts Murray-Rust in an invidious position. Although confident he is doing nothing illegal, he fears receiving lawyers' letters. "If it goes to court, even if I think I am right, I face being accused of breaking the law."

In short, it turns out that creating the right legal infrastructure is as important as building the technical infrastructure. Fortunately, Science Commons recently developed a licence - the Public Domain Dedication & Licence - that allows data creators to signal their agreement to their data being reused. And since the licence can be expressed as metadata, it will be comprehensible to automated services such as Crystaleye.
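Why the metadata matters to a robot can be shown with a minimal sketch. It assumes each record carries a "license" field and that the string "PDDL-1.0" marks data released under the licence; the field name, the identifier, and the sample records are all illustrative assumptions.

```python
# Hypothetical licence identifiers the robot treats as permitting reuse.
REUSABLE = {"PDDL-1.0"}

# Hypothetical records as a harvesting robot might encounter them.
records = [
    {"id": "struct-001", "license": "PDDL-1.0"},
    {"id": "struct-002", "license": "all-rights-reserved"},
    {"id": "struct-003"},  # no licence metadata at all
]


def reusable(record):
    """True only if the record explicitly declares a reusable licence."""
    return record.get("license") in REUSABLE


def select_reusable(items):
    """Ids of the records the robot may safely collect."""
    return [record["id"] for record in items if reusable(record)]


print(select_reusable(records))
```

The cautious default, skipping anything without an explicit licence, reflects the legal uncertainty the article describes.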

Today, therefore, the main challenge is to persuade researchers and publishers to share their data, which is why Murray-Rust is now a passionate advocate of Open Data - a cause to which he devotes an increasing amount of time, through activities such as lobbying publishers, educating researchers, and alerting the world to the issue via his blog.

He does not grudge this. "Sharing our knowledge is a necessary but not sufficient condition for saving the planet," he says. "And here I am not just talking about global warming - I am also talking about how we save the planet from disease, from ignorance, and from all sorts of other things. Open Data are a critical part of the infrastructure we need for 21st century living. And this goes way beyond science - it is also about things like map data, climate data, and traffic data. So you are going to be hearing a lot more about open data in the next five years."

He is, however, in a hurry. "I want to see robots creating new knowledge," he says. "And I want to see it happen before I finish doing science."
