Cern leads the way in database innovation

In March 1989, Tim Berners-Lee submitted a proposal for an information management system to his boss, Mike Sendall. 'Vague, but exciting', were the words that Sendall wrote on the proposal, allowing Berners-Lee to develop what eventually became known as the World Wide Web.

"I found it frustrating that in those days, there was different information on different computers, but you had to log on to different computers to get at it. Also, sometimes you had to learn a different program on each computer," said Berners-Lee on his website.

The proposal was originally intended to help scientists working on the big bang project to keep track of the masses of information they compiled in reports. The web exists today only because of the research needs of physicists at the European Organisation for Nuclear Research, Cern.

But this isn't the only case where the research needs of Cern's scientists have led to innovations in web technologies.

In 1987 Cern worked with a US start-up with only 20 employees to develop and deploy one of the first routers in Europe - the ASM/2-32EM - to act as a firewall between Cern's public Ethernet and its supercomputer. That company was Cisco. Today, it has more than 63,000 employees.

And the innovations haven't stopped. In 2005, the physics laboratory built the first working intercontinental 10 Gigabit Ethernet wide area network to process the large amounts of data from the Large Hadron Collider (LHC) particle accelerator project. Applications like this are now rising to prominence in areas such as finance and banking, according to analyst firm Gartner.

So if the technologies at Cern predict future commercial trends in internet technology, what is the department working on at the moment and what could be next for the public face of the internet? One area is in using database technology to handle the masses of information generated by its computing grid.

Cern will be using one of the biggest computer grids this summer to pool the processing power of about 100,000 CPUs worldwide. It will process information at a rate of 1Gbps, said Francois Grey, head of Cern's IT communications team.

"The experiment will produce roughly 15 petabytes (15 million Gbytes) of data a year - enough to fill 100,000 DVDs," he said.
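To put the annual figure in perspective, the short sketch below converts petabytes to gigabytes and works out the average sustained data rate the volume implies. It assumes decimal units (1 petabyte = 10^15 bytes) and that the data is spread evenly over a 365-day year. Both are simplifying assumptions for illustration, not figures from Cern, and the function names are made up for this example.

```python
# Back-of-the-envelope arithmetic for the annual data volume.
# Assumptions (not from Cern): decimal units and an evenly spread year.
PETABYTE = 10**15              # bytes
GIGABYTE = 10**9               # bytes
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def petabytes_to_gigabytes(petabytes):
    """Convert a volume in petabytes to gigabytes (decimal units)."""
    return petabytes * PETABYTE / GIGABYTE


def average_rate_gbps(petabytes_per_year):
    """Average rate, in gigabits per second, needed to move the yearly volume."""
    bits_per_year = petabytes_per_year * PETABYTE * 8
    return bits_per_year / SECONDS_PER_YEAR / 10**9


yearly_gigabytes = petabytes_to_gigabytes(15)
yearly_rate = average_rate_gbps(15)
print(f"{yearly_gigabytes:,.0f} Gbytes a year")
print(f"about {yearly_rate:.1f} Gbps on average")
```

Under these assumptions, 15 petabytes a year is 15 million Gbytes, and moving it all in a year takes an average of a little under 4 Gbps around the clock.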

The constant requirement for as much data processing power as possible led Cern to become one of the first users of clustering technology, starting in 1996. It pioneered the use of clusters of low-cost Linux servers working together as one large, powerful machine. Cern helped develop software to ensure that the reliability and virtualisation capabilities of databases could be extended seamlessly across a cluster of commodity servers, greatly reducing the cost of high-performance computing.

Cern has also pushed database-clustering technology further to enable a single database to run across a number of distributed computers. The LCG database deployment project has set up a worldwide distributed database infrastructure for the LHC.

It will do this using a program called Oracle Streams to capture, filter and synchronise data stores worldwide.

The software allows users to control what information is put into a stream - the connection between the primary data capture and its destination or destinations - as well as how the stream is routed to nodes worldwide, what happens to events in the stream and how the stream terminates. By specifying the configuration of the elements acting on the stream, a user can filter and manage data in a more meaningful way.
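The capture, filter and route idea described above can be illustrated with a small conceptual sketch. This is not the Oracle Streams API: the ChangeEvent, Node and Stream classes, the table names and the site names are all hypothetical, chosen only to show how a user-supplied rule might decide which changes enter a stream and which nodes receive them.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChangeEvent:
    """One captured change: the table it touched and the new row values."""
    table: str
    row: dict


@dataclass
class Node:
    """A destination database; it simply records the events applied to it."""
    name: str
    applied: List[ChangeEvent] = field(default_factory=list)

    def apply(self, event: ChangeEvent) -> None:
        self.applied.append(event)


@dataclass
class Stream:
    """Connects a capture point to destinations, guarded by a filter rule."""
    rule: Callable[[ChangeEvent], bool]
    destinations: List[Node]

    def publish(self, event: ChangeEvent) -> int:
        """Deliver the event to every destination if the rule accepts it.

        Returns the number of nodes that received the event.
        """
        if not self.rule(event):
            return 0  # filtered out: the event never enters the stream
        for node in self.destinations:
            node.apply(event)
        return len(self.destinations)


# Hypothetical set-up: replicate only the "conditions" table to two sites.
site_a = Node("site-a")
site_b = Node("site-b")
stream = Stream(rule=lambda e: e.table == "conditions",
                destinations=[site_a, site_b])

delivered = stream.publish(ChangeEvent("conditions", {"run": 1}))
skipped = stream.publish(ChangeEvent("scratch", {"tmp": True}))
print(delivered, skipped, len(site_a.applied), len(site_b.applied))
```

In this toy version the rule is a plain function and delivery is immediate. A production replication system would also need to handle queuing, ordering and failures, which the sketch leaves out.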

"The amount of data people are using on the web is only going to grow as pipes get fatter and connection speeds are ramped up. As the architectures for high-speed networks are installed, they will only be any good if the underlying databases are able to deal with gigabytes and maybe even petabytes of data," said Grey.

For companies with global operations, keeping mass stores of data synchronised will be the next challenge, especially as data processing requirements increase.

"For us, monitoring the database and streams performance has been key towards maintaining grid control and in optimising any larger scale set-up," said Grey.

While the challenges at Cern remain unresolved at present, history would indicate that synchronising databases across grid set-ups and dealing with petabytes of data on an annual basis will be a challenge for commercial organisations further down the line.

And if the work at Cern has shown one thing over time, it is the organisation's willingness to share the solutions to its problems with the wider world.
