IBM is to sponsor IT research at Cern, the European Organisation for Nuclear Research, testing its future storage networking technology, Storage Tank, at the Cern openlab for DataGrid applications.
In 2007, Cern expects to begin operating the Large Hadron Collider (LHC), a particle accelerator that will bring protons and ions into head-on collisions at higher energies than ever achieved before.
The experiment, which aims to recreate the conditions prevailing in the early universe, just after the Big Bang, will generate around 10 petabytes (10 million gigabytes) of data each year.
Cern will make that volume of data available to the scientific community for analysis by building a distributed data storage and grid computing network, accessible to researchers around the world.
"The drive is to get a working grid up that can deal with the petabytes of data coming out of the LHC by 2007," said François Grey, Cern openlab development officer.
"We are investigating techniques that are not yet commercial but will be by the time LHC is running," he said, adding that it would also be an opportunity for Cern's industrial partners to test their technology in real-world applications.
The first two industrial sponsors, Hewlett-Packard and Enterasys Networks, joined the effort last September. HP contributed a 32-node cluster of computers built around Intel's Itanium 2 processors. Enterasys donated a 10Gbps Ethernet network to connect them, and agreed to provide engineering assistance and product and technology forums for a total investment of around $1.5m.
For its part, IBM will supply 20Tbytes of disk storage, a cluster of six eServer xSeries systems running Linux, and on-site engineering support to a total value of $2.5m. The equipment will be delivered by the end of the year.
That 20Tbytes of storage is a long way from the volume that Cern ultimately envisages, but the goal is to bring in more storage progressively, so as to conduct tests with around a petabyte of storage by 2005, Grey said.
With the collider generating 100Mbytes of data per second in operation, the data management task is huge.
"It's really out of the scope of traditional network-attached storage. When you have these quantities of data, managing and organizing them is a problem," said Brian Carpenter, distinguished engineer at IBM Systems Group. "That's where Storage Tank comes in."
Storage Tank uses metadata servers to keep track of where data is located. Network clients ask the servers where to find the data they want, then download it straight from the network storage devices where it is located - rather like the way the Internet's DNS (Domain Name System) points clients towards hosts, but does not intervene in the transfer of data from them, Carpenter said.
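The lookup-then-fetch pattern Carpenter describes can be illustrated with a minimal Python sketch. It is not IBM's software or API: the class names (MetadataServer, StorageDevice, Client), the paths and the device names are all hypothetical, and the "network" is simply in-process objects. The only point it makes is that the metadata server answers the question "where is it?" while the data itself moves directly between client and storage device.

```python
from dataclasses import dataclass, field


@dataclass
class StorageDevice:
    """Hypothetical network storage device that holds the actual data."""
    name: str
    blobs: dict = field(default_factory=dict)  # path -> bytes
    reads_served: int = 0

    def store(self, path: str, data: bytes) -> None:
        self.blobs[path] = data

    def read(self, path: str) -> bytes:
        self.reads_served += 1
        return self.blobs[path]


@dataclass
class MetadataServer:
    """Hypothetical metadata server: records locations, never carries data."""
    locations: dict = field(default_factory=dict)  # path -> device name
    lookups_served: int = 0

    def register(self, path: str, device_name: str) -> None:
        self.locations[path] = device_name

    def locate(self, path: str) -> str:
        self.lookups_served += 1
        return self.locations[path]


class Client:
    """Hypothetical client: asks where the data is, then reads it directly."""

    def __init__(self, metadata: MetadataServer, devices: dict):
        self.metadata = metadata
        self.devices = devices  # device name -> StorageDevice

    def fetch(self, path: str) -> bytes:
        device_name = self.metadata.locate(path)      # step 1: ask where
        return self.devices[device_name].read(path)   # step 2: direct read


def build_demo():
    """Assemble a two-device example with made-up paths and contents."""
    disk_a = StorageDevice("disk-a")
    disk_b = StorageDevice("disk-b")
    metadata = MetadataServer()

    disk_a.store("/lhc/run1.dat", b"collision events, run 1")
    metadata.register("/lhc/run1.dat", disk_a.name)
    disk_b.store("/lhc/run2.dat", b"collision events, run 2")
    metadata.register("/lhc/run2.dat", disk_b.name)

    client = Client(metadata, {disk_a.name: disk_a, disk_b.name: disk_b})
    return client, metadata, disk_a, disk_b


if __name__ == "__main__":
    client, metadata, disk_a, disk_b = build_demo()
    print(client.fetch("/lhc/run2.dat"))
    print("lookups:", metadata.lookups_served, "reads on disk-b:", disk_b.reads_served)
```

As with DNS in Carpenter's analogy, the metadata server in this sketch only ever returns a location. The read counters show that the bytes are served by the storage device alone.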
IBM will use the project as a testbed for this storage virtualisation and file management technology, which it says will play a pivotal role in its work with Cern.
This implementation of Storage Tank will use the iSCSI San (storage area network) protocol, running over 10Gbps Ethernet, but "the way Storage Tank is designed, it could be over any San in the back end," Carpenter said.
The system runs principally on Linux, but the idea is to make the software more widely available than that, particularly the client software needed to integrate with the local filing system, he said.
The Storage Tank client software will work with the Windows, AIX, Solaris and Linux operating systems.
Carpenter said there are other applications of considerable economic importance that involve the scientific study of similarly large data sets, such as the analysis of seismological data for oil exploration.