How to manage a universe of data

In the following two case studies we look at how to effectively track and store vast amounts of data to enable both long-term storage and instant retrieval.

At the heart of a planet-hunting research project based at Leicester University lies a huge amount of data storage.

In January 2005, the university implemented a massive integrated storage system based on tape and disc storage as a repository for 100Tbytes of planetary observation data.

The IT project started when the Department of Physics and Astronomy at the university went on the hunt for a system to store image data captured by the Wide Angle Search for Planets (Wasp) Consortium project, a collaborative venture involving a number of universities in the UK.

Wasp identifies new planets by searching for slight dips in the brightness of a star as a planet passes in front of it and blocks some of its light. Wasp telescopes record tens of thousands of stars every minute, and can send 8,000 images back to the university every night.
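The dip-detection idea can be illustrated with a minimal sketch. This is not the Wasp consortium's actual analysis pipeline, which the article does not describe. The function name `find_dips`, the 1% threshold and the sample brightness values are all assumptions made for illustration.

```python
from statistics import median


def find_dips(brightness, threshold=0.01):
    """Return the indices of samples that fall more than `threshold`
    (as a fraction of the median brightness) below the median.

    Hypothetical helper: real transit searches are far more involved.
    """
    baseline = median(brightness)
    return [
        i
        for i, value in enumerate(brightness)
        if (baseline - value) / baseline > threshold
    ]


# Hypothetical, normalised brightness readings for one star.
# Samples 3 to 5 are about 1.5% dimmer than the rest.
light_curve = [1.000, 1.001, 0.999, 0.985, 0.984, 0.986, 1.000, 1.002]

print(find_dips(light_curve))
```

Using the median rather than the mean as the baseline keeps the dip itself from dragging the reference level down, which matters when the dimmed samples make up a noticeable share of the readings.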

Scientists analyse the data, searching for evidence of new planets, and this creates yet more data. The images and data are stored at the University of Leicester in a database recording observations of tens of millions of different stars.

The observational data from Wasp is available to academics worldwide, and over the initial five-year life of the project it is estimated 100Tbytes of data will be collated.

The university was looking for an integrated system that came in at a competitive price, and which could allow it to add capacity as it was required.

Among the storage systems the university considered were large hierarchical arrays of discs with traditional back-up tape cabinets. However, few of the systems appealed, said Richard West, research fellow at Leicester University, who heads the university's work on the Wasp project.

Instead, West was attracted to a proposition from storage provider SGI. The two organisations already had a good relationship and the university had bought a significant amount of hardware from SGI.

SGI's proposal was to build a multi-supplier storage system, with SGI being a single point of contact and support for the equipment. This, in addition to the £350,000 price tag, made the system the most attractive for the university.

The system itself uses tape, disc and storage software from SGI, Engenio and Adic. Adic's Scalar i2000 tape library supplies 140Tbytes of tape storage to hold the vast amounts of raw data that the telescopes produce.

SGI's TP9300, based on technology from Engenio, offers an additional 30Tbytes of disc space for fast access to the smaller, processed data files. The university uses SGI's DMF data migration software to retrieve data quickly from the Adic tape library - almost as quickly as from the disc system.

The university receives raw planetary observation data which originates at telescopes on the island of La Palma in the Canary Islands and in South Africa. This data is shipped by courier to the UK and processed at other universities, which include Keele and Queen's Belfast.

These facilities then send the data via the Janet Internet network to Leicester University, which applies some secondary processing using a cluster of Linux servers to organise and log the data.

The system has proved very reliable, but the main challenge was in managing the data that came in, and keeping track of where it was stored.

"The system presents us with a file system that is 200Tbytes in size, so if you do not manage what you are doing you are in trouble," said West.

The university therefore developed a Linux-based database application that runs on a low-cost AMD Opteron server, to keep track of the data. "An important part of our strategy is delivering a cost effective solution," said West.
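The article gives no details of how the university's tracking application works. As a rough illustration of the general idea, a catalogue that records which storage tier each file sits on, the sketch below uses Python's built-in `sqlite3` module. The table layout, function names and file names are all hypothetical.

```python
import sqlite3


def open_catalogue(path=":memory:"):
    """Open (or create) a hypothetical catalogue of stored files."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS files (
               name TEXT PRIMARY KEY,
               tier TEXT NOT NULL CHECK (tier IN ('disc', 'tape')),
               size_bytes INTEGER NOT NULL
           )"""
    )
    return conn


def record_file(conn, name, tier, size_bytes):
    """Record, or update, where a file is stored and how large it is."""
    conn.execute(
        "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
        (name, tier, size_bytes),
    )
    conn.commit()


def locate(conn, name):
    """Return the tier holding `name`, or None if it is not catalogued."""
    row = conn.execute(
        "SELECT tier FROM files WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


def usage_by_tier(conn):
    """Return the total bytes catalogued on each tier."""
    return dict(
        conn.execute("SELECT tier, SUM(size_bytes) FROM files GROUP BY tier")
    )


# Hypothetical entries: one raw image on tape, one processed file on disc.
catalogue = open_catalogue()
record_file(catalogue, "raw_image_0001", "tape", 8_000_000)
record_file(catalogue, "processed_0001", "disc", 50_000)
print(locate(catalogue, "raw_image_0001"))
print(usage_by_tier(catalogue))
```

Keeping the catalogue in a small database separate from the storage system itself means a lookup never has to touch the tape library just to find out where a file lives.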

One major benefit of the SGI storage system is that it offers flexible licensing through what is termed capacity on demand - which means that SGI sends additional storage cartridges and cabinets as they are required.

In 20 months of using the system, Leicester University has doubled its tape capacity, which it did not originally anticipate. This flexibility was the reason it chose capacity on demand. "University project funding often comes in dribs and drabs, and that sort of system allows us to buy capacity when we can afford it," said West.

"Our primary goal was to get as much storage as possible, and we calculated this on the split between disc and tape, with disc being a fraction for live data. Over time, we have added more tape but not more disc. We found that our usage patterns have changed and we are using tape for long-term storage," said West.

He said that although it is hard to measure a return on investment for the system, he believed that the management software was good and that the hardware was reliable. "It sits there and ticks away," he said.

In terms of expanding the system beyond 200Tbytes, West said, "We have enough storage for our projected data acquisitions over the next year or two, and then we will see how the funding goes."

How document management made the difference

Birmingham City Council is one of Europe’s largest councils, employing 55,000 staff and providing services to more than one million citizens. As a result of its size, the council’s data storage requirements are vast.

The council needed to adopt a new storage platform which had document management technology integrated into it so it could scan and store documents.

The council had a number of libraries that were full of documents, with more arriving each day. For example, it was receiving 800 planning applications a day.

Eman Al-Hillawi, electronic document management system project manager, said the problem hit home during an office move. “We had to dedicate a full floor to our physical archive,” he said.

In addition, the council was predominantly using paper-based processes, which were inefficient and time consuming. The council needed to meet e-government objectives, cut costs, conserve space and improve document control and access.

Al-Hillawi said, “We wanted to improve service to citizens. With paper files, only one council employee at a time can hold the master version of a document. That makes it difficult for other staff members to have timely access to all of the latest materials they need to make decisions.”

The council tested a number of different storage systems, using professional contractors and in-house benchmark testers, eventually choosing a unified storage system from NetApp.

This system met the various criteria of performance, connectivity, scalability and disaster recovery, said Andrew Jones, server manager at the council.

The architecture of the NetApp EDMS system was based on a NetApp FAS920c, a mid-range storage filer unit that can scale up to 12Tbytes. This is used to consolidate the council’s primary data for an EMC Documentum document management application, as well as for storing user files.

The council deployed the system across its planning and urban design service areas, scanning and indexing all its paper documents. More than 150,000 documents have now been added to the system, including planning applications, project documents, drawings, photographs and specifications. The storage system also supports Autocad design images, voice files and video recordings.

The council linked its Oracle database into the storage system, and 1,500 council employees now access Oracle-based Documentum and Windows files stored on the NetApp system.
