Strategies for long-term data retention

Organisations increasingly need to retain digital documents indefinitely for legal, administrative or historical purposes, and many IT managers are grappling with how to preserve electronic documents and information for 100 years or more.

Some of the issues they face include estimating the lifespan of storage materials, the potential obsolescence of file formats, backwards compatibility with applications and operating systems, and whether methods of tagging data will still work in the future.

Libraries, media firms, local government and insurers are among the organisations investigating the implications of long-term storage.

Case study: British Library

The British Library is leading the field in research into long-term storage, being an organisation with a legal right, and an obligation, to store copies of every public document and many electronic files.

The library is pioneering long-term digital storage and archiving technologies in Europe, and it is looking beyond 100 years, says Roderic Parker, communications officer of the Digital Object Management programme at the British Library.

"As a national library, we have a duty to preserve particular documents, and we have paper materials that date back to the dawn of written documents," he says.

The British Library has a team of 12 full-time experts, and 12 more staff involved in the library's various storage and archiving projects. The team is wrestling with challenges such as technology obsolescence, selecting the right storage media, hardware compatibility, the longevity of operating systems, obsolescence of the formatting and structure, and document security.

"We are having to decide how many versions of how many word processors we use - and how to retain computer aided design files, photographic images, and sound and media files. Everything has challenges, but we have so far been concentrating on building a secure storage system," says Parker.

In July 2007, the British Library set up its third data storage site, giving it two storage centres on its own premises in London and Boston Spa, and a new one at the Library of Wales.

"We want to guarantee to the user that whether it is 50 or 200 years away, we will have a faithful replica of what we took in during, say, 2007," says Parker.

The British Library stores 20 million books and manuscripts, 4.5 million maps, 56 million patents, and 3.5 million sound recordings, as well as 58 million other items. It is also required to store electronic journals, web archives, and digitally-published books and newspapers.

The library has a policy of digital preservation so that it can guarantee that it has the genuine article when it comes to a particular document, whether that is a digitally stored picture or a text that dates back several centuries.

As a result, the team runs a programme of continuous digitisation and digital preservation. In practical terms, this means scanning documents and either migrating them to the newest versions of the software or emulating older applications on current systems, so that files originally stored on older media remain accessible.

In addition, the library stores multiple versions of a file and extracts the data file's "bit stream", using cryptographic methods to form a digest of the file.

"This way we can detect any changes that may be due to deterioration in storage. Also, by having several copies of a document, the chance of having three of them changing at the same time is pretty minimal," says Parker.
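The idea Parker describes can be sketched in a few lines of Python. This is only an illustration, not the library's actual system: SHA-256 is assumed as the digest, and the site names and the function find_damaged_copies are hypothetical. The sketch computes a digest for each stored copy and flags any copy whose digest disagrees with the majority.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of a file's bit stream as a hex string."""
    return hashlib.sha256(data).hexdigest()


def find_damaged_copies(copies: dict[str, bytes]) -> tuple[str, list[str]]:
    """Return the majority digest and the names of copies that disagree with it."""
    digests = {name: digest(data) for name, data in copies.items()}
    majority, _count = Counter(digests.values()).most_common(1)[0]
    damaged = sorted(name for name, d in digests.items() if d != majority)
    return majority, damaged


# Hypothetical example: three copies, one of which has silently changed.
original = b"An item taken in during 2007"
copies = {
    "site_a": original,
    "site_b": original,
    "site_c": b"An item taken in during 2O07",  # simulated deterioration
}

majority, damaged = find_damaged_copies(copies)
print(damaged)  # ['site_c']
```

A majority vote only works while most copies are still intact, which is why the chance of several copies changing at the same time matters.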

He says that the library is doing a lot of work with data "ingest" - getting material into the data store.

"There are huge problems with the structure of the material - the metadata - and if you do not get the metadata right, the end-user cannot be sure that the data itself is correct," he says.

Some of the issues that the team is addressing include recognising file formats and fully understanding how the various formats work.

The team favours using open standards file formats, mainly to avoid the problems of format obsolescence and supplier lock-in.

However, it did adopt Microsoft's Office Open XML standard, which is approved by standards organisation ECMA. The library was a leading partner in the lobby that was trying to open up Microsoft's Office specifications so that its formats would be more transparent.

The British Library also manages the European Union's Planets research initiative for document preservation and long-term access through networked services. The four-year project started in 2006 and is co-funded by the EU.

The British Library established its Digital Object Management (Dom) programme to solve the problem of long-term storage of digital documents.

The Dom team developed a system of dating and archiving electronic documents equivalent to the traditional approach to archiving publications.

The traditional approach starts with date-stamping content as it is received. The chemical composition of the physical items is then examined to establish authenticity by checking whether the paper, ink and binding are contemporary. The items are also checked for signs of tampering.

The digital equivalent that the Dom team developed is a secure storage system that uses digital document signing. "The library wanted a secure storage solution that would ensure that no material is lost or altered," says Parker.

Technology obsolescence was a big problem, so the library pursued hardware and software platform independence as much as possible to hedge against it.

With a lack of publishing standards and a huge variety of formats being used, including Word, Excel, PDF and HTML, there is not a straightforward way to tackle this problem, apart from storing files in multiple formats.

The library chose an algorithm-based digital document signing system from nCipher. This provides a precise time stamp and an individual public key infrastructure-based signature for every item stored in the library.

The application, nCipher's Time Stamp Server, seals the digital file, storing it four times in multiple formats on different brands of storage to limit the chance of losing data.

The nCipher system calculates an abstract numerical value from the information stored and notifies the British Library whenever that value changes, indicating an alteration. This allows the library to find and reinstate the unaltered earlier version of the document in each instance.

The system also uses an external link to an official timing authority, so that when the value calculated matches the one originally entered, the library can say categorically that the item is the genuine article, and that it is exactly as it was when it was entered into the system: whether that was five minutes or 500 years ago.
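The seal-and-verify cycle described above might look roughly like the following Python sketch. It is not nCipher's API. Several things here are assumptions made for illustration: an HMAC with a demo key stands in for the public key infrastructure-based signature, the time stamp is passed in by the caller rather than obtained from an official timing authority, and the names seal, is_genuine and reinstate are hypothetical.

```python
import hashlib
import hmac
from datetime import datetime, timezone

# Assumption: a shared demo key stands in for a PKI private key.
DEMO_KEY = b"demo-key-not-for-production"


def _tag(digest: str, stamp: str) -> str:
    """Bind the digest and the time stamp together with a keyed hash."""
    return hmac.new(DEMO_KEY, f"{digest}|{stamp}".encode(), hashlib.sha256).hexdigest()


def seal(data: bytes, when: datetime) -> dict:
    """Create a record of the item's digest and the time it was taken in."""
    digest = hashlib.sha256(data).hexdigest()
    stamp = when.isoformat()
    return {"digest": digest, "timestamp": stamp, "tag": _tag(digest, stamp)}


def is_genuine(data: bytes, record: dict) -> bool:
    """True only if neither the record nor the content has been altered."""
    record_intact = hmac.compare_digest(
        _tag(record["digest"], record["timestamp"]), record["tag"]
    )
    content_intact = hmac.compare_digest(
        hashlib.sha256(data).hexdigest(), record["digest"]
    )
    return record_intact and content_intact


def reinstate(copies: list[bytes], record: dict) -> list[bytes]:
    """Replace every altered copy with a copy that still matches the record."""
    good = next((c for c in copies if is_genuine(c, record)), None)
    if good is None:
        raise ValueError("no unaltered copy is left to restore from")
    return [c if is_genuine(c, record) else good for c in copies]


# Hypothetical example: four stored copies, one of which has been altered.
original = b"digitally published newspaper, July 2007"
record = seal(original, datetime(2007, 7, 1, tzinfo=timezone.utc))
altered = original.replace(b"2007", b"2008")
repaired = reinstate([original, altered, original, original], record)
```

In this sketch the time stamp is covered by the same tag as the digest, so changing the recorded date invalidates the record just as changing the content does.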

Case study: Ordnance Survey

Mapping agency Ordnance Survey has developed a system to create and maintain its digital database of data images, which it intends to store "forever".

It previously kept its data, including 700Mbyte digital aerial images, on Sata disc-based storage arrays. But this became impractical and costly as the firm collects more than 40Tbytes of photographic data every year.

It now stores its data on Ultra Density Optical (UDO) write once, read many storage from Plasmon, and couples this with BridgeHead Software's automated, policy-based archive software, HT Filestore. The system uses intelligent automated policies to archive the data from disc to the UDO storage.

Once the data is confirmed safe in the archive, HT Filestore removes the content of the file, leaving only a 1Kbyte placeholder on the disc. This placeholder allows the file content to be accessed from the archive transparently.
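The placeholder mechanism can be illustrated with a short Python sketch. This is not how HT Filestore is implemented; it is a simplified model under stated assumptions: the archive is just another directory, the placeholder is a small JSON file, and the function names archive_and_stub and read_through are hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Digest of a file's content, used to confirm the archive copy is safe."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def archive_and_stub(source: Path, archive_dir: Path) -> Path:
    """Copy a file to the archive, verify it, then replace it with a small stub."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / source.name
    before = sha256_of(source)
    shutil.copy2(source, target)
    if sha256_of(target) != before:
        raise IOError("archive copy does not match the original")
    # Only after verification is the original content replaced by a placeholder.
    source.write_text(json.dumps({"archived_to": str(target), "sha256": before}))
    return target


def read_through(stub_path: Path) -> bytes:
    """Follow a placeholder to the archived content (expects a stub file)."""
    stub = json.loads(stub_path.read_text())
    return Path(stub["archived_to"]).read_bytes()


# Hypothetical example using temporary directories for "disc" and "archive".
with tempfile.TemporaryDirectory() as tmp:
    disc = Path(tmp) / "disc"
    disc.mkdir()
    image = disc / "aerial_0001.tif"
    payload = bytes(range(256)) * 4096  # 1 MiB stand-in for an aerial image
    image.write_bytes(payload)

    archive_and_stub(image, Path(tmp) / "udo")
    stub_size = image.stat().st_size  # size of the placeholder left on disc
    restored = read_through(image)    # content fetched via the placeholder
```

The order of operations is the point of the sketch: the content on primary storage is removed only once the archive copy has been confirmed to match.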

Ordnance Survey plans to maintain the digital mapping archive in perpetuity. The UDO media has a lifespan of up to 50 years, and the software provides media migration capabilities to move data to next-generation media in the future.

Constantly transferring the data from disc to media that is powered down except when accessed reduces power consumption, as well as the hardware and management costs of keeping data on primary storage, says Dave Lipsey, information systems infrastructure manager at Ordnance Survey.

"Data is 91% of our revenue so we are very sensitive about its longevity, value and currency," he says.

A looming data retention crisis?

In August, the Storage Networking Industry Association (SNIA) Europe produced a "100 year archive requirements" report, which identified a looming long-term information retention crisis.

Eighty per cent of respondents said they needed to keep information for more than 50 years, and 68% of respondents said they must keep data for more than 100 years.

Juergen Arnold, chairman of SNIA Europe, says, "More and more directives are being published at global, pan-European and country-level requiring that organisations preserve data in a safe and accessible format for decades. This should be an essential element of most storage strategies going forward."

Arnold says users need to understand which information they need to retain long-term, and what they need to dispose of, and then apply an IT strategy.

"There is no format in the industry that will carry you through the next 100 years, as a microfiche did in the past. You need to use an open digital format, planning to use it for between five and 10 years, and prepare for a technology refresh after that," he says.

"Good storage tapes last for 15 years, but will you have the interface technology to read back the information, or the interface to connect to it? This is why the technology refresh is important."
