
Organisations increasingly need to retain digital
documents indefinitely for legal, administrative or historical
purposes, and many IT managers are grappling with how to preserve
electronic documents and information for 100 years or
more.
Some of the issues they face include estimating the lifespan of
storage materials, the potential obsolescence of file formats,
backwards compatibility with applications and operating systems,
and whether methods of tagging data will still work in the
future.
Libraries, media firms, local government and insurers are among
the organisations investigating the implications of long-term
storage.
Case study: British Library
The British Library is leading the field in research into
long-term storage, being an organisation with a legal right, and an
obligation, to store copies of every public document and many
electronic files.
The library is pioneering long-term digital storage and
archiving technologies in Europe, and it is looking beyond 100
years, says Roderic Parker, communications officer of the Digital
Object Management programme at the British Library.
"As a national library, we have a duty to preserve particular
documents, and we have paper materials that date back to the dawn
of written documents," he says.
The British Library has a team of 12 full-time experts, and 12
more staff involved in the library's various storage and archiving
projects. The team is wrestling with challenges such as technology
obsolescence, selecting the right storage media, hardware
compatibility, the longevity of operating systems, the obsolescence
of file formats and structures, and document security.
"We are having to decide how many versions of how many word
processors we use - and how to retain computer aided design files,
photographic images, and sound and media files. Everything has
challenges, but we have so far been concentrating on building a
secure storage system," says Parker.
In July 2007, the British Library set up its third data storage
site. It now has two storage centres on its own premises, in London
and Boston Spa, and a new one at the National Library of Wales.
"We want to guarantee to the user that whether it is 50 or 200
years away, we will have a faithful replica of what we took in
during, say, 2007," says Parker.
The British Library stores 20 million books and manuscripts, 4.5
million maps, 56 million patents, and 3.5 million sound recordings,
as well as 58 million other items. It is also required to store
electronic journals, web archives, and digitally-published books
and newspapers.
The library has a policy of digital preservation so that it can
guarantee that it has the genuine article when it comes to a
particular document, whether that is a digitally stored picture or
a text that dates back several centuries.
As a result, the team runs a programme of continuous digitisation
and digital preservation. In practical terms, this means scanning
documents and migrating files to the newest versions of the
software, or emulating older applications on current systems so
that files originally created with older software, and stored on
older media, remain accessible.
In addition, the library stores multiple versions of a file and
extracts the data file's "bit stream", using cryptographic methods
to form a digest of the file.
"This way we can detect any changes that may be due to
deterioration in storage. Also, by having several copies of a
document, the chance of having three of them changing at the same
time is pretty minimal," says Parker.
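The digest comparison Parker describes can be sketched in a few lines
of Python. This is an illustration only: the article does not say which
digest algorithm the library uses, so SHA-256 is an assumption, and the
replica names and the `find_damaged` helper are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of a file's bit stream as a hex string."""
    return hashlib.sha256(data).hexdigest()


def find_damaged(replicas: dict[str, bytes], recorded: str) -> list[str]:
    """Return the names of replicas whose digest differs from the recorded one."""
    return sorted(
        name for name, data in replicas.items() if digest(data) != recorded
    )


# The digest is recorded once, when the item is first taken in.
original = b"faithful replica, 2007"
recorded = digest(original)

# Hypothetical replicas held at three sites; one has suffered simulated bit rot.
replicas = {
    "london": original,
    "boston_spa": original,
    "wales": b"faithful replica, 2OO7",
}

damaged = find_damaged(replicas, recorded)
print(damaged)
```

Because each copy is checked against the recorded digest independently,
a single damaged copy can be identified and replaced from any copy that
still matches.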
He says that the library is doing a lot of work with data
"ingest" - getting material into the data store.
"There are huge problems with the structure of the material -
the metadata - and if you do not get the metadata right, the
end-user cannot be sure that the data itself is correct," he
says.
Some of the issues that the team is addressing include
recognising file formats and fully understanding how the various
formats work.
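One common way to recognise a file format is to inspect the first few
bytes of the file for a known signature, often called a "magic number".
The sketch below is a minimal illustration of that general technique,
not a description of the library's tools; the `identify` function and
the short signature table are assumptions made for this example.

```python
# Well-known leading byte signatures for a handful of formats.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"PK\x03\x04", "ZIP container (e.g. Office Open XML)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (e.g. legacy Word, Excel)"),
    (b"RIFF", "RIFF container (e.g. WAV audio)"),
]


def identify(header: bytes) -> str:
    """Guess a format from the leading bytes of a file; 'unknown' if no match."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


print(identify(b"%PDF-1.4\n"))
print(identify(b"PK\x03\x04" + b"\x00" * 26))
print(identify(b"plain text with no signature"))
```

A signature only narrows the possibilities. A ZIP container, for
example, could hold a word-processing document, a spreadsheet or an
ordinary archive, which is one reason understanding how each format
works matters as much as recognising it.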
The team favours using open standards file formats, mainly to
avoid the problems of format obsolescence and supplier lock-in.
However, it did adopt Microsoft's Office Open XML standard, which
is approved by standards organisation ECMA. The library was a
leading partner in the lobby that was trying to open up Microsoft's
Office specifications so that its formats would be more
transparent.
The British Library also manages the European Union's Planets
research initiative for document preservation and long-term access
through networked services. The four-year project started in 2006
and is co-funded by the EU.
The British Library established its Digital Object Management
(Dom) programme to solve the problem of long-term storage of
digital documents.
The Dom team developed a system of dating and archiving
electronic documents equivalent to the traditional approach to
archiving publications.
The traditional approach starts with date-stamping content as it
is received. The chemical composition of each physical item is
then examined to establish authenticity by checking whether the
paper, ink and binding are contemporary. Items are also checked
for signs of tampering.
The digital equivalent that the Dom team developed is a secure
storage system that uses digital document signing. "The library
wanted a secure storage solution that would ensure that no material
is lost or altered," says Parker.
Technology obsolescence was a big problem, so the library
pursued hardware and software platform independence as much as
possible to hedge against it.
With a lack of publishing standards and a huge variety of
formats being used, including Word, Excel, PDF and HTML, there is
no straightforward way to tackle this problem, apart from
storing files in multiple formats.
The library chose an algorithm-based digital document signing
system from nCipher. This provides a precise time stamp and an
individual public key infrastructure-based signature for every item
stored in the library.
The application, nCipher's Time Stamp Server, seals the digital
file, storing it four times in multiple formats on different brands
of storage to limit the chance of losing data.
By calculating an abstract numerical value based on the
information stored, the nCipher system notifies the British Library
every time an alteration is detected. This allows the library to
find and reinstate the unaltered earlier version of the document in
each instance.
The system also uses an external link to an official timing
authority, so that when the value calculated matches the one
originally entered, the library can say categorically that the item
is the genuine article, and that it is exactly as it was when it
was entered into the system: whether that was five minutes or 500
years ago.
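The seal, verify and reinstate cycle described above can be illustrated
with Python's standard library. This is a sketch under stated
assumptions, not nCipher's product or API: a keyed HMAC stands in for
the public key infrastructure-based signature, the time stamp is passed
in rather than obtained from an external timing authority, and the
names `seal`, `verify`, `reinstate` and `SECRET_KEY` are hypothetical.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

# Hypothetical demo key. A PKI-based system would sign with a private key instead.
SECRET_KEY = b"demo-key-not-for-production"


def _sign(sha256_hex: str, timestamp: str) -> str:
    """Keyed signature over the digest and the time stamp together."""
    payload = json.dumps(
        {"sha256": sha256_hex, "timestamp": timestamp}, sort_keys=True
    ).encode()
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def seal(data: bytes, when: datetime) -> dict:
    """Build a seal record: digest, time stamp and a signature over both."""
    sha256_hex = hashlib.sha256(data).hexdigest()
    timestamp = when.isoformat()
    return {
        "sha256": sha256_hex,
        "timestamp": timestamp,
        "signature": _sign(sha256_hex, timestamp),
    }


def verify(data: bytes, record: dict) -> bool:
    """True only if the seal record is intact and the data still matches it."""
    expected = _sign(record["sha256"], record["timestamp"])
    seal_intact = hmac.compare_digest(expected, record["signature"])
    return seal_intact and hashlib.sha256(data).hexdigest() == record["sha256"]


def reinstate(copies: list[bytes], record: dict) -> list[bytes]:
    """Replace every altered copy with one that still verifies."""
    good = next((c for c in copies if verify(c, record)), None)
    if good is None:
        raise ValueError("no unaltered copy remains")
    return [c if verify(c, record) else good for c in copies]


item = b"deposited item"
record = seal(item, datetime(2007, 7, 1, tzinfo=timezone.utc))

# Four stored copies, one of which has been altered.
copies = [item, b"dep0sited item", item, item]
repaired = reinstate(copies, record)
print(all(verify(c, record) for c in repaired))
```

Signing the digest and the time stamp together means that neither can
be changed later without the mismatch being detected.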
Case study: Ordnance Survey
Mapping agency Ordnance Survey has developed a system to create
and maintain its digital database of data images, which it intends
to store "forever".
It previously kept its data, including 700Mbyte digital aerial
images, on Sata disc-based storage arrays. But this became
impractical and costly because the agency collects more than
40Tbytes of photographic data every year.
It now stores its data on Ultra Density Optical (UDO) write-once,
read-many storage from Plasmon, and couples this with
BridgeHead Software's automated, policy-based archive software, HT
Filestore. The system uses automated policies to archive the data
from disc to the UDO storage.
Once the data is confirmed safe in the archive, HT Filestore
removes the content of the file, leaving only a 1Kbyte placeholder
on the disc. This placeholder allows the file content to be
accessed from the archive transparently.
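The archive-then-placeholder pattern can be sketched as follows. This
illustrates the general technique only and is not how HT Filestore is
implemented: the stub layout, the `STUB_MAGIC` marker and the function
names are all assumptions, and a real product would intercept reads at
the file system level rather than through a helper function.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path

STUB_MAGIC = b"ARCHIVE-STUB\n"  # hypothetical marker identifying a placeholder


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def archive_and_stub(source: Path, archive_dir: Path) -> Path:
    """Copy a file to the archive, verify the copy, then leave a small placeholder."""
    archived = archive_dir / source.name
    shutil.copy2(source, archived)
    checksum = sha256_of(source)
    if sha256_of(archived) != checksum:
        raise IOError("archive copy does not match; original left in place")
    stub = {"archived_at": str(archived), "sha256": checksum}
    # Only now is the original content replaced by the placeholder.
    source.write_bytes(STUB_MAGIC + json.dumps(stub).encode())
    return archived


def read_through(path: Path) -> bytes:
    """Return file content, following a placeholder to the archive if present."""
    raw = path.read_bytes()
    if not raw.startswith(STUB_MAGIC):
        return raw
    stub = json.loads(raw[len(STUB_MAGIC):])
    return Path(stub["archived_at"]).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    disc, archive = Path(tmp, "disc"), Path(tmp, "archive")
    disc.mkdir()
    archive.mkdir()
    image = disc / "aerial_0001.tif"  # stand-in for a large aerial image
    image.write_bytes(b"\x00" * 100_000)

    archive_and_stub(image, archive)
    stub_size = image.stat().st_size
    restored = read_through(image)

print(stub_size, len(restored))
```

The key design point is the ordering: the content on disc is removed
only after the archived copy has been verified against the original.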
Ordnance Survey plans to maintain the digital mapping archive in
perpetuity. The UDO media has a lifespan of up to 50 years, and the
software provides media migration capabilities to move data to
next-generation media in the future.
By constantly transferring the data from disc to media that is
powered down except when accessed, power consumption is reduced, as
are hardware and management costs related to keeping data on
primary storage, says Dave Lipsey, information systems
infrastructure manager at Ordnance Survey.
"Data is 91% of our revenue so we are very sensitive about its
longevity, value and currency," he says.
A looming data retention crisis?
In August, the Storage Networking Industry Association (SNIA)
Europe produced a "100 year archive requirements" report, which
warned of a looming long-term information retention crisis.
Eighty per cent of respondents said they needed to keep
information for more than 50 years, and 68% of respondents said
they must keep data for more than 100 years.
Juergen Arnold, chairman of SNIA Europe, says, "More and more
directives are being published at global, pan-European and
country-level requiring that organisations preserve data in a safe
and accessible format for decades. This should be an essential
element of most storage strategies going forward."
Arnold says users need to understand which information they need
to retain long-term, and what they need to dispose of, and then
apply an IT strategy.
"There is no format in the industry that will carry you through
the next 100 years, as a microfiche did in the past. You need to
use an open digital format, planning to use it for between five and
10 years, and prepare for a technology refresh after that," he
says.
"Good storage tapes last for 15 years, but will you have the
interface technology to read back the information, or the interface
to connect to it? This is why the technology refresh is
important."
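Arnold's refresh intervals imply a rough number of migrations over a
long retention period. The short calculation below is a back-of-envelope
sketch; the `refresh_count` function is hypothetical and assumes that
data is written once at the start and then moved to new technology at
the end of each interval.

```python
import math


def refresh_count(retention_years: int, refresh_interval_years: int) -> int:
    """Migrations needed after the initial write to cover the retention period."""
    return max(0, math.ceil(retention_years / refresh_interval_years) - 1)


# Intervals taken from the article: a 5 to 10 year format refresh,
# 15-year tape, and UDO media with a lifespan of up to 50 years.
for interval in (5, 10, 15, 50):
    print(interval, refresh_count(100, interval))
```

On these assumptions, a 100-year archive refreshed every five to 10
years would be migrated between nine and 19 times.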