As the quest for the paperless office continues, Philip Hunter
reports on current thinking on digital archiving
Until recently just a few organisations needed to worry about
maintaining long-term digital archives. But now it concerns every
major enterprise.
Financial data has to be kept for between five and 10 years and
now e-mail has to be archived partly for legal reasons and partly
because past messages are increasingly being recognised as an
important source of information. There is also growing demand to
store transactional and customer-related data generated by
e-commerce applications.
At the same time libraries and records offices are now offering
online access to users, as well as providing paper- or film-based
documents and images. The question arises in this case of when it
will be desirable or safe to jettison the paper-based versions of
this information.
A major issue that has to be addressed with ageing archives is
the physical longevity of the storage medium and the readability of
the format, which may be discontinued.
Recently, the UK Public Records Office published a report
entitled Electronic Records Created in Office Systems urging
organisations to recycle archived data onto new media every three
years. This would be a significant burden for many enterprises,
with implications both for the choice of storage systems and IT
management. The issue can be circumvented for the enterprise by
outsourcing the archiving, but it still has to be tackled by the
service provider.
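To give a feel for what a three-year recycling rule means in practice, here is a minimal sketch of a refresh check. The Volume class, its fields and the function names are illustrative assumptions, not anything set out in the report; only the three-year interval comes from the recommendation above.

```python
from dataclasses import dataclass
from datetime import date

REFRESH_YEARS = 3  # interval taken from the recommendation quoted above


@dataclass
class Volume:
    label: str        # hypothetical identifier for a tape, disc or cartridge
    written_on: date  # date the data was last copied onto this medium


def refresh_due(volume: Volume, today: date, years: int = REFRESH_YEARS) -> bool:
    """Return True once the volume has reached its refresh date."""
    target_year = volume.written_on.year + years
    try:
        due = volume.written_on.replace(year=target_year)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to the 28th.
        due = volume.written_on.replace(year=target_year, day=28)
    return today >= due


def volumes_to_refresh(volumes, today):
    """List the labels of all volumes that should be recycled onto new media."""
    return [v.label for v in volumes if refresh_due(v, today)]
```

A real archive would probably also track the media type and read-error rates, but even this simple date test shows that the rule turns archiving into a rolling, scheduled job rather than a one-off copy.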
The first step in setting an archive strategy is to define the
requirements. There are really two factors to consider: the nature
of the data; and the frequency and mode of access.
Taking data first, a distinction can be made between complete
independent entities such as images or text documents, and database
records that may be relatively meaningless without links to
associated data. Then on the access front, the most obvious
distinction is by frequency of access. At one extreme is
information that is virtually never accessed, where the motive for
going digital is to save space. In this case, a storage medium that
is slow to access can be chosen, with the emphasis being on
immutability and durability. On the other hand some archived data,
particularly when relatively young, may be accessed frequently and
so needs to be retained on a faster storage medium, possibly disc
and certainly some form of immediately accessible online storage.
The mode of access should also be considered, such as whether
users want the ability to search the data and download extracts on
demand, or print off copies of documents.
There is also the size of the archive to consider. For smaller
records, optical-based storage systems such as CD-Rom or Write Once
Read Many (Worm) drives are ideal because of their high reliability
and durability. For very high data volumes, tape systems are more
convenient and cost effective, because of their greater capacity
and lower unit storage price. But tape is not suitable when
high-speed online access is required.
One archive that has to satisfy the worst of all cases is the US
Government's National Satellite Land Remote Sensing Data Archive,
comprising data supplied largely by NASA. It currently holds
120Tbytes, but this is expected to increase to 2,400Tbytes by 2005.
It is one of the world's largest archives of calibrated data, and
certainly the largest continuously available online. With terabyte
drives expected to be available by 2005, it will still need 2,400
units to serve its users.
There is one feature that virtually all digital archives share:
the data is only written once. Archive data is rarely overwritten
and it is often highly desirable not to be able to do so. For this
reason, this month's announcement by storage system maker
StorageTek of a non-erasable write-once tape-based system was a
significant development. Tape is widely used for archiving already
and rarely rewritten but it is only now that a suitable system
allowing multiple reads but prohibiting rewrites has become
available.
Even without this write-once capability, tape is still the best
medium for large scale archiving of conventional transactional data
that needs occasional but not frequent access, according to Ian
Massingham, hosting operations director at Energis2, the Internet
arm of telecoms carrier Energis. Discs are too expensive, and
optical storage lacks the capacity for his company's Internet-based
data archiving service.
But Peter Roberts, sales and marketing director of archive
storage supplier IXOS, argues that tape systems are not reliable
enough for applications involving smaller volumes of data,
especially where for legal or other reasons loss of information
could have serious consequences. "We suggest you back up onto
something very tried and trusted, almost like having data set in
glass so that you can see it, but can't touch it," says Roberts.
"Nowadays CD-Rom or Worm drives are more suitable for that than
older media such as tapes."
The argument for disc is put by Ajay Lukha, European director of
storage system maker Storcase. "Tape may be cheapest, but disc
drives are coming down quickly in price, and are definitely in the
sweet spot for price/performance," says Lukha. They should be
chosen, therefore, for archived data that needs to be accessed
frequently, Lukha contends.
These contradictory arguments really reinforce the point that
you should first analyse your archiving requirements carefully and
then pick the best horse for the course. In many cases this should
involve a combination of media, according to Stephen Gerrard of
Princeton Softech, another supplier in this field. Many
enterprises, says Gerrard, are keeping data that should be archived
in their production disc-based systems. This is motivated by the
fear that once relegated to tape or some other medium it might
prove inaccessible when needed.
Drive performance
Although disc drives are indeed faster than ever, their
performance is not increasing as rapidly as the average
enterprise's total volume of data. As a result, access times
deteriorate as the data mountain accumulates, unless some is backed
up regularly to an archive. "You are actually paying a price for
keeping all that data every time you run your applications," says
Gerrard. "If you can archive some of it safely, you can liberate a
lot of computing power, and often roll back plans for expanding
computer capacity, saving quite a bit of money."
The recommended solution is then a hierarchical approach whereby
data is created in memory, retained in cache if very frequently
accessed, then stored in disc drives if untouched for, say, a day.
Then after a period of a year, or if it has not been touched at all
during a specified period, it can be backed up to tape or perhaps
optical storage.
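The hierarchy described above can be sketched as a simple placement rule. This is an illustration only: the one-day and one-year thresholds follow the figures in the text, while the 90-day idle limit, the tier names and the function name are assumptions made for the example.

```python
from datetime import datetime, timedelta

DISC_AFTER_IDLE = timedelta(days=1)      # "say, a day" untouched
ARCHIVE_AFTER_AGE = timedelta(days=365)  # "a period of a year"
ARCHIVE_AFTER_IDLE = timedelta(days=90)  # assumed value for the "specified period"


def choose_tier(created: datetime, last_access: datetime, now: datetime) -> str:
    """Pick a storage tier from the age of the data and how long it has sat idle."""
    age = now - created
    idle = now - last_access
    if age >= ARCHIVE_AFTER_AGE or idle >= ARCHIVE_AFTER_IDLE:
        return "archive"  # tape or perhaps optical storage
    if idle >= DISC_AFTER_IDLE:
        return "disc"
    return "cache"  # still being accessed very frequently
```

In practice the thresholds would be tuned per application, and a migration job would run this rule periodically and move whatever has changed tier.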
There may be a further migration back to microfilm or paper for
information that is never likely to be accessed online again, but
whose retention is still deemed desirable. Eventually purely
digital mechanisms may take over even for ultra long-term storage
of data, but at present there is insufficient trust in the
longevity of the media. So currently some organisations with
long-term storage requirements are in a state of transition.
One example is Companies House, which has been offering its
customers online Internet access to corporate records since March
1999 and which won the Computer Weekly E-government
Excellence Award this month. Companies House has digital records
alongside paper-based or microfilm alternatives, and this is likely
to continue for the foreseeable future, according to Steve Cryer, a
project manager in the IT department. "We will get to the point
where we won't have paper, but we will still have microfilm because
many customers still want it," says Cryer.
But a significant number of organisations are unwilling to trust
digital media even for short-term storage and will continue to use
paper, particularly for documents where legal ownership or
copyright is an issue. In such cases IT may still be used to manage
and keep track of paper documents, but there would only be one
version. According to Nicholas Gomersall, managing director of
Acumen Business Solutions, which supplies document management
software, many enterprises are still wary of committing all their
documents to electronic media, partly because they feel unable to
cope with the issues of ageing data and having to relocate
periodically to new media. Companies are also deterred by not
knowing the most suitable format to store documents in, whether
this should be HTML, text, PDF or some other format.
Acumen's software can be used to catalogue documents in a mixed
environment. The key then, says Gomersall, is to have a
well-managed process for delivering documents quickly when users
request them.
If transactional data is archived to paper, you must sort out
how to represent data inter-dependencies in the archive. Within a
production database, such interdependencies are catered for by a
combination of metadata providing an index to the information, and
application software.
According to Gerrard, it is vital that interdependencies are
captured in archived data, otherwise it will prove meaningless in
years to come. At least this is recognised in the UK Government's
guidelines on electronic records management, with the advice,
"Record capture mechanisms should include all necessary metadata
needed to access and manage the electronic record throughout the
full lifecycle."
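One way to picture this is an archived record that carries its own metadata and an explicit list of the records it depends on. The sketch below is a hypothetical structure, not a format taken from the guidelines or from any supplier's product; the class, field and function names are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ArchivedRecord:
    record_id: str
    payload: str                                      # the archived content itself
    metadata: dict = field(default_factory=dict)      # e.g. creator, date, format
    related_ids: list = field(default_factory=list)   # records this one depends on


def missing_links(records):
    """Return the ids that are referenced by some record but absent from the archive."""
    known = {r.record_id for r in records}
    return sorted(
        {rid for r in records for rid in r.related_ids if rid not in known}
    )
```

Running a check like missing_links before data leaves the production system would flag, for instance, an invoice archived without the customer record it refers to.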
Also needed are tools for accessing the archived data
effectively and performing searches within it, says Gerrard. "You
also need support for granular transactions so if you need to
access data that happens to comprise just five or 10 rows, you
don't have to restore all 50 million rows of that table."
This affects performance, which rapidly becomes a major
consideration as the archive swells in size and online access is
required.
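Gerrard's point about granular access can be illustrated with an indexed archive table, where a handful of rows is fetched by key without restoring the whole table. The sketch uses Python's standard sqlite3 module with an in-memory database; the table, column and function names are invented for the example and do not describe any particular archiving product.

```python
import sqlite3


def build_archive(conn, rows):
    """Create a hypothetical archive table with an index on the lookup key."""
    conn.execute(
        "CREATE TABLE archived_orders ("
        "order_id INTEGER, customer_id INTEGER, amount REAL)"
    )
    conn.execute("CREATE INDEX idx_customer ON archived_orders (customer_id)")
    conn.executemany("INSERT INTO archived_orders VALUES (?, ?, ?)", rows)
    conn.commit()


def fetch_customer_rows(conn, customer_id):
    """Fetch only the rows for one customer, leaving the rest of the archive untouched."""
    cur = conn.execute(
        "SELECT order_id, customer_id, amount FROM archived_orders "
        "WHERE customer_id = ? ORDER BY order_id",
        (customer_id,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
build_archive(conn, [(i, i % 1000, float(i)) for i in range(10_000)])
rows = fetch_customer_rows(conn, 42)  # a small slice of the 10,000 archived rows
```

The same principle applies at far larger scale: what matters is that the archive keeps an index alongside the data, so a query touches only the rows it needs.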
It is clear then that the science of digital archiving is still
in its infancy. The IT industry is not yet old enough to have proved
that it has the answers for long term archiving, and to date paper
remains the only established medium for holding information over
very long periods. Eventually some form of automated staging will be
achieved, in which data is maintained in a usable form on current
media without human intervention. But long-term data storage remains
an ad hoc process of restaging data at arbitrary intervals, as NASA
has found.