Until recently just a few organisations needed to worry about maintaining long-term digital archives. But now it concerns every major enterprise.
Financial data has to be kept for between five and 10 years, and e-mail now has to be archived too, partly for legal reasons and partly because past messages are increasingly recognised as an important source of information. There is also growing demand to store transactional and customer-related data generated by e-commerce applications.
At the same time libraries and records offices are now offering online access to users, as well as providing paper- or film-based documents and images. The question arises in this case of when it will be desirable or safe to jettison the paper-based versions of this information.
A major issue that has to be addressed with ageing archives is the physical longevity of the storage medium and the readability of the format, which may be discontinued.
Recently, the UK Public Records Office published a report entitled Electronic Records Created in Office Systems urging organisations to recycle archived data onto new media every three years. This would be a significant burden for many enterprises, with implications both for the choice of storage systems and IT management. The issue can be circumvented for the enterprise by outsourcing the archiving, but it still has to be tackled by the service provider.
The first step in setting an archive strategy is to define the requirements. There are really two factors to consider: the nature of the data; and the frequency and mode of access.
Taking data first, a distinction can be made between complete independent entities such as images or text documents, and database records that may be relatively meaningless without links to associated data. Then on the access front, the most obvious distinction is by frequency of access. At one extreme is information that is virtually never accessed, where the motive for going digital is to save space. In this case, a storage medium that is slow to access can be chosen, with the emphasis being on immutability and durability. On the other hand some archived data, particularly when relatively young, may be accessed quite a lot and so needs to be retained on a faster storage medium, possibly disc and certainly some form of immediately online storage.
The mode of access should also be considered, such as whether users want the ability to search the data and download extracts on demand, or print off copies of documents.
There is also the size of the archive to consider. For smaller records, optical-based storage systems such as CD-Rom or Write Once Read Many (Worm) drives are ideal because of their high reliability and durability. For very high data volumes, tape systems are more convenient and cost effective, because of their greater capacity and lower unit storage price. But tape is not suitable when high-speed online access is required.
One archive that has to satisfy the worst of all cases is the US Government's National Satellite Land Remote Sensing Data Archive, comprising data supplied largely by NASA. It currently holds 120Tbytes, but this is expected to increase to 2,400Tbytes by 2005. It is one of the world's largest archives of calibrated data, and certainly the largest continuously available online. With terabyte drives expected to be available by 2005, it will still need 2,400 units to serve its users.
There is one feature that virtually all digital archives share: the data is only written once. Archive data is rarely overwritten and it is often highly desirable not to be able to do so. For this reason, this month's announcement by storage system maker StorageTek of a non-erasable write-once tape-based system was a significant development. Tape is widely used for archiving already and rarely rewritten but it is only now that a suitable system allowing multiple reads but prohibiting rewrites has become available.
Even without this write-once capability, tape is still the best medium for large scale archiving of conventional transactional data that needs occasional but not frequent access, according to Ian Massingham, hosting operations director at Energis2, the Internet arm of telecoms carrier Energis. Discs are too expensive, and optical storage lacks the capacity for his company's Internet-based data archiving service.
But Peter Roberts, sales and marketing director of archive storage supplier IXOS, argues that tape systems are not reliable enough for applications involving smaller volumes of data, especially where for legal or other reasons loss of information could have serious consequences. "We suggest you back up onto something very tried and trusted, almost like having data set in glass so that you can see it, but can't touch it," says Roberts. "Nowadays CD-Rom or Worm drives are more suitable for that than older media such as tapes."
The argument for disc is put by Ajay Lukha, European director of storage system maker Storcase. "Tape may be cheapest, but disc drives are coming down quickly in price, and are definitely in the sweet spot for price/performance," says Lukha. They should be chosen, therefore, for archived data that needs to be accessed frequently, Lukha contends.
These contradictory arguments really reinforce the point that you should first analyse your archiving requirements carefully and then pick the best horse for the course. In many cases this should involve a combination of media, according to Stephen Gerrard of Princeton Softech, another supplier in this field. Many enterprises, says Gerrard, are keeping data that should be archived in their production disc-based systems. This is motivated by the fear that once relegated to tape or some other medium it might prove inaccessible when needed.
Although disc drives are indeed faster than ever, their performance is not increasing as rapidly as the average enterprise's total volume of data. As a result, access times deteriorate as the data mountain accumulates, unless some is backed up regularly to an archive. "You are actually paying a price for keeping all that data every time you run your applications," says Gerrard. "If you can archive some of it safely, you can liberate a lot of computing power, and often roll back plans for expanding computer capacity, saving quite a bit of money."
The recommended solution is then a hierarchical approach whereby data is created in memory, retained in cache if very frequently accessed, then stored in disc drives if untouched for, say, a day. Then after a period of a year, or if it has not been touched at all during a specified period, it can be backed up to tape or perhaps optical storage.
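The tiering rule described above can be sketched as a small decision function. This is only an illustration: the function name, the tier labels and the 90-day idle threshold are assumptions, and the one-day and one-year figures are simply the examples given in the text.

```python
from datetime import timedelta

# Thresholds taken from the examples in the text ("say, a day", "a year").
DISC_AFTER_IDLE = timedelta(days=1)
ARCHIVE_AFTER_AGE = timedelta(days=365)
# Assumed value for the "specified period" of no access; not from the article.
ARCHIVE_AFTER_IDLE = timedelta(days=90)


def choose_tier(age, idle, frequently_accessed=False):
    """Pick a storage tier from an item's age and time since last access.

    age  -- time since the item was created
    idle -- time since the item was last touched
    """
    # Old data, or data untouched for the specified period, moves to the archive.
    if age >= ARCHIVE_AFTER_AGE or idle >= ARCHIVE_AFTER_IDLE:
        return "tape_or_optical"
    # Data untouched for a day or more drops down to disc.
    if idle >= DISC_AFTER_IDLE:
        return "disc"
    # Recently touched data stays in cache if it is in heavy use.
    if frequently_accessed:
        return "cache"
    return "memory"
```

In practice the thresholds would be set per class of data, following the requirements analysis recommended earlier, rather than fixed for the whole archive.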
There may be a further migration back to microfilm or paper for information that is never likely to be accessed online again, but whose retention is still deemed desirable. Eventually purely digital mechanisms may take over even for ultra long-term storage of data, but at present there is insufficient trust in the longevity of the media. So currently some organisations with long-term storage requirements are in a state of transition.
One example is Companies House, which has been offering its customers online Internet access to corporate records since March 1999 and which won the Computer Weekly E-government Excellence Award this month. Companies House has digital records alongside paper-based or microfilm alternatives, and this is likely to continue for the foreseeable future, according to Steve Cryer, a project manager in the IT department. "We will get to the point where we won't have paper, but we will still have microfilm because many customers still want it," says Cryer.
But a significant number of organisations are unwilling to trust digital media even for short-term storage and will continue to use paper, particularly for documents where legal ownership or copyright are issues. In such cases IT may still be used to manage and keep track of paper documents, but there would only be one version. According to Nicholas Gomersall, managing director of Acumen Business Solutions, which supplies document management software, many enterprises are still wary of committing all their documents to electronic media, partly because they feel unable to cope with the issues of ageing data and having to relocate periodically to new media. Companies are also deterred by not knowing the most suitable format to store documents in, whether this should be HTML, text, PDF files or some other format.
Acumen's software can be used to catalogue documents in a mixed environment. The key then, says Gomersall, is to have a well-managed process for delivering documents quickly when users request them.
If transactional data is archived to paper, you must sort out how to represent data interdependencies in the archive. Within a production database, such interdependencies are catered for by a combination of metadata providing an index to the information, and application software.
According to Gerrard, it is vital that interdependencies are captured in archived data, otherwise it will prove meaningless in years to come. At least this is recognised in the UK Government's guidelines on electronic records management, with the advice, "Record capture mechanisms should include all necessary metadata needed to access and manage the electronic record throughout the full lifecycle."
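One way to picture this advice is an archive record that carries its own metadata and its links to related records, so that it can still be interpreted without the original application. The sketch below is purely illustrative: the `ArchiveRecord` class, its field names and the JSON encoding are assumptions, not something prescribed by the guidelines or by any supplier quoted here.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ArchiveRecord:
    """A self-describing archive entry (hypothetical layout)."""
    record_id: str
    payload: dict                                     # the business data itself
    metadata: dict = field(default_factory=dict)      # e.g. source system, capture date
    related_ids: list = field(default_factory=list)   # links to dependent records


def to_archive_json(record):
    """Serialise the record, metadata and links together as one JSON document."""
    return json.dumps(asdict(record), sort_keys=True)


def from_archive_json(text):
    """Rebuild the record from its archived JSON form."""
    return ArchiveRecord(**json.loads(text))
```

Because the links travel with the data, a record read back years later still identifies the customer or order it depended on, even if the production database that once joined them no longer exists.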
Also needed are tools for accessing the archived data effectively and performing searches within it, says Gerrard. "You also need support for granular transactions so if you need to access data that happens to comprise just five or 10 rows, you don't have to restore all 50 million rows of that table."
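The idea of granular access can be illustrated with a keyed lookup against an archive store, in place of a full restore. This is a minimal sketch that uses an in-memory SQLite database as a stand-in for the archive; the table name `archived_orders`, its columns and both helper functions are hypothetical.

```python
import sqlite3


def build_archive(rows):
    """Create an in-memory stand-in for an archive table with an indexed key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE archived_orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO archived_orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def fetch_rows(conn, order_ids):
    """Return only the requested rows, looked up by primary key."""
    order_ids = list(order_ids)
    if not order_ids:
        return []
    # One placeholder per requested key keeps the query parameterised.
    placeholders = ",".join("?" for _ in order_ids)
    query = (
        "SELECT order_id, customer, amount FROM archived_orders "
        f"WHERE order_id IN ({placeholders}) ORDER BY order_id"
    )
    return conn.execute(query, order_ids).fetchall()
```

The point is that the lookup touches only the handful of rows requested through the index; how a real archive product achieves this on tape or optical media will differ.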
This affects performance, which rapidly becomes a major consideration as the archive swells in size and online access is required.
It is clear then that the science of digital archiving is still in its infancy. The IT industry is not yet old enough to have proved that it has the answers for long-term archiving, and to date paper remains the only established medium for holding information over very long periods. Eventually some form of automated staging will be achieved, in which data is maintained in a usable form on current media without human intervention. But long-term data storage remains an ad hoc process of restaging data at arbitrary intervals, as NASA has found.