Preserving our digital heritage

The British Library is on a mission to digitise its immense collection of books and manuscripts. It is a challenge of mind-boggling proportions, as Helen Beckett discovered.

The British Library, whose reading rooms have been frequented by the likes of Karl Marx, Charles Dickens and Virginia Woolf, is on a very modern mission. Its aim is to preserve the UK's digital heritage, as well as continue its stewardship of physical tomes. But the rapid obsolescence of software publishing formats could make this task more challenging than preserving the Domesday Book. 

One of the world's great treasure troves, the library houses 13 million books, seven million manuscripts and 4.5 million maps, as well as 3.5 million sound recordings, eight million stamps and 58 million newspapers in various formats. Although most of its content is ink on paper, Roger Butcher, programme manager for the Integrated Library System, says, "The future is without doubt digital."

The library's digital vision is a strategy of two halves: to keep in perpetuity the UK's digital publication output, and to provide digital access to the library's body of historical material.

The latter encourages a more outward-looking and commercial perspective as documents are delivered to readers all over the world. Plus, digitisation improves the quality of access. "It is possible to view the Golf Book - an illuminated manuscript from about 1540 showing one of the earliest illustrations of a golf player - much better on the web than through the glass of an exhibition case," says Butcher.

Digitisation of existing stock is the most straightforward piece of the strategy, but it is a huge task. Since the project began 10 years ago, just 0.3% of the library's collection has been committed to code. One inhibitor is cost, as replicating an image in binary code is just a small part of the process. As well as scanning documents, the work includes building the database, compiling metadata and managing the project. All told, the library calculates that it costs £1 to digitise each page, and there are still more than five billion pages to go.

The bulk of the digital conversion is being performed by the British Library's sister site in Boston Spa, Yorkshire, with the aid of a scanning package called Relais. However, the library has also accepted assistance from Microsoft, which estimates it can halve the cost. To date, Microsoft has pledged to digitise a further 25 million pages in the form of 100,000 out-of-copyright books and deliver search results for this content through the MSN Book Search service. MSN Search will launch an initial public beta this year. 
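
To put those figures side by side, here is a rough back-of-the-envelope sketch. It uses only the numbers quoted above; the variable and function names are illustrative, and the halved rate is Microsoft's own estimate rather than a confirmed price.

```python
# Figures quoted in the article; everything else is an illustrative assumption.
COST_PER_PAGE_GBP = 1.00          # the library's estimate per digitised page
PAGES_REMAINING = 5_000_000_000   # "more than five billion pages to go"
MICROSOFT_PAGES = 25_000_000      # pages Microsoft has pledged to digitise
MICROSOFT_COST_FACTOR = 0.5       # Microsoft "estimates it can halve the cost"


def digitisation_cost(pages: int, cost_per_page: float) -> float:
    """Return the total cost in pounds of digitising `pages` pages."""
    return pages * cost_per_page


backlog_cost = digitisation_cost(PAGES_REMAINING, COST_PER_PAGE_GBP)
pledge_at_library_rate = digitisation_cost(MICROSOFT_PAGES, COST_PER_PAGE_GBP)
pledge_at_halved_rate = digitisation_cost(
    MICROSOFT_PAGES, COST_PER_PAGE_GBP * MICROSOFT_COST_FACTOR
)
pledge_share = MICROSOFT_PAGES / PAGES_REMAINING

print(f"Backlog at the library's rate: £{backlog_cost:,.0f}")
print(f"Pledged pages at the library's rate: £{pledge_at_library_rate:,.0f}")
print(f"Pledged pages at the halved rate: £{pledge_at_halved_rate:,.0f}")
print(f"Pledge as a share of the backlog: {pledge_share:.1%}")
```

On these numbers the pledge covers about half of one percent of the remaining pages, which gives a sense of the scale of the task.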

However arduous the task of digitisation, it is the other piece of the strategy that really challenges the library - preserving the UK's digital publication output. "The bit that no one has the answer for is digital preservation," says Richard Boulderstone, the library's director of e-strategy and programmes.

Digital material is ephemeral owing to the changing nature of software publishing formats and the effort of maintaining sites. Thus, the original Domesday Book is still readable, yet the laser disc version produced by the BBC is already technologically out of fashion. Likewise, the paper archives of elections in the last century are still intact, but the web content that powered this year's contest has already disappeared from view.

Another problem that the library has to contend with is the lack of a standard publishing format. Electronic publishers use a variety of file formats, including Word, Excel, PDF, HTML, JPEG, GIF and CSV. And though most are moving to XML, they are using different document type definitions. One hundred years from now, those file formats will be out of date. "Keeping the bits is not enough - you need to be able to recreate from them something that readers in decades and centuries to come will be able to use," says Boulderstone.

The library has scouted around for answers to this conundrum and has come up with three solutions that will be familiar to anyone involved in data migration.

The ideal is to create a universal virtual machine that can view any file format. File format conversion provides a second and more practical option, taking the old file and converting it into a format that is readable by the current version of software. Third is file emulation, where instead of changing the original file format, the platform on which it was originally viewed is emulated. 
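
As a rough illustration of the second option, file format conversion, the sketch below migrates a small CSV file into XML using only Python's standard library. It is not the library's actual migration tooling: the function name, the element names and the sample record are assumptions made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text: str) -> str:
    """Convert CSV text with a header row into a simple XML document.

    Assumes well-formed rows, i.e. no row has more cells than the header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    root = ET.Element("records")
    for row in reader:
        record = ET.SubElement(root, "record")
        for column, value in row.items():
            # Column names go in an attribute, so they need not be valid XML tags.
            ET.SubElement(record, "field", name=column).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical legacy file: one catalogue row in CSV.
legacy_csv = "title,year\nGolf Book,1540\n"
migrated_xml = csv_to_xml(legacy_csv)
print(migrated_xml)
```

The original bits are left untouched and a new, currently readable copy is produced alongside them, which is the essence of migration as opposed to emulation.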

"The purists seem to think file emulation is the way to go, but it seems very complicated. We prefer to put our efforts into data migration - it is something we can do here and now," says Boulderstone. 

Whichever technique is selected, the task of preserving digital heritage will be a large one because of the library's copyright status in the UK. The Legal Deposit Libraries Act of 2003 entitles it to keep a copy of any publication in the UK, thus paving the way for a national digital library. Online access has to be carefully managed in order to safeguard any commercial interests of owners.

"Certain e-journals may likely only be viewed from the reading room because anything else might undermine the publisher's business model, and that is not the purpose of a legal deposit," explains Boulderstone. 

Digital rights management, therefore, promises to be an important component of the Digital Library Programme. "It is very important to us, going into the future, that we have a mechanism to ensure that readers are approved to access different categories of material," he says.

But the library does not want to be saddled with the manual cataloguing of every new digital item it receives, and so the plan is to generate catalogue records automatically from the metadata supplied by publishers. The library is proposing a standard that would embed reader privileges as part of the metadata so access could be filtered accordingly from the library's central catalogue.
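
The idea of carrying reader privileges inside publisher metadata can be sketched as follows. The proposed standard itself is not described here, so the field names, the three access levels and the restrictive default are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical access levels, ordered from least to most privileged reader.
ACCESS_LEVELS = {"public": 0, "registered": 1, "reading_room": 2}


@dataclass(frozen=True)
class CatalogueRecord:
    title: str
    required_access: str


def record_from_metadata(metadata: dict) -> CatalogueRecord:
    """Build a catalogue record directly from publisher-supplied metadata.

    If the publisher gives no access level, default to the most restrictive.
    """
    return CatalogueRecord(
        title=metadata["title"],
        required_access=metadata.get("access", "reading_room"),
    )


def visible_to(records: list, reader_access: str) -> list:
    """Return the records a reader with `reader_access` may view."""
    level = ACCESS_LEVELS[reader_access]
    return [r for r in records if ACCESS_LEVELS[r.required_access] <= level]


# Made-up publisher metadata for three deposited items.
catalogue = [
    record_from_metadata({"title": "Out-of-copyright novel", "access": "public"}),
    record_from_metadata({"title": "Commercial e-journal", "access": "reading_room"}),
    record_from_metadata({"title": "Item with no access field"}),
]
remote_titles = [r.title for r in visible_to(catalogue, "public")]
onsite_titles = [r.title for r in visible_to(catalogue, "reading_room")]
print(remote_titles)
print(onsite_titles)
```

Defaulting to the most restrictive level is a design choice made for this sketch: an item that arrives without access metadata stays confined to the reading room until someone says otherwise.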

Fortunately, there is one part of digital preservation that can be done here and now: storing data objects. Once an item has been received and validated as authentic, the object is digitally signed, sealed and stored. The library uses an algorithm-based digital signing engine from nCipher to create a sealed package that is stored four times in multiple modes. "We cannot afford to lose anything, and the probability of that happening with four copies is minute," says Boulderstone. As prices for disc and tape media converge, discs will play a larger part in storage as the library moves to store 300Tbytes of data over the next five years.
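
A minimal sketch of the seal-and-replicate idea is shown below. It does not reproduce the nCipher engine, whose interface is not described here. As a stand-in it uses a keyed SHA-256 hash (HMAC) from Python's standard library, which detects tampering but is not a true public-key digital signature. The key, file names and store layout are assumptions.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path

SECRET_KEY = b"demo-key-not-for-production"  # assumption: a real system would use managed keys
COPIES = 4                                   # the article says each package is stored four times


def seal(payload: bytes) -> str:
    """Return a keyed SHA-256 tag for the payload (stand-in for a signature)."""
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def store_copies(name: str, payload: bytes, stores: list) -> None:
    """Write the payload and its seal into every store."""
    tag = seal(payload)
    for store in stores:
        (store / name).write_bytes(payload)
        (store / (name + ".seal")).write_text(tag)


def intact_copies(name: str, stores: list) -> list:
    """Return the stores whose copy still matches its seal."""
    good = []
    for store in stores:
        payload = (store / name).read_bytes()
        tag = (store / (name + ".seal")).read_text()
        if hmac.compare_digest(seal(payload), tag):
            good.append(store)
    return good


with tempfile.TemporaryDirectory() as tmp:
    stores = [Path(tmp) / f"store{i}" for i in range(COPIES)]
    for store in stores:
        store.mkdir()
    store_copies("item.bin", b"deposited e-journal issue", stores)
    # Simulate damage to one copy; the other three should still verify.
    (stores[0] / "item.bin").write_bytes(b"corrupted")
    surviving = len(intact_copies("item.bin", stores))

print(surviving)  # prints 3
```

With a seal stored beside each copy, a damaged copy can be spotted and restored from any copy that still verifies.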

As the library team mulls over the complexities of a digital future, it can draw confidence from the fact that the management of its physical stock is in good shape following the implementation of a new library system from Ex Libris. The Israeli-developed software package, called Automated Library Expandable Program (Aleph), was adopted in 2000 with the purpose of replacing 16 legacy systems that ran the core functions of managing and locating stock and maintaining a catalogue.

Reducing cost was a big driver for the upgrade. Legacy applications were in a gamut of languages from Assembler to Cobol and were very expensive to maintain as some were over 30 years old. In some instances, when the packages were retired the people supporting them took the opportunity to retire too. "In one case we were the only people in the world using the package," says Butcher of the Elhill search engine, originally created in the 1960s by the National Library of Medicine in the US.

As well as cutting costs, there was the desire to be part of a modern information network. "As a national library, historically we had felt 'different' and had built a lot of custom packages to reflect our unique status. However, we took the positive view that we now wanted to be the same," says Butcher. "A lot of our functions are not so different from other libraries. Rather than go the bespoke route of modifying packages, we wanted to have the benefit of sharing experiences with other users." 

The British Library put out an invitation to tender and evaluated five contenders. An end-user requirements group consisting of library staff and readers sampled products and gave feedback on aspects including the intuitiveness of the user interface and search function. A priority was to choose a system with a workflow that worked for the library's diverse staff. "With our legacy systems, we had many people working in silos. With one integrated system we had the chance to reorganise workflows and a big issue was how these systems looked," says Butcher.

Although keen to join the mainstream with its system, the British Library was conscious that its size could complicate matters. It drew comfort from the fact that two other library colossi, the Russian State Library and the National Library of China, were users of the Ex Libris system. Another plus for the product was the work Harvard University had done with Ex Libris on streamlining the workflow. The British Library acquires 100,000 books and 300,000 journals each year from legal deposit alone. "The streamlined workflow means we are able to cope with these sorts of volumes," says Butcher.

The integrated package was rolled out and the legacy systems phased out, but work is still in progress. The latest phase is incorporating the library's 4.5 million manuscripts, and 750,000 are now in the system. However, this phase brings fresh challenges, as manuscripts are unique and filed just once by their authors. For this reason they need far more descriptive records, and the normal short metadata tag provided by the library system is insufficient for the wordy descriptions that are normally filed.

Whatever technical challenges in building a digital future lie ahead for the national library, it can face them as part of a network of peers. Since its decision to use the Ex Libris system, the British Library has hosted the Aleph annual user conference and has been pleased with the rewards of community, says Butcher. "We feel as if we have come in from the cold. Instead of being on our own we can look to a worldwide pool of expertise to help solve problems."

Ex Libris library software

The development of Ex Libris’ Automated Library Expandable Program (Aleph) system started in 1980, when a team of librarians, systems analysts and programmers took on the challenge of creating an automated library system that was efficient, user-friendly, and multilingual.

Prior to Aleph, searching and displaying texts in different languages and scripts, such as Latin, Greek or Arabic, was an immense task, further complicated by the fact that in some scripts, characters change according to the character that precedes them.

Four years ago, the emergence of Unicode as a universal standard to deal with multiple character sets removed Aleph’s unique selling proposition. Since then, Ex Libris has focused on developing functionality that enables libraries to better share and deliver online material.

Initially Ex Libris released SFX, which constructs and embeds hyperlinks to electronic resources. A year later it launched Metalib, a generic search interface that functions across a broad pool of material globally. Digitool, released in 2002, turns libraries, archives and museums into publishers and enables them to share collections while protecting content and copyright.
