

The British Library is on a mission to digitise its
immense collection of books and manuscripts. It is a challenge of
mind-boggling proportions, as Helen Beckett
discovered.
The British Library, whose reading rooms have been frequented by
the likes of Karl Marx, Charles Dickens and Virginia Woolf, is on a
very modern mission. Its aim is to preserve the UK's digital
heritage, as well as continue its stewardship of physical tomes.
But the rapid obsolescence of software publishing formats could
make this task more challenging than preserving the Domesday
Book.
One of the world's great treasure troves, the library houses 13
million books, seven million manuscripts and 4.5 million maps, as
well as 3.5 million sound recordings, eight million stamps and 58
million newspapers in various formats. Although most of its content
is ink on paper, Roger Butcher, programme manager for the
Integrated Library System, says, "The future is without doubt
digital."
The library's digital vision is a strategy of two halves: to
keep in perpetuity the UK's digital publication output, and to
provide digital access to the library's body of historical
material.
The latter encourages a more outward-looking and commercial
perspective as documents are delivered to readers all over the
world. Plus, digitisation improves the quality of access. "It is
possible to view the Golf Book - an illuminated manuscript from
about 1540 showing one of the earliest illustrations of a golf
player - much better on the web than through the glass of an
exhibition case," says Butcher.
Digitisation of existing stock is the most straightforward piece
of the strategy, but it is a huge task. Since the project began 10
years ago just 0.3% of the library's collection has been committed
to code. One inhibitor is cost, as replicating an image in binary
code is just a small part of the process. Beyond scanning
documents, the work includes building the database, compiling
metadata and managing the project. All told, the
library calculates that it costs £1 to digitise each page and there
are still more than five billion pages to go.
The bulk of the digital conversion is being performed by the
British Library's sister site in Boston Spa, Yorkshire, with the
aid of a scanning package called Relais. However, the library has
also accepted assistance from Microsoft, which estimates it can
halve the cost. To date, Microsoft has pledged to digitise a
further 25 million pages in the form of 100,000 out-of-copyright
books and deliver search results for this content through the MSN
Book Search service. MSN Search will launch an initial public beta
this year.
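The figures quoted above give a feel for the scale, and the arithmetic can be sketched as follows. This is a back-of-envelope illustration only: applying Microsoft's "halve the cost" estimate to the whole backlog is an assumption made here, not a projection from the library, and the variable names are invented for the example.

```python
# Figures quoted in the article.
COST_PER_PAGE_GBP = 1.00          # library's estimate per digitised page
PAGES_REMAINING = 5_000_000_000   # "more than five billion pages to go"
MICROSOFT_PAGES = 25_000_000      # pages pledged by Microsoft
MICROSOFT_BOOKS = 100_000         # out-of-copyright books in that pledge


def backlog_cost(pages, cost_per_page):
    """Total cost in pounds of digitising the given number of pages."""
    return pages * cost_per_page


full_cost = backlog_cost(PAGES_REMAINING, COST_PER_PAGE_GBP)
# Assumption: the halved cost applies to every remaining page.
halved_cost = backlog_cost(PAGES_REMAINING, COST_PER_PAGE_GBP / 2)
pages_per_book = MICROSOFT_PAGES / MICROSOFT_BOOKS
microsoft_share = MICROSOFT_PAGES / PAGES_REMAINING

print(f"Backlog at £1 a page: £{full_cost:,.0f}")
print(f"Backlog at half the cost: £{halved_cost:,.0f}")
print(f"Average pages per pledged book: {pages_per_book:.0f}")
print(f"Pledge as a share of the backlog: {microsoft_share:.1%}")
```

On these numbers the pledge averages 250 pages a book and covers at most about 0.5% of the pages still to be digitised.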
However arduous the task of digitisation, it is the other piece
of the strategy that really challenges the library - preserving the
UK's digital publication output. "The bit that no one has the
answer for is digital preservation," says Richard Boulderstone, the
library's director of e-strategy and programmes.
Digital material is ephemeral owing to the changing nature of
software publishing formats and the effort of maintaining sites.
Thus, the original Domesday Book is still readable, yet the laser
disc version produced by the BBC is already technologically out of
fashion. Likewise, the paper archives of elections in the last
century are still intact, but the web content that powered this
year's contest has already disappeared from view.
Another problem that the library has to contend with is the lack
of a standard publishing format. Electronic publishers use a
variety of file formats, including Word, Excel, PDF, HTML, JPEG,
GIF and CSV. And though most are moving to XML, they are using
different document type definitions. One hundred years from now,
those file formats will be out of date. "Keeping the bits is not
enough - you need to be able to recreate from them something that
readers in decades and centuries to come will be able to use," says
Boulderstone.
The library has scouted around for answers to this conundrum and
has come up with three solutions that will be familiar to anyone
involved in data migration.
The ideal is to create a universal virtual machine that can view
any file format. File format conversion provides a second and more
practical option, taking the old file and converting it into a
format that is readable by the current version of software. Third
is file emulation, where instead of changing the original file
format, the platform on which it was originally viewed is
emulated.
"The purists seem to think file emulation is the way to go, but
it seems very complicated. We prefer to put our efforts into data
migration - it is something we can do here and now," says
Boulderstone.
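As a rough illustration of what file format conversion involves, the sketch below migrates a small CSV file to XML using only the Python standard library. It is not the library's migration process, which the article does not describe: the function name, the element names "records" and "record" and the sample data are all hypothetical, and the code assumes the CSV column headers are valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text, root_tag="records", row_tag="record"):
    """Convert CSV text to an XML string, one element per CSV row.

    Assumes every column header is a valid XML element name.
    """
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, row_tag)
        for field, value in row.items():
            ET.SubElement(record, field).text = value
    return ET.tostring(root, encoding="unicode")


# Sample data invented for the example.
sample = "title,year\nGolf Book,1540\n"
xml_text = csv_to_xml(sample)
print(xml_text)
```

The original bits are read once and rewritten in a format that current software understands, which is why migration can be done "here and now" while emulation requires rebuilding the old platform.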
Whichever technique is selected, the task of preserving digital
heritage will be a large one because of the library's copyright
status in the UK. The Legal Deposit Libraries Act of 2003 entitles
it to keep a copy of any publication in the UK, thus paving the
way for a national digital library. Online access has to be
carefully managed in order to safeguard any commercial interests of
owners.
"Certain e-journals may likely only be viewed from the reading
room because anything else might undermine the publisher's business
model, and that is not the purpose of a legal deposit," explains
Boulderstone.
Digital rights management, therefore, promises to be an
important component of the Digital Library Programme. "It is very
important to us, going in to the future, that we have a mechanism
to ensure that readers are approved to access different categories
of material," he says.
But the library does not want to be saddled with manually
cataloguing every new digital item it receives, so the plan is to
generate catalogue records automatically from the metadata supplied by
publishers. The library is proposing a standard that would embed
reader privileges as part of the metadata so access could be
filtered accordingly from the library's central catalogue.
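The idea of filtering access from privileges carried in the metadata can be sketched as below. The article gives no detail of the proposed standard, so everything here is an assumption for illustration: the "access" field, the category names and the mapping from reader location to permitted categories are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueRecord:
    title: str
    access: str  # hypothetical privilege category embedded in the metadata


# Hypothetical mapping from where the reader is to what they may view.
PERMITTED = {
    "reading_room": {"open", "reading_room_only"},
    "remote": {"open"},
}


def visible_records(records, reader_location):
    """Return only the records this reader is permitted to view."""
    allowed = PERMITTED.get(reader_location, set())
    return [r for r in records if r.access in allowed]


# Sample records invented for the example.
records = [
    CatalogueRecord("Open-access report", "open"),
    CatalogueRecord("Subscription e-journal", "reading_room_only"),
]

for location in ("reading_room", "remote"):
    titles = [r.title for r in visible_records(records, location)]
    print(location, titles)
```

Because the privilege travels with each record, the catalogue can apply one filter centrally instead of relying on staff to classify every item by hand.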
Fortunately, there is one part of digital preservation that can
be done here and now: storing data objects. Once an item has been
received and validated as authentic, the object is digitally
signed, sealed and stored. The library uses an algorithm-based
digital signing engine from nCipher to create a sealed package that
is stored four times in multiple modes. "We cannot afford to lose
anything, and the probability of that happening with four copies is
minute," says Boulderstone. As the prices of disc and tape media
converge, discs will play a larger part in storage as the library
moves to store 300Tbytes of data over the next five years.
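The sign, seal and store step can be sketched in a few lines. This is not the nCipher engine, whose interface the article does not describe: the sketch swaps in an HMAC-SHA256 tag from the Python standard library as a stand-in for a digital signature, and the key, the function names and the independence assumption in the loss calculation are all hypothetical.

```python
import hashlib
import hmac

COPIES = 4  # the article says each sealed package is stored four times


def seal(data: bytes, key: bytes) -> dict:
    """Bundle the data with an HMAC-SHA256 tag (stand-in for a signature)."""
    tag = hmac.new(key, data, hashlib.sha256).hexdigest()
    return {"data": data, "tag": tag}


def verify(package: dict, key: bytes) -> bool:
    """Check that the stored data still matches its tag."""
    expected = hmac.new(key, package["data"], hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, package["tag"])


def store(package: dict, copies: int = COPIES) -> list:
    """Return independent copies of the sealed package."""
    return [dict(package) for _ in range(copies)]


def loss_probability(p_single: float, copies: int = COPIES) -> float:
    """Chance that every copy is lost, assuming independent failures."""
    return p_single ** copies


key = b"hypothetical-secret-key"
stored = store(seal(b"deposited publication", key))
stored[0]["data"] = b"corrupted"  # simulate damage to one copy
print([verify(copy, key) for copy in stored])
print(loss_probability(0.01))
```

If each copy had, say, a 1% chance of being lost and failures were independent, the chance of losing all four would be about one in 100 million, which is the intuition behind Boulderstone's "minute".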
As the library team mulls over the complexities of a digital
future, it can draw confidence from the fact that the management of
its physical stock is in good shape following the implementation of
a new library system from Ex Libris. The Israeli-developed software
package, called Automated Library Expandable Program (Aleph), was
adopted in 2000 with the purpose of replacing 16 legacy systems
that ran the core functions of managing and locating stock and
maintaining a catalogue.
Reducing cost was a big driver for the upgrade. Legacy
applications were in a gamut of languages from Assembler to Cobol
and were very expensive to maintain as some were over 30 years old.
In some instances, when the packages were retired the people
supporting them took the opportunity to retire too. "In one case we
were the only people in the world using the package," says Butcher
of the Elhill search engine, originally created in the 1960s by the
National Library of Medicine in the US.
As well as cutting costs, there was the desire to be part of a
modern information network. "As a national library, historically we
had felt 'different' and had built a lot of custom packages to
reflect our unique status. However, we took the positive view that
we now wanted to be the same," says Butcher. "A lot of our
functions are not so different from other libraries. Rather than go
the bespoke route of modifying packages, we wanted to have the
benefit of sharing experiences with other users."
The British Library issued an invitation to tender and evaluated
five contenders. An end-user requirements group consisting of
library staff and readers sampled products and gave feedback on
aspects including the intuitiveness of the user interface and
search function. A priority was to choose a system whose workflow
suited the library's diverse staff. "With our
legacy systems, we had many people working in silos. With one
integrated system we had the chance to reorganise workflows and a
big issue was how these systems looked," says Butcher.
Although keen to join the mainstream with its system, the
British Library was conscious that its size could complicate
matters. It drew comfort from the fact that two other library
colossi, the Russian State Library and the National Library of
China, were users of the Ex Libris system. Another plus for the
product was the work Harvard University had done with Ex Libris on
streamlining the workflow. The British Library acquires 100,000
books and 300,000 journals each year from legal deposit alone. "The
streamlined workflow means we are able to cope with these sorts of
volumes," says Butcher.
The integrated package was rolled out and the legacy systems
phased out, but work is still in progress. The latest phase is
incorporating the library's 4.5 million manuscripts, and 750,000
are now in the system. However, this phase brings fresh challenges,
as manuscripts are unique and filed just once by their authors. For
this reason they need far more descriptive records, and the normal
short metadata tag provided by the library system is insufficient
for the wordy descriptions that are normally filed.
Whatever technical challenges in building a digital future lie
ahead for the national library, it can face them as part of a
network of peers. Since its decision to use the Ex Libris system,
the British Library has hosted the Aleph annual user conference and
has been pleased with the rewards of community, says Butcher. "We
feel as if we have come in from the cold. Instead of being on our
own we can look to a worldwide pool of expertise to help solve
problems."
Ex Libris library software
The development of Ex Libris’ Automated Library Expandable
Program (Aleph) system started in 1980, when a team of librarians,
systems analysts and programmers took on the challenge of creating
an automated library system that was efficient, user-friendly, and
multilingual.
Prior to Aleph, searching and displaying texts in different
languages and scripts, such as Latin, Greek or Arabic, was an
immense task, further complicated by the fact that in some
scripts, a character changes form according to the character that
precedes it.
Four years ago, the emergence of Unicode as a universal standard
to deal with multiple character sets removed Aleph’s unique selling
proposition. Since then, Ex Libris has focused on developing
functionality that enables libraries to better share and deliver
online material.
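To illustrate the Unicode point above: a single Unicode string can hold characters from several scripts at once, each with its own code point and name. The sketch below uses Python's standard unicodedata module; taking the first word of a character's name is a rough heuristic chosen for this example, not the formal Unicode Script property, and the helper name is hypothetical.

```python
import unicodedata


def script_of(ch):
    """Return the first word of the character's Unicode name, e.g. 'GREEK'.

    A rough heuristic, not the formal Unicode Script property.
    """
    return unicodedata.name(ch).split()[0]


# One string, three scripts: Latin a, Greek alpha, Arabic beh.
mixed = "a\u03b1\u0628"

for ch in mixed:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), script_of(ch))
```

Context-dependent letter forms, as in Arabic, are handled when the text is rendered, so the stored string keeps one code point per letter whatever its neighbours.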
Initially Ex Libris released SFX, which constructs and embeds
hyperlinks to electronic resources. A year later it launched
Metalib, a generic search interface that functions across a broad
pool of material globally. Digitool, released in 2002, turns
libraries, archives and museums into publishers and enables them to
share collections while protecting content and copyright.