Case Study: Digitising the British Library

The British Library has a reputation for innovation, but despite its ongoing digitisation project the institution has so far made just 1% of its vast 5 billion-page catalogue digital. Kathleen Hall speaks to the organisation about some of the key projects powering its digital preservation programme.

“To date we’ve digitised 1% of the collection – and that’s nearly 20 years of endeavour,” says Richard Boulderstone, director of e-strategy and information systems at the British Library. “We don’t have a set date for when we expect the full catalogue to be digital, but in truth it will be beyond our lifetimes.”

At the moment, processing data is an uphill struggle for the British Library, as information is coming in faster than it can be processed. “Over time content coming to us will be in digital formats. So we are accelerating digitisation at ever increasing rates,” says Boulderstone.

The library has been digitising its catalogue since the early 1990s, with Kindle owners now able to view first editions from the likes of Charles Dickens and Jane Austen in the original typefaces, with original illustrations. It currently has a three-figure list of digitisation projects underway, but the most pressing issue for the organisation is the limitation of funds. The library is primarily funded by the Department for Culture, Media and Sport, which previously helped fund projects, but in the current environment of austerity funding now comes almost exclusively from commercial organisations, says Boulderstone.

The solution is to do a few million pages at a time. Past partnerships include projects with Microsoft, archive digitisation company ProQuest and Google. “The library’s project with brightsolid [an online publishing company] consists of 40 million pages over the next two years for our newspapers, and Google is digitising around 250,000 out-of-copyright books, totalling around 40 million pages. But that’s relatively modest compared to the 15 million books we have,” he says.

“I don’t think many companies can come in with the scale of Google. But there are smaller initiatives out there. Often about a single item in the library and having a sponsor or benefactor interested in making the documents available to a larger audience,” says Boulderstone.

Preservation has so far centred on the library’s historic print and text catalogue. “We’ve got a fantastic map collection, and if anyone wants to spend money on that, that would be fantastic. But such things are tricky to digitise. For example, we would like to geocode it but the tools to do that are very limited,” says Boulderstone.

Compared with other countries, the UK has less incentive to preserve records of its language for posterity. “The ubiquity of the English language can make it challenging for the library to receive dedicated funding, unlike other EU countries that have made a statement they want to digitise their own language – for example France,” says Neil Fitzgerald, project manager at the library.

Digital scanning

Fitzgerald has recently finished overseeing a digital scanning project. Although the Improving Access to Text (IMPACT) project has not yielded significant additional content for the library’s digital catalogue, it will provide an important tool for researchers looking to data mine its digital collection, he says.

“The scanning process is similar to creating a picture book, which is just taking images. But as a picture book it’s not that useful, as that does not enable keyword searches,” adds Boulderstone.

“For R&D purposes we collated around 600,000 existing pages from all the libraries working in the project from previous digitisation initiatives,” he says. From the British Library this included a mixture of books, newspapers and single-sheet material published between the 17th and 19th centuries.

The library worked with optical character recognition (OCR) specialist ABBYY on the project, which was a major European Commission-funded programme with 26 partners across 13 countries. It received more than $12 funding for the project.

“The process of converting a document [into a digitally-readable text] is much more than the character recognition part. It’s about image pre-processing, image cleaning, analysing the structure of images. Once we’ve found that we have to break down and read characters individually, then we have to re-assemble the document using a dictionary,” says Michael Fuchs, director at ABBYY.
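The final step Fuchs describes, re-assembling recognised characters with the help of a dictionary, can be illustrated with a minimal Python sketch. This is not ABBYY's method: the tiny word list, the similarity cutoff and the function names are all assumptions made for illustration, and the matching uses the standard library's generic fuzzy matcher rather than anything tuned for historical text.

```python
import difflib

# Hypothetical mini-lexicon; a real system would use a large,
# period-appropriate dictionary rather than five words.
LEXICON = ["library", "digital", "newspaper", "collection", "history"]


def correct_token(token, lexicon=LEXICON, cutoff=0.8):
    """Snap an OCR-recognised token to its closest lexicon word.

    If no lexicon word is similar enough (below `cutoff`), the token
    is returned unchanged rather than guessed at.
    """
    lowered = token.lower()
    if lowered in lexicon:
        return lowered
    matches = difflib.get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def reassemble(tokens):
    """Join corrected tokens back into a line of text."""
    return " ".join(correct_token(t) for t in tokens)


if __name__ == "__main__":
    # The digit "1" is a typical misreading of the letter "l".
    print(reassemble(["digita1", "co1lection"]))
```

Leaving unmatched tokens untouched is a deliberate choice in this sketch: for historical spellings that are absent from a modern dictionary, a wrong correction can be worse than none.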

The library hopes to enable parts of its catalogue to be made fully text-searchable on mobile phones.

The four-year IMPACT project will come to an end by December and will be replaced by a Centre of Competence, which aims to improve the digitisation of historical printed text in Europe and to develop and foster the project’s research further. But the Centre of Competence will need to be a self-sustaining entity and so will be looking for partnerships with private sector organisations seeking to digitise material, says Boulderstone.

Future digitisation strategy

The library is keen to digitise 20th century material but can’t at the moment because of copyright – but this issue of text mining is up for discussion under the Hargreaves Review. “In terms of academic purposes this ability would be of high interest to our users,” says Boulderstone.

Other regulatory changes underway include web archiving under legal deposit regulations. The Legal Deposit Libraries Act 2003 allowed the library to store print and digital formats. The law has been extended, and the library has been working with government, publishers and others for the last eight years to try to get the regulations changed to allow it to store web archives.

The library is hoping to see the laws on the retention of digital material change this year. Then it will be able to carry out web trawls of UK domains, each estimated to reach around 150 terabytes. “We are hoping we will get the ability before the Olympics, as there are a lot of websites and information which will have a short life related to that. It’s important to preserve more ephemeral things like websites that don’t last long, as that will be interesting to researchers looking back in history,” says Boulderstone.

It expects to conduct trawls up to twice a year, but this will depend on its storage budgets, as 150TB amounts to a lot of data. “Fortunately the scale and cost of disk space storage has dramatically fallen, with costs having already fallen by 20% since last year,” he says.

However, the library’s ambitions are growing far faster than storage costs are falling, so financial constraints will still apply. “Digital projects are storage hungry by their very nature: one page alone is around 10MB as the scanning needs to be high quality. While capacity is getting cheaper our desire to store is increasing five times that,” says Boulderstone.
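To give a rough sense of scale, the figures quoted in this article can be combined in a back-of-envelope calculation. The Python sketch below is illustrative only: it assumes every page really does occupy about 10MB, uses decimal units (1TB = 1,000,000MB), counts a single copy of the data, and takes the maximum of two 150TB trawls a year. The constant and function names are invented for the example.

```python
# Figures quoted in the article; treating them as exact is an assumption.
PAGE_SIZE_MB = 10            # "one page alone is around 10MB"
TOTAL_PAGES = 5_000_000_000  # the 5 billion-page catalogue
CRAWL_SIZE_TB = 150          # estimated size of one UK domain trawl
CRAWLS_PER_YEAR = 2          # "up to" twice a year, so this is the upper bound
MB_PER_TB = 1_000_000        # decimal units, an assumption


def pages_to_tb(pages, page_size_mb=PAGE_SIZE_MB):
    """Storage needed for `pages` scanned pages, in terabytes (one copy)."""
    return pages * page_size_mb / MB_PER_TB


digitised_tb = pages_to_tb(TOTAL_PAGES // 100)  # the 1% digitised so far
full_tb = pages_to_tb(TOTAL_PAGES)              # the whole catalogue
crawl_tb_per_year = CRAWL_SIZE_TB * CRAWLS_PER_YEAR

if __name__ == "__main__":
    print(f"1% of the catalogue: {digitised_tb:,.0f} TB")
    print(f"Whole catalogue:     {full_tb:,.0f} TB")
    print(f"Web trawls per year: {crawl_tb_per_year:,} TB")
```

Under these assumptions the pages digitised so far come to about 500TB, and the full catalogue to about 50,000TB (50 petabytes) per copy, before any allowance for replication across sites.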

Over the last five years the library has built a digital library storage system to preserve its work to date. “This is to ensure the content we have is stored forever and not corrupted or lost. We have four complete copies of content in London, Yorkshire, Aberystwyth in Wales and Edinburgh. Each of these copies contains the strategic collection. Everything is signed with a high encryption rating, so it’s a secure environment for long-term preservation,” he says.

According to the library’s chief executive Dame Lynne Brindley, just 25% of the world’s books will be published in print form alone by 2020. This is one of the reasons the institution has emphasised the importance of its digital transition.

“Around two years ago the iPad did not exist, and that has changed the way people consume digital documents.” He says browsing historical documents is moving away from being the preserve of academics towards entertainment and hobbyists. The content from Microsoft’s project is now available to be accessed through the iPad. “In a few years people will expect to access catalogues in this way,” he says.

Box out: Why digitise?

In its mission statement, Dame Lynne Brindley, chief executive of the British Library, notes: “Innovation in the technology landscape has led to a world in which the creation, storage, access and dissemination of knowledge have been completely and irrevocably changed. In the past, we have developed three-year visions and accompanying strategies setting out how we would achieve our vision. Given the enormity of recent changes (just remember the iPhone, Facebook, YouTube and Twitter did not exist ten years ago), we think it important to look ten years ahead.”

Five key themes of the library’s strategic priorities for 2020:

Guarantee access for future generations

Enable access to everyone who wants to do research

Support research communities in key areas for social and economic benefits

Enrich the cultural life of the nation

Lead and collaborate in growing the world’s knowledge base



