British Library archives UK websites with new software

The British Library has completed a two-year trial of web archiving technology that will see it preserve terabytes of information for future generations.

It is preserving information on events such as the credit crunch, Antony Gormley's Trafalgar Square Fourth Plinth Project, and material on the 2010 General Election for future research.

The 400-year-old library has an obligation to archive relevant information for future generations and research, and over the past 10 years public information has increasingly been published on the internet. But many websites remain live for only between 44 and 75 days, and much of the information is in danger of being lost as the sites close.

The library is using IBM software called BigSheets to search web pages, extract relevant data and analyse it for patterns and trends. It will need to repeat searches every 40 days.

The library was given permission in 2003 to crawl through and capture information from between 5,000 and 6,000 UK websites, which produced more than 50Tbytes of data.

The library will need further legal permission before it has the right to crawl through the whole UK web domain. If granted, the work will produce an estimated 128Tbytes of data from around eight million websites.

Some UK companies use .com domain addresses, which will add further to the data the library needs to store.

To archive the information cleanly, the software needs to crawl the websites, download the pages and then confirm that the content it expects to find is actually there. It must also examine the content and decide where the value lies, processing it and presenting the data in a way that is useful to researchers.
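As a rough illustration of that crawl-and-verify step, the Python sketch below fetches a page, stores a snapshot and records a checksum so a later crawl can confirm the captured content still matches what was expected. It is a minimal outline only; the URL, file paths and function names are hypothetical and are not taken from the library's or IBM's actual systems.

import hashlib
import json
import pathlib
from datetime import datetime, timezone

import requests

ARCHIVE_DIR = pathlib.Path("archive")  # hypothetical local store


def archive_page(url: str) -> dict:
    # Fetch a page, save a snapshot and record a checksum so later
    # crawls can confirm the captured content is still what was expected.
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    body = response.content
    digest = hashlib.sha256(body).hexdigest()
    captured_at = datetime.now(timezone.utc).isoformat()

    ARCHIVE_DIR.mkdir(exist_ok=True)
    (ARCHIVE_DIR / f"{digest}.html").write_bytes(body)

    record = {"url": url, "sha256": digest, "captured_at": captured_at}
    (ARCHIVE_DIR / f"{digest}.json").write_text(json.dumps(record))
    return record


# Placeholder URL for illustration only.
print(archive_page("https://example.org/"))

Repeating such a crawl on the 40-day cycle mentioned above and comparing checksums is one simple way to spot pages that have changed or disappeared since the last capture.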

David Boloker, CTO for emerging internet technology at IBM, said, "At some point in the future, people are going to want to look at the patterns shown through these websites. The software allows you to find all the relationships of the data, from the most simplistic keyword searches to looking at it more semantically - for example, not just whether the Conservatives are mentioned, but whether they are mentioned in a satirical way."

The software will have implications for many organisations, including the BBC and GlaxoSmithKline, which could use it to search the results of drugs trials, Boloker said.

"We are on data overload, and the key question is how we get beyond the sea of information and actually find the important pieces."
