The British Library is about to start archiving 4.8 million UK websites and one billion web pages, following legislation that will come into force this week.
The British Library has piloted the archive project, which aims to preserve a historical record of British web activity, and will now be free to take a snapshot of the UK internet every year without breaching copyright law.
Planning for the project has taken 10 years. When complete, visitors will have access to the archive database.
The move follows statutory changes under the Legal Deposit Libraries (Non-Print Works) Regulations 2013, which mean the British Library will not breach copyright law.
The British Library has been using the open source software Heritrix to crawl the web, with management and quality assurance software developed in-house to manage the search engine. The library is budgeting approximately 100 terabytes of storage for the first web crawl.
The Web Curator Tool software used to manage the crawling engine was developed jointly by the British Library and the National Library of New Zealand.
Richard Gibby, project leader, said: “The scale is something we’ve learnt a lot about. We are almost stretching the limits of the software.
“For example, things like the metadata and logs the crawling software records are becoming huge and bigger than we anticipated. So there were issues around how we set up disk space on servers to simply log the metadata. Not overloading the webserver, that is the sort of thing we’ve been testing at this scale.
“Clearly the challenge is now to do the first real actual crawl, which will be starting next week. It will take several months of effort to go through the process, with the results available in January 2014.
“This will be a real treasure trove to allow grandchildren to look back and understand how life was lived, what we cared about and said and felt. It is a real treasure trove of information. The aim is to build that incredible picture for future generations.”