Sourcing data from way back

The Alexa Internet archive stops big, powerful companies from trying to bury events

Brewster Kahle is one of the true, if largely unsung, heroes of the Internet revolution. In 1989, he invented the Wide Area Information Server (WAIS) system, which allowed global searches to be performed across the Internet.

Had the Web not come along soon afterwards, it is quite possible that the online searches we conduct today would be using WAIS technology. Instead, WAIS failed to take off, and now there are only a few vestiges left, such as some WAIS server software, and a Web-WAIS gateway to the few remaining WAIS databases.

Undaunted, Kahle went on to found Alexa in 1996.

As this column has described previously, this is a type of browser add-on that offers information about Web sites and gives suggestions about related pages that might be worth visiting.

Once again Kahle was well ahead of the pack. His way of monitoring Alexa users' choices to generate lists of further sites that might be of interest is very close to the method that the later Google search engine would adopt to rank its hits.

But, in retrospect, another of Alexa's activities may prove more important.

Alongside its various free product offerings, Alexa has not only used the activities of users to create lists of links, but it has also stored the Web pages they visit over the years to create a huge Web archive.

Alexa is currently gathering over 250Gbytes of data each day, which it holds on GNU/Linux boxes that provide storage at just $4,000 per Tbyte.
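To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes decimal units (1 Tbyte = 1,000 Gbytes) and that the quoted price covers raw storage only; the function and constant names are purely illustrative and have nothing to do with Alexa's own systems. The 100Tbyte total is the archive size quoted elsewhere in this column.

```python
# Figures quoted in the column.
GBYTES_PER_DAY = 250          # "over 250Gbytes", so treat this as a lower bound
DOLLARS_PER_TBYTE = 4_000

# Assumption: decimal units, 1 Tbyte = 1,000 Gbytes.
GBYTES_PER_TBYTE = 1_000


def storage_cost(gbytes: float) -> float:
    """Return the cost in dollars of storing `gbytes` at the quoted rate."""
    return gbytes / GBYTES_PER_TBYTE * DOLLARS_PER_TBYTE


daily_cost = storage_cost(GBYTES_PER_DAY)
yearly_cost = storage_cost(GBYTES_PER_DAY * 365)
archive_cost = storage_cost(100 * GBYTES_PER_TBYTE)   # the 100Tbyte archive

print(f"New storage per day:   ${daily_cost:,.0f}")
print(f"New storage per year:  ${yearly_cost:,.0f}")
print(f"Whole 100Tbyte archive: ${archive_cost:,.0f}")
```

On these assumptions, each day's intake needs at least $1,000 of new storage, and holding the whole archive works out at around $400,000. That is a sizeable sum, but a modest one for a record of this scale.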

This Internet Archive has hitherto only been available to scholars, but with the launch of the Wayback Machine this resource is now available to all through an easy-to-use Web interface. Further background on both the Wayback Machine and the Archive is available online.

There are a number of special collections, including one that holds early documents about the birth of the Arpanet and Internet. Ironically, this requires a special DjVu plug-in from LizardTech, and does not employ HTML.

Happily, the archive of Web pioneers sticks with standard HTML pages, and offers a fascinating glimpse of the Web as it was about five years ago.

The whole 100Tbyte archive of Web pages can be accessed simply by entering the URL of a site. This brings up a complete listing of snapshots, arranged in chronological order. Clicking on one of these brings up the page as it was on that date. Impressively, many of the internal links still work too.

There are also advanced search options, which are particularly powerful. For example, entering a wildcard pattern of the form */* brings up a list of all the archived pages from Microsoft's main Web site - some 20,000 of them.

This is a splendid resource, and one that will provide hours of education - and entertainment - to Internet users. For example, companies will doubtless find it useful to revisit how both their own and their rivals' pages have developed over the years, and to learn from classic sites such as Yahoo or

The Wayback Machine will be invaluable to Web historians who wish to track how things developed. But it is important to emphasise that this is not just an academic exercise in Net archaeology.

This resource, for all its obvious faults in terms of incompleteness, offers a chance to establish who really did say what when, and to put a brake on any online player - particularly big and powerful ones - that might try to rewrite history by changing and discarding inconvenient Web pages.

Next week: The settlement
