Frédéric Prochasson - Fotolia

TfL shifts corporate archive to AWS public cloud in digital preservation push

London transport authority has begun moving its first 3TB of corporate archive data to the AWS cloud with the help of Preservica

About 35p of every pound that passengers spend with Transport for London (TfL) is reinvested in projects designed to future-proof the capital’s public transport system.

The organisation is responsible for overseeing the day-to-day running of the London Underground, Overground, bus, river and road network, as well as the city’s cycle hire scheme.

And as the number of people who rely on these services to navigate the city grows year by year, it is TfL’s job to ensure the transport network can cope in terms of capacity and performance in the long term.

In recent times, this has seen TfL embark on lengthy renovation projects at a number of London Underground stations and play an active role in the high-profile Crossrail development, while pushing ahead with its regular schedule of engineering works.

Then there are events such as the London 2012 Olympic Games, which may require TfL to lay on additional services or launch advertising campaigns to advise people how best to get round the city when there are millions more people using its network than normal.

Some of these projects require years of preparation and, in the case of Crossrail, stand to have a transformative impact on not only how the city works, but also how it looks.

All the work that goes into these projects is recorded through planning documents, photographs and other miscellaneous paperwork, all of which needs to be retained for legal, procedural and cultural history reasons, says TfL corporate archivist Tamara Thornhill.

In the case of the London Olympics, all the planning and strategy documents, passenger research and transport maps that were created for that event are frequently requested by other cities around the world that are staging events of a similar size and scale.

“There is a whole range of reasons why people want to use the archive and the information inside it,” said Thornhill. “Over the past five years, we have seen a 500% increase in our internal user base, and they are coming to us to help prove who owns the rights to certain patents or to work out where the boundaries may lie when it comes to solving disputes around new land investments within TfL.

“Sometimes it’s because they’re revisiting projects that were ruled against pursuing 20 years ago, and need to know what the justification was at the time for not pressing ahead with it.”  

History in the making

The organisation also needs to keep hold of any corporate information relating to its own development, the roots of which date back to the 19th century, she said.

“TfL has only existed since 2000, but it is just the latest incarnation of a transport authority in London that dates back to 1902. Then, if you start to look at the individual railway and bus companies, the history of the organisation goes back even further – to the 1820s.

“All the records we have reference the development of this organisation and how it has gone about its business, why it has made some of the decisions it has, how it has involved the public in those decisions, and how all this has impacted on the environment and the city as a whole.”

Part of Thornhill’s job is to ensure that the paper and digital records associated with all this work are correctly catalogued, digitised (if they need to be) and kept in a good administrative state, so they can be readily accessed by the people who need them.

However, given the age and size of TfL, not to mention the breadth of its current responsibilities, it can be a challenge to get hold of this information so that it can be securely stowed away in the archive.  

This task is made harder by the fact that the data Thornhill’s team requires is often stored on various types of media – including removable hard drives, USB sticks, shared network drives and electronic document and record management systems (EDRMS) – in TfL’s many departments.

“The data is really spread about all over the place – but that’s just the data we know about, have access to or responsibility for,” she said. “It amounts to around 3TB at the moment, but that’s really just scratching the surface.”

Protecting and preserving

In recent months, Thornhill’s department has begun a drive to shift the data it does know about to a centralised, off-premises repository within the Amazon Web Services (AWS) public cloud, with the help of digital preservation software provider Preservica.

Data backup is part of the motivation for this work, but there is a more pressing need to ensure all the information is held in file formats that will remain readable long into the future, she said.

“Paper records you can store in a secure environment, and it’s pretty much going to be okay in there if you close the door and open it again in 200 years, whereas with digital records, you need to guard against format obsolescence, as new types of hardware and software come onto the market all the time,” said Thornhill.

That is one of the reasons TfL chose Preservica’s cloud-based digital preservation system over a more traditional content management system: it allows end-users to regularly scan their data stores to see whether any of the file formats they use are at risk of becoming outdated.

If they are, the company offers them up to 300 different tools with which to update their file formats to ensure the data remains accessible for future generations.
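
As a rough illustration of what such a scan involves, the sketch below counts files whose extensions appear on a watch list of ageing formats. The watch list and the function name are hypothetical and are not taken from Preservica; real preservation tools typically identify formats from file contents rather than from extensions alone.

```python
from collections import Counter
from pathlib import PurePath

# Hypothetical watch list for illustration only; not Preservica's registry.
AT_RISK_EXTENSIONS = {".wpd", ".wk1", ".mdi"}


def format_risk_report(paths):
    """Count files per at-risk extension, ignoring case."""
    counts = Counter()
    for path in paths:
        extension = PurePath(path).suffix.lower()
        if extension in AT_RISK_EXTENSIONS:
            counts[extension] += 1
    return dict(counts)
```

A report of this kind would tell an archivist which formats to migrate first, which is where the conversion tools come in.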

“Preservica was pretty much the only provider that met all our requirements, because we needed something that would deal with all the different systems on the TfL estate and would allow us to automate the way some of this data is ingested into the digital preservation system, while still providing us with an option to manually intervene should we have to,” said Thornhill.

Ensuring that this data remains accessible for many years to come is also one of the reasons why Preservica hosts its system in the AWS cloud, said Preservica CEO Jon Tilbury.

“AWS has such a high level of investment in its cloud, which suggests it is likely to be around for a very long time,” he said.

“Also, its storage is so durable, each file stored in three different datacentres with multiple copies in each one, and they are constantly checked. So, when one server fails, they can self-heal from one of the others.”
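
The checking and self-healing Tilbury describes can be pictured as a fixity check: each copy is compared against a known checksum, and any copy that fails is replaced from one that passes. The sketch below is a simplified model under that assumption, with made-up replica names, and is not a description of how AWS implements its storage.

```python
import hashlib


def sha256(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def heal_replicas(replicas: dict, expected_digest: str) -> list:
    """Replace corrupted copies with an intact one; return the repaired names.

    `replicas` maps a hypothetical datacentre name to the bytes stored there.
    """
    intact = next(
        (data for data in replicas.values() if sha256(data) == expected_digest),
        None,
    )
    if intact is None:
        raise ValueError("no intact replica available")
    repaired = []
    for name, data in list(replicas.items()):
        if sha256(data) != expected_digest:
            replicas[name] = intact  # overwrite the bad copy in place
            repaired.append(name)
    return repaired
```

Running a check like this on a schedule is what allows a failure in one location to be repaired before the remaining copies are put at risk.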

Access all areas

At the moment, the archiving team is still in the process of establishing rules and workflows within the Preservica system that will govern how involved Thornhill’s team needs to be in moving data there, which will largely come down to what type of information it is and who created it.

“We haven’t got huge amounts of data in there at the moment, as we’ve been testing to see what kinds of workflow we need to have in place in the application, but we’re certainly in a position now to start ingesting large amounts of material,” said Thornhill.

For the time being, her department will focus on getting that first 3TB into the cloud for preservation purposes, before turning its attention to making it accessible to internal TfL stakeholders, as well as the general public.

“The internal access will come first, because our primary focus is on serving the business, but we want the public to be able to have access to as much of it as possible, although they will never be able to access it all for security reasons,” she added.

The corporate archive department also plans to plough on with its regular in-house awareness campaigns that are geared towards uncovering even more archivable content stored on various staff devices.

In a similar vein, the corporate archive department has also started working closely with TfL’s record management team, which helps it identify documents within certain departments that need to be preserved for ever.

“It’s all about getting a handle on where the files are and how we can start to do something about preserving them, but we’re such a huge organisation, it’s going to take a long time to work around all the areas of our business,” said Thornhill.
