Your data is only as good as the media it is stored on and the format that it is stored in, writes Danny Bradbury – and in the long term, both are transient at best
Archaeologists don’t just dig in the ground these days; they dig in the sky too. Alexander Rose, executive director of California-based think tank The Long Now Foundation, explains that in its early days, Nasa sent signals back from satellites to research centres that stored the data on tapes using custom-built machines. Years later, archaeologists asked to examine the data to see what had changed on the earth.
“Nasa thought they could get the data out, but it turns out they did not have the machines to read it. Once they cobbled together the machines, they found out the tapes had degraded to the point where they could not get all the information off,” says Rose.
When it used error correction to help rebuild the damaged data, the US space agency realised it did not know how the data was organised, and all the people who originally worked on it had retired or died.
Digital technology was meant to make us more information-rich, but archivists such as Rose use the term “digital dark age” to describe the early 2000s. There is more data produced and stored than ever before, but storage media is volatile, and data storage formats doubly so.
Load the DVD you burned last week, and the data will hopefully still be there. But what about in five years? Or 10 years? Or 30?
“It depends on what environment they are being kept in,” says Andy Maurice, head of consultancy at records management company Iron Mountain. When CDs were launched in the early 1980s they were lauded as permanent.
“We have seen recently that people with CDs have found some form of bloom on them, which means they are decaying,” he says. When optical media holds your corporate data, it could well last for decades, especially in a clean-room environment, but it might not. How much are you willing to risk?
There is a similar problem with magnetic media, says Brewster Kahle, head of the Internet Archive, a project that tries to preserve as much data from the web as possible.
“I cannot get floppy drives from 10 years ago to work, and hard drives die too. You have to keep refreshing the media,” he says.
Floppy discs and their cousins, such as high-density Zip discs, were inherently vulnerable because the read head touched the surface of the disc.
Even though hard drive heads float above the disc without touching it, they are also vulnerable. Higher data densities can lead to self-demagnetisation as minute magnetic cells flip their polarity – something known as the super-paramagnetic effect. The surface of the media can corrode, or particle contamination can kill the drive. Adhesives used to hold components of the drive together can also break down, leading to rubbing or contact with the disc.
Some companies are taking a long-term view of the problem. Iomega’s Rev system, introduced in 2004, is a removable disc sealed in a cartridge containing the media and the spindle, with the rest of the electronics in a separate reader. The disc and reader are sealed to reduce contamination, giving the media what Robert Lutz, worldwide product manager at Rev technology, claims is a 30-year life.
The reader is unlikely to last as long as the media, he admits, but adds that because the media uses standard 2.5in discs inside the otherwise proprietary cartridge format, a savvy engineer could cobble together a read/write device 20 years hence if Iomega was no more and the Rev drive no longer existed. However, for your average archivist wanting a perpetual storage mechanism without the hassle of reconstructing hardware, that may be cold comfort.
While some people worry about media quality, Tony Dearsley worries about media formats. As senior consultant on corporate projects at computer forensics firm Vogon, he is often asked to trawl through historical data.
“Often people have archived material but it is in a tape format that they do not have the equipment to read it with any more,” he says.
Tape drives are a problem because there are so many different types of tape format. “Take a 3590 drive, which has variations B, C, D and E. None of them are compatible with each other,” says Dearsley. He maintains a collection of drives covering the past 25 years simply to read customer data.
Surely users can just keep old versions of media readers in pristine condition themselves? Sure, says Maurice, just so long as you maintain the correct version of the back-up application, the operating system and the machine to run it on – and then replicate that for each different media format you have.
One obvious way to solve the problem is frequent migration. Migrating your data from one storage medium to another every few years will help to ensure it stays current. Iron Mountain offers a media transcription service for corporate data. Transferring data between media helps with preservation but also increases space efficiency, says Maurice. “Technology is moving so rapidly that what you stored on 1,000 tapes five years ago can probably fit onto 10 modern format tapes.”
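A migration is only useful if the copy is faithful to the original. The sketch below shows one way that check might look in Python: it copies a file to new storage and compares checksums before and after. The function names and the choice of SHA-256 are illustrative assumptions, not a description of Iron Mountain's transcription service.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_file(source: Path, destination: Path) -> str:
    """Copy source to destination and confirm the copy is bit-identical.

    Hypothetical helper: raises IOError if the checksums differ, so a
    bad copy is never silently accepted as the new archive master.
    """
    before = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)  # copy2 also preserves timestamps
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"checksum mismatch migrating {source} to {destination}")
    return after
```

Keeping the returned digest alongside the new copy would let the next migration, a few years on, confirm that nothing has decayed in the meantime.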
But replacement becomes more difficult and expensive as the volume of data that a company stores increases. Marshalling the storage and back-up of data in a single place can be a daunting prospect when you move from gigabytes to terabytes and then into petabytes. Companies already have vast amounts of data stored, but with the ubiquitous storage of sensor and video data just a few years away, the problem will become increasingly urgent.
Decentralisation – farming out the archiving process to the organisations or divisions that create the data – could be an interim solution. The decentralisation discussion becomes increasingly relevant as you begin dealing with organisations whose remit is to gather and preserve huge amounts of data from the public sphere, such as archives and libraries.
“The problem we are facing is that by the time the large organisations get some material that they are potentially interested in, if it is an obsolete technology and it has no documentation they cannot preserve it even if they want to,” says Maggie Jones, former executive director of the Digital Preservation Coalition, an organisation that promotes good practice for digital archivists.
“We need to start spreading out that responsibility so more people are responsibly creating digital information.”
For companies trying to preserve increasing volumes of data, another likely strategy involves triage. Deciding which data to keep and which to discard helps a company manage the volume of information residing on its drives.
Companies will need to refer to industry regulations when determining this, and also refer to the data itself, assessing its value to the business.
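As a rough illustration, a triage rule of this kind could be expressed in a few lines of Python. Everything specific here is an assumption made for the example: the record fields, the category names, the retention periods and the value score are hypothetical, and real figures would come from the regulations that apply to a given business.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    name: str
    category: str        # hypothetical label, e.g. "financial" or "scratch"
    created: date
    business_value: int  # hypothetical score: 0 (none) to 10 (critical)


# Hypothetical minimum retention periods in years. Real values depend on
# the regulations that apply to your industry and jurisdiction.
RETENTION_YEARS = {"financial": 7, "personnel": 6}


def triage(record: Record, today: date, value_threshold: int = 5) -> str:
    """Return "keep" or "discard" for one record."""
    age_years = (today - record.created).days / 365.25
    required = RETENTION_YEARS.get(record.category, 0)
    if age_years < required:
        return "keep"  # regulation still requires it
    if record.business_value >= value_threshold:
        return "keep"  # no longer mandated, but still valuable
    return "discard"
```

The regulatory test comes first so that a low value score can never cause a record to be discarded while it is still legally required.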
Assuming the physical storage format for a piece of data can be preserved, there is then the problem of preserving the logical format. The formats used to produce data by some of today’s applications may be ubiquitous, but it is difficult to tell if they will still be readable in 2020, or 2050.
“If you try to read a document written in Wordperfect 3.2, that is probably very difficult, and that was only 10 years ago,” says Kahle.
The key to preserving file formats for future use after the applications are long dead is to use metadata, says Kevin Schurer, director of the UK Data Archive, which works with the National Archives to store social science data.
Storing metadata about the data format that you are preserving enables you to understand it later on. It could give future archivists the opportunity to create an application that will consume it or transcribe it into a contemporary format.
“XML holds a lot of promise, in that it is flexible. It is just rendering things with tag markups,” says Schurer, adding that the UK Data Archive preserves its information using an XML document schema developed under the Data Documentation Initiative, an international effort to define a standard way of storing social science data.
“We use that because it allows data and metadata to be tagged up within a common file, so that you can preserve the metadata and data alongside each other,” he adds. “Another advantage of an XML type approach is that most XML files are rendered to an ASCII character set, meaning they are easier to preserve.”
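To make the idea concrete, here is a minimal Python sketch of data and metadata tagged up in a single XML file and written out as plain ASCII. The element and attribute names (record, metadata, field, data, row, value) are invented for illustration; they are not the Data Documentation Initiative schema that the UK Data Archive uses.

```python
import xml.etree.ElementTree as ET


def build_record(metadata: dict, rows: list) -> bytes:
    """Bundle descriptive metadata and data rows into one XML document.

    The tag names are hypothetical, chosen only for this example.
    """
    root = ET.Element("record")
    meta = ET.SubElement(root, "metadata")
    for key, value in metadata.items():
        ET.SubElement(meta, "field", name=key).text = str(value)
    data = ET.SubElement(root, "data")
    for row in rows:
        row_el = ET.SubElement(data, "row")
        for column, value in row.items():
            ET.SubElement(row_el, "value", column=column).text = str(value)
    # Encoding as US-ASCII makes ElementTree write any non-ASCII
    # character as a numeric character reference.
    return ET.tostring(root, encoding="us-ascii")
```

Because the output contains only ASCII bytes, it can be opened in any text editor, and the metadata that explains each column travels in the same file as the values it describes.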
Using an agreed format is a wise move. Standardisation is an important tool in any archivist’s arsenal, says Kahle.
When dealing with media such as videos and photos, for example, it helps to have a published format that everyone uses, such as MPEG or JPEG, with the encoding and decoding requirements publicly available so future consumers can work out how to use them. “Even proprietary formats supported by large companies go unsupported after a while,” he says.
Sometimes, of course, the popular and the proprietary go hand in hand. Microsoft has maintained an iron grip on the desktop market for years, meaning that the majority of documents have been saved in its proprietary binary formats. Its shift into native XML for Office 2007 is a step in the right direction.
The company is trying to push the format through international standards organisation ECMA for standardisation, but the rest of the industry is working on the Open Document Format, another XML standard for storing office documents. Microsoft is not playing ball with standards body Oasis to fold the Open Document Format into Office.
The dispute over the formats has been so bitter that it led to the resignation of Peter Quinn, CIO for the State of Massachusetts. His decision to adopt the Open Document Format caused such adverse reactions that observers felt he was put in an untenable position.
Such rivalries may seem bitter, but they will fade into history, leaving only a legacy of inconsistent record formats. In 200 years’ time archivists trying to read today’s documents will have more unravelling to do.
And yet it is that vision, which thinks about the life of data over hundreds of years rather than decades, that some of today’s software and storage companies have yet to grasp.
Rose encapsulates the situation in an anecdote. “When they built the college at Oxford, they used beams that were made from giant oak trees. When they went to refurbish them 500 years later, they could not find that type of beam any more. All those trees were gone from Europe, in terms of lumber yards anyway,” he says.
“The forester found out about this and said ‘but we have the trees’. It turns out that 500 years ago they had planted the trees for those very beams. That is the kind of thinking you don’t see any more.”
But then, in technology, since when hasn’t politics stood in the way of the greater good?