
Your data is only as good as the media it is stored on
and the format that it is stored in, writes Danny Bradbury
– and in the long term, both are transient at best.
Archaeologists don’t just dig in the ground these days; they dig
in the sky too. Alexander Rose, executive director of
California-based think tank The Long Now Foundation, explains that
in its early days, Nasa sent signals back from satellites to
research centres that stored the data on tapes using custom-built
machines. Years later, archaeologists asked to examine the data to
see what had changed on the earth.
“Nasa thought they could get the data out, but it turns out they
did not have the machines to read it. Once they cobbled together
the machines, they found out the tapes had degraded to the point
where they could not get all the information off,” says Rose.
When it used error correction to help rebuild the damaged data,
the US space agency realised it did not know how the data was
organised, and all the people who originally worked on it had
retired or died.
Digital technology was meant to make us more information-rich,
but archivists such as Rose use the term “digital dark age” to
describe the early 2000s. There is more data produced and stored
than ever before, but storage media is volatile, and data storage
formats doubly so.
Load the DVD you burned last week, and the data will hopefully
still be there. But what about in five years? Or 10 years? Or
30?
“It depends on what environment they are being kept in,” says
Andy Maurice, head of consultancy at records management company
Iron Mountain. When CDs were launched in the early 1980s they were
lauded as permanent.
“We have seen recently that people with CDs have found some form
of bloom on them, which means they are decaying,” he says. When
optical media holds your corporate data, it could well last for
decades, especially in a clean-room environment, but it might not.
How much are you willing to risk?
There is a similar problem with magnetic media, says Brewster
Kahle, head of the Internet Archive, a project that tries to
preserve as much data from the web as possible.
“I cannot get floppy drives from 10 years ago to work, and hard
drives die too. You have to keep refreshing the media,” he
says.
Floppy discs and their cousins, such as high-density Zip discs,
were inherently vulnerable because the read head touched the
surface of the disc.
Even though hard drive heads float above the disc without
touching it, they are also vulnerable. Higher data densities can lead
to self-demagnetisation as minute magnetic cells flip their
polarity – something known as the super-paramagnetic effect. The
surface of the media can corrode, or particle contamination can
kill the drive. Adhesives used to hold components of the drive
together can also break down, leading to rubbing or contact with
the disc.
Some companies are taking a long-term view of the problem.
Iomega’s Rev system, introduced in 2004, is a removable disc sealed
in a cartridge containing the media and the spindle, with the rest
of the electronics in a separate reader. The disc and reader are
sealed to reduce contamination, giving the media what Robert Lutz,
worldwide product manager at Rev technology, claims is a 30-year
life.
The reader is unlikely to last as long as the media, he admits,
but adds that because the media uses standard 2.5in discs inside
the otherwise proprietary cartridge format, a savvy engineer could
cobble together a read/write device 20 years hence if Iomega was no
more and the Rev drive no longer existed. However, for your average
archivist wanting a perpetual storage mechanism without the hassle
of reconstructing hardware, that may be cold comfort.
While some people worry about media quality, Tony Dearsley
worries about media formats. As senior consultant on corporate
projects at computer forensics firm Vogon, he is often asked to
trawl through historical data.
“Often people have archived material but it is in a tape format
that they no longer have the equipment to read,” he says.
Tape drives are a problem because there are so many different
types of tape format. “Take a 3590 drive, which has variations B,
C, D and E. None of them are compatible with each other,” says
Dearsley. He maintains a collection of drives covering the past 25
years simply to read customer data.
Surely users can just keep old versions of media readers in
pristine condition themselves? Sure, says Maurice, just so long as
you maintain the correct version of the back-up application, the
operating system and the machine to run it on – and then replicate
that for each different media format you have.
One obvious way to solve the problem is frequent migration.
Migrating your data from one storage medium to another every few
years will help to ensure it stays current. Iron Mountain offers a
media transcription service for corporate data. Transferring data
between media helps with preservation but also increases space
efficiency, says Maurice. “Technology is moving so rapidly that
what you stored on 1,000 tapes five years ago can probably fit onto
10 modern format tapes.”
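A migration is only useful if the copy is identical to the original, so it is common to compare checksums before and after the transfer. The sketch below is a minimal illustration of that idea in Python, not a description of Iron Mountain's service; the function names `sha256_of` and `migrate_file` and the example paths are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_file(source: Path, destination: Path) -> str:
    """Copy source to destination and confirm the copy is bit-identical."""
    before = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)  # copies contents and timestamps
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"checksum mismatch migrating {source}")
    return after


if __name__ == "__main__":
    # Hypothetical stand-ins for an old and a new storage medium.
    with tempfile.TemporaryDirectory() as tmp:
        old = Path(tmp) / "old_medium" / "report.txt"
        old.parent.mkdir()
        old.write_bytes(b"quarterly figures")
        new = Path(tmp) / "new_medium" / "report.txt"
        print(migrate_file(old, new))
```

Keeping the returned digest alongside the archive lets a later migration check the data against the same value, rather than trusting that each intermediate copy was clean.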
But replacement becomes more difficult and expensive as the
volume of data that a company stores increases. Marshalling the
storage and back-up of data in a single place can be a daunting
prospect when you move from gigabytes to terabytes and then into
petabytes. Companies already have vast amounts of data stored, but
with the ubiquitous storage of sensor and video data just a few
years away, the problem will become increasingly urgent.
Decentralisation – farming out the archiving process to the
organisations or divisions that create the data – could be an interim
solution. The decentralisation discussion becomes increasingly
relevant as you begin dealing with companies whose remit is to
gather and preserve huge amounts of data from the public sphere,
such as archives and libraries.
“The problem we are facing is that by the time the large
organisations get some material that they are potentially
interested in, if it is an obsolete technology and it has no
documentation they cannot preserve it even if they want to,” says
Maggie Jones, former executive director of the Digital Preservation
Coalition, an organisation that promotes good practice for digital
archivists.
“We need to start spreading out that responsibility so more
people are responsibly creating digital information.”
For companies trying to preserve increasing volumes of data,
another likely strategy involves triage. Deciding which data to
keep and which to discard helps a company manage the volume of
information residing on its drives.
Companies will need to refer to industry regulations when
determining this, and also refer to the data itself, assessing its
value to the business.
Assuming the physical storage format for a piece of data can be
preserved, there is then the problem of preserving the logical
format. The formats used to produce data by some of today’s
applications may be ubiquitous, but it is difficult to tell if they
will still be readable in 2020, or 2050.
“If you try to read a document written in WordPerfect 3.2, that
is probably very difficult, and that was only 10 years ago,” says
Kahle.
The key to preserving file formats for future use after the
applications are long dead is to use metadata, says Kevin Schurer,
director of the UK Data Archive, which works with the National
Archives to store social science data.
Storing metadata about the data format that you are preserving
enables you to understand it later on. It could give future
archivists the opportunity to create an application that will
consume it or transcribe it into a contemporary format.
“XML holds a lot of promise, in that it is flexible. It is just
rendering things with tag markups,” says Schurer, adding that the
UK Data Archive preserves its information using an XML document
schema developed under the Data Documentation Initiative, an
international effort to define a standard way of storing social
science data.
“We use that because it allows data and metadata to be tagged up
within a common file, so that you can preserve the metadata and
data alongside each other,” he adds. “Another advantage of an XML
type approach is that most XML files are rendered to an ASCII
character set, meaning they are easier to preserve.”
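As a rough illustration of what Schurer describes, the Python sketch below writes metadata and data into a single XML document and encodes it as plain ASCII. The element names (`record`, `metadata`, `field`, `data`, `row`, `value`) and the function `build_record` are invented for this example; they are not the Data Documentation Initiative schema, which defines its own structure.

```python
import xml.etree.ElementTree as ET


def build_record(metadata: dict, rows: list) -> bytes:
    """Wrap metadata and data rows in one XML document, encoded as ASCII."""
    root = ET.Element("record")  # hypothetical element names throughout

    # Metadata: describes the data so a future reader can interpret it.
    meta = ET.SubElement(root, "metadata")
    for name, value in metadata.items():
        field = ET.SubElement(meta, "field", name=name)
        field.text = value

    # Data: stored in the same file, alongside its description.
    data = ET.SubElement(root, "data")
    for row in rows:
        row_el = ET.SubElement(data, "row")
        for name, value in row.items():
            cell = ET.SubElement(row_el, "value", name=name)
            cell.text = str(value)

    # With an ASCII encoding, non-ASCII characters are written as
    # numeric character references rather than raw bytes.
    return ET.tostring(root, encoding="us-ascii")


if __name__ == "__main__":
    document = build_record(
        {"title": "Household survey", "collected": "2005"},
        [{"age": 34, "region": "Leeds"}],
    )
    print(document.decode("ascii"))
```

Because the output is ASCII text with self-describing tags, it can be inspected in any text editor even if the software that produced it is no longer available.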
Using an agreed format is a wise move. Standardisation is an
important tool in any archivist’s arsenal, says Kahle.
When dealing with media such as videos and photos, for example,
it helps to have a published format that everyone uses, such as
MPEG or JPEG, with the encoding and decoding requirements publicly
available so future consumers can work out how to use them. “Even
proprietary formats supported by large companies go unsupported
after a while,” he says.
Sometimes, of course, the popular and the proprietary go hand in
hand. Microsoft has maintained an iron grip on the desktop market
for years, meaning that the majority of documents have been saved
in its proprietary binary formats. Its shift into native XML for
Office 2007 is a step in the right direction.
The company is trying to push the format through international
standards organisation ECMA for standardisation, but the rest of
the industry is working on the Open Document Format, another XML
standard for storing office documents. Microsoft has not co-operated
with standards body Oasis to build support for the Open Document
Format into Office.
The dispute over the formats has been so bitter that it led to
the resignation of Peter Quinn, CIO for the State of Massachusetts.
His decision to adopt the Open Document Format caused such adverse
reactions that observers felt he was put in an untenable
position.
Such rivalries may seem bitter, but they will fade into history,
leaving only a legacy of inconsistent record formats. In 200 years’
time archivists trying to read today’s documents will have more
unravelling to do.
And yet it is that vision, which thinks about the life of data
over hundreds of years rather than decades, that some of today’s
software and storage companies have yet to grasp.
Rose encapsulates the situation in an anecdote. “When they built
the college at Oxford, they used beams that were made from giant
oak trees. When they went to refurbish them 500 years later, they
could not find that type of beam any more. All those trees were
gone from Europe, in terms of lumber yards anyway,” he says.
“The forester found out about this and said ‘but we have the
trees’. It turns out that 500 years ago they had planted the trees
for those very beams. That is the kind of thinking you don’t see
any more.”
But then, in technology, since when hasn’t politics stood in the
way of the greater good?