Until firms can decide which data to store, for what reason and
for how long, IT managers will face difficulties in implementing
data storage that is easy and cost effective.
Information lifecycle management (ILM) has suddenly become the focus
of the IT industry because the amount of data we store is growing at
an alarming rate.
About five exabytes (five million terabytes) of new data is
produced globally every year, and the rate of growth predicted by
Hal Varian's researchers at the University of California, Berkeley,
is about 30%.
Although many organisations do not store some of this data, such as
video and audio, a great deal of it is stored - website content and
e-mail being the main culprits.
Regulation and compliance requirements in the US also mean that
companies will have to keep audit trails of changes to data - or at
least important data. Such audit trails are likely to multiply the
amount of transactional data we store by a factor of two or more.
Regulation and compliance issues in Europe are likely to follow
suit and present the same or similar demands.
Data copying
Storing this increasing amount of data in an intelligent way is
going to be problematic, but companies also need to recognise that
the amount of data they already store is out of control.
Varian's team has estimated that 80% of stored data is replicated,
or redundant, or both. This means there is an average of five
copies of every single chunk of data.
We can readily acknowledge that there should be two copies (the
real one and a back-up) and maybe there should be an average of
three, because data often needs to be distributed for the sake of
further usage or for the sake of performance. But two of the
average five copies are probably redundant.
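The arithmetic behind these figures can be checked with a short sketch. It assumes, as a simplification, that the 80% figure counts every copy beyond the first as replicated, and that three copies per item are genuinely needed. The function names are illustrative and do not come from Varian's study.

```python
def average_copies(replicated_fraction: float) -> float:
    """Average number of copies per item when the given fraction of
    stored data consists of duplicates of the remaining originals."""
    if not 0 <= replicated_fraction < 1:
        raise ValueError("replicated_fraction must be in [0, 1)")
    return 1.0 / (1.0 - replicated_fraction)


def redundant_fraction(copies: float, needed: float) -> float:
    """Share of stored data that is surplus if only `needed` copies
    of each item are justified."""
    if copies <= 0:
        raise ValueError("copies must be positive")
    return max(copies - needed, 0.0) / copies


copies = average_copies(0.80)                 # the article's 80% figure
surplus = redundant_fraction(copies, needed=3)  # assumed: three useful copies
print(f"average copies: {copies:.1f}")
print(f"redundant share: {surplus:.0%}")
```

Under these assumptions two copies in five are surplus, which is roughly 40% of everything stored.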
It is not really surprising. It is common within organisations for
data to be left lying on disc somewhere because no one dares to
delete it, even though everyone is reasonably convinced it is not
required. Unfortunately, there is usually no accurate record of why
the data exists, although there is usually some way of knowing the
last time it was accessed and when it was created.
Let's add another factor to the mix: 90% of data on disc is seldom
accessed after 90 days; in fact, a good deal of it is never
accessed after a week. The 90% figure applies to all data, but
data held in databases gets used for longer than data held in
files, and particularly longer than data in e-mail systems.
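Since the last access time is usually recorded, a first pass at finding archive candidates can be as simple as the sketch below. It is a minimal illustration, not a product: the function name and the 90-day default are assumptions taken from the figure above, and it relies on the file system actually maintaining access times, which some systems are configured not to do.

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def find_stale_files(root, max_idle_days=90, now=None):
    """Yield (path, idle_days) for every file under `root` whose last
    recorded access is more than `max_idle_days` ago."""
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.stat()
            except OSError:
                continue  # file vanished or is unreadable; skip it
            idle_days = (now - info.st_atime) / SECONDS_PER_DAY
            if idle_days > max_idle_days:
                yield path, idle_days
```

Running `find_stale_files` over a file share and summing the sizes of what it returns gives a rough measure of how much disc space is occupied by data nobody has touched in three months.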
Storage media
There are different options for data storage. Physically these can
include solid state disc, fast disc, capacity disc, optical disc,
near-line tape, far-line tape and non-digital means of storage -
with the options getting less expensive as the speed of retrieval
falls.
Unless you know the speed at which data needs to be made available,
it is not possible to organise a sensible flow of data from being
instantly available to an archived state. Back-ups are a natural
part of this migration as backed-up data also needs to be stored
and recovered at a specific speed.
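The link between required retrieval speed and choice of medium can be sketched as a simple lookup. The tier names below follow the list above, but every retrieval time and cost figure is a made-up placeholder chosen only to preserve the ordering; real values would have to come from an organisation's own equipment and price lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    retrieval_seconds: float  # placeholder: typical wait before data is usable
    relative_cost: float      # placeholder: unitless cost per gigabyte


# Hypothetical figures; only the ordering (faster means dearer) matters.
TIERS = [
    Tier("solid state disc", 0.001, 100.0),
    Tier("fast disc", 0.01, 30.0),
    Tier("capacity disc", 0.05, 10.0),
    Tier("optical disc", 30.0, 4.0),
    Tier("near-line tape", 120.0, 2.0),
    Tier("far-line tape", 86_400.0, 1.0),
]


def cheapest_tier(max_wait_seconds, tiers=TIERS):
    """Return the least expensive tier that can still deliver data
    within the acceptable wait."""
    candidates = [t for t in tiers if t.retrieval_seconds <= max_wait_seconds]
    if not candidates:
        raise ValueError("no tier is fast enough for this requirement")
    return min(candidates, key=lambda t: t.relative_cost)


print(cheapest_tier(1).name)      # data needed within a second
print(cheapest_tier(3600).name)   # data that can wait up to an hour
```

The point of the sketch is that the function cannot return anything useful until someone supplies `max_wait_seconds`, which is exactly the information most organisations lack.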
To complicate the situation further, the price of the technology is
constantly changing. It is moving agreeably downwards, but the cost
equation is still complex. Digital tape, as a back-up medium, is
gradually being replaced by disc, as disc is a far more reliable
medium and the cost per gigabyte is in steep decline. But this
needs to be balanced against the fact that most organisations store
ever more data. Data storage is usually the most expensive
component of datacentre budgets, despite the decline in costs.
The complexity of the situation suggests that the more automated
ILM becomes, the more practical it will be. The ideal is to move
towards products that can monitor data growth, predict what type
of extra resource is required and when, and optimise the cost
equation. This will depend on an analysis of the data resource and
the setting of policy in line with what is known.
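The prediction part of this can be illustrated with compound growth. The sketch below assumes storage use grows at a steady annual rate; the 30% rate is borrowed from the global figure quoted earlier and the capacities are hypothetical, so a real forecast would use an organisation's own measured growth.

```python
import math


def projected_usage(used_tb, annual_growth, years):
    """Storage in use after `years` of steady compound growth."""
    return used_tb * (1.0 + annual_growth) ** years


def years_until_full(used_tb, capacity_tb, annual_growth):
    """Years before usage reaches capacity, assuming steady compound
    growth. Returns 0.0 if the store is already full."""
    if used_tb <= 0 or capacity_tb <= 0 or annual_growth <= 0:
        raise ValueError("all arguments must be positive")
    if used_tb >= capacity_tb:
        return 0.0
    return math.log(capacity_tb / used_tb) / math.log(1.0 + annual_growth)


# Hypothetical store: 50TB used out of 100TB, growing 30% a year.
remaining = years_until_full(50, 100, 0.30)
print(f"capacity exhausted in about {remaining:.1f} years")
```

Even a half-empty store fills in well under three years at that rate, which is why the timing of new purchases matters as much as their size.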
Tackling ILM
There are no IT suppliers with complete out-of-the-box ILM products
yet. But the major storage companies such as EMC, IBM, Hitachi, and
StorageTek are all moving in the direction of getting smart about
the problem and treating storage as a "virtual" resource. EMC's
acquisition of Legato and Documentum last year will bolster its
presence in the ILM market.
The ILM problem will not be resolved by the storage suppliers
alone, but will ultimately involve the controlled versioning of all
data and the attaching of a much richer set of metadata (using
XML) to the data itself - so that data of any kind has a record of who
created it, when, where, how and why, and also some indication of
its value. This is really the domain of databases, although they
are still far from being the natural store for all data.
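What such a metadata record might look like can be sketched with Python's standard XML library. No agreed schema exists for this, so the element names below are invented for illustration, as are the sample values.

```python
import xml.etree.ElementTree as ET


def build_metadata(created_by, created_at, location, method, purpose, value):
    """Return an XML string recording who created a piece of data,
    when, where, how and why, plus an indication of its value.
    Element names are hypothetical, not part of any standard."""
    record = ET.Element("datarecord")
    fields = (
        ("creator", created_by),
        ("created", created_at),
        ("location", location),
        ("method", method),
        ("purpose", purpose),
        ("value", value),
    )
    for tag, text in fields:
        ET.SubElement(record, tag).text = text
    return ET.tostring(record, encoding="unicode")


xml_text = build_metadata(
    created_by="finance-batch",          # sample values only
    created_at="2004-03-01T09:30:00",
    location="london-datacentre",
    method="nightly ledger export",
    purpose="statutory audit trail",
    value="high",
)
print(xml_text)
```

A record like this, kept alongside the data, would answer the question that currently stops anyone deleting anything: why does this data exist?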
If you are getting the idea that you will be hearing about ILM for
many years to come, you are probably right. We are only at the
beginning of its lifecycle.
Robin Bloor is chief executive of Bloor Associates
Data management summary
About 40% of data is redundant and a high percentage does not
need to be online.
The intelligent archiving of data has started to become
imperative, because the cost of holding data in an archive is lower
than holding it on disc.
Data that is accessed after 90 days is important data and
keeping it available for quick access is vital. So analysing the
usage of data in a proactive manner to accurately estimate future
usage patterns is a key requirement.
About 40% of data is redundant
Only 10% of data is accessed after 90 days
Users need to assess how quickly data needs to be accessed
Information lifecycle management has to be automated to be
really usable
The industry and users need to work on XML tagging to define the
value of data.