Until firms can decide which data to store, for what reason and for how long, IT managers will face difficulties in implementing data storage that is simple and cost-effective.
The reason why information lifecycle management has suddenly become the focus of the IT industry is that the amount of data we store is growing at an alarming rate.
About five exabytes (five million terabytes) of new data is produced globally every year, and Hal Varian's researchers at the University of California, Berkeley predict a growth rate of about 30% a year.
Organisations do not store some of this data, such as video and audio, but they do store a great deal of it - website content and e-mail being the main culprits.
Regulation and compliance requirements in the US also mean that companies will have to keep audit trails of changes to data - or at least to important data. Such audit trails are likely to multiply the amount of transactional data we store by a factor of two or more. Regulation and compliance in Europe is likely to follow suit and present the same or similar demands.
Storing this growing amount of data intelligently is going to be problematic, but companies also need to consider that the amount of data they already store is out of control.
Varian's team has estimated that 80% of stored data is replicated, or redundant, or both. This means there is an average of five copies of every single chunk of data.
We can readily acknowledge that there should be two copies (the real one and a back-up) and maybe there should be an average of three, because data often needs to be distributed for the sake of further usage or for the sake of performance. But two of the average five copies are probably redundant.
It is not really surprising. It is common within organisations for data to be left lying on disc somewhere because no one dares to delete it, even though everyone is reasonably convinced it is not required. Unfortunately, there is usually no accurate record of why the data exists, although there is usually some way of knowing the last time it was accessed and when it was created.
Let's add another factor to the mix: 90% of data on disc is seldom accessed after 90 days; in fact, a good deal of it is never accessed after a week. The 90% figure applies to all data, but data held in databases stays in use longer than data held in files, and particularly longer than data in e-mail systems.
There are different options for data storage. Physically these include solid state disc, fast disc, capacity disc, optical disc, near-line tape, far-line tape and non-digital means of storage - the slower the retrieval, the less expensive the option.
Unless you know the speed at which data needs to be made available, it is not possible to organise a sensible flow of data from being instantly available to an archived state. Back-ups are a natural part of this migration as backed-up data also needs to be stored and recovered at a specific speed.
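An age-based tiering policy of the kind described above might look like this minimal sketch. The tier names come from the list of options earlier in the article, but the thresholds and the policy logic are illustrative assumptions, not a real product's behaviour - a real ILM policy would also weigh data type, compliance rules and required retrieval speed:

```python
from datetime import datetime, timedelta

# Ordered from fastest (most expensive) to slowest (cheapest) tier.
# Thresholds are assumptions loosely based on the 7-day and 90-day
# access patterns mentioned in the article.
TIERS = [
    (timedelta(days=7), "fast disc"),        # recently used: instantly available
    (timedelta(days=90), "capacity disc"),   # cooling off: cheaper, still online
    (timedelta(days=365), "near-line tape"), # rarely used: slower retrieval accepted
]

def choose_tier(last_accessed: datetime, now: datetime) -> str:
    """Pick the cheapest tier whose retrieval speed still suits the data's age."""
    age = now - last_accessed
    for threshold, tier in TIERS:
        if age <= threshold:
            return tier
    return "far-line tape"  # archive: cheapest, slowest to retrieve

now = datetime(2004, 1, 1)
print(choose_tier(datetime(2003, 12, 30), now))  # fast disc
print(choose_tier(datetime(2003, 6, 1), now))    # near-line tape
```

The point of such a policy is exactly the one the article makes: without knowing how quickly each piece of data must be retrievable, the thresholds cannot be set sensibly.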
To complicate the situation further, the price of the technology is constantly changing. It is moving agreeably downwards, but the cost equation is still complex. Digital tape, as a back-up medium, is gradually being replaced by disc, which is far more reliable and whose cost per gigabyte is in steep decline. But this needs to be balanced against the fact that most organisations store ever more data. Despite the decline in costs, data storage is usually the most expensive component of datacentre budgets.
The complexity of the situation suggests that the more automated ILM becomes the more practical it will be. The ideal is to move towards products that can monitor data growth and predict what type of extra resource is required and when - and also optimise a cost equation. This will depend on an analysis of the data resource and the setting of policy in line with what is known.
There are no IT suppliers with complete out-of-the-box ILM products yet. But the major storage companies such as EMC, IBM, Hitachi, and StorageTek are all moving in the direction of getting smart about the problem and treating storage as a "virtual" resource. EMC's acquisition of Legato and Documentum last year will bolster its presence in the ILM market.
The ILM problem will not be resolved by the storage suppliers alone, but will ultimately involve the controlled versioning of all data and the attaching of a much richer set of metadata (using XML) to the data itself - so that data of any kind carries a record of who created it, when, where, how and why, and also some indication of its value. This is really the domain of databases, although they are still far from being the natural store for all data.
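A metadata record of the "who, when, where, how and why" kind described above could be sketched as follows. There is no standard ILM metadata schema, so the element names and example values here are hypothetical:

```python
import xml.etree.ElementTree as ET

# Illustrative sketch only: element names (record, creator, created,
# where, how, why, value) are assumptions, not an established schema.
def tag_data(creator, created, where, how, why, value):
    record = ET.Element("record")
    ET.SubElement(record, "creator").text = creator
    ET.SubElement(record, "created").text = created
    ET.SubElement(record, "where").text = where
    ET.SubElement(record, "how").text = how
    ET.SubElement(record, "why").text = why
    ET.SubElement(record, "value").text = value
    return ET.tostring(record, encoding="unicode")

xml = tag_data("j.smith", "2004-01-15", "London office",
               "exported from CRM", "quarterly sales report", "high")
print(xml)
```

With such a record attached, a policy engine could decide retention and tier placement from the metadata alone, rather than from the last-accessed timestamp that is usually all an administrator has to go on today.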
If you are getting the idea that you will be hearing about ILM for many years to come, you are probably right. We are only at the beginning of its lifecycle.
Robin Bloor is chief executive of Bloor Associates
Data management summary
About 40% of data is redundant and a high percentage does not need to be online.
The intelligent archiving of data has started to become imperative, because the cost of holding data in an archive is lower than holding it on disc.
Data that is still being accessed after 90 days is important data, and keeping it available for quick access is vital. So analysing the usage of data proactively, to estimate future usage patterns accurately, is a key requirement.
About 40% of data is redundant
Only 10% of data is accessed after 90 days
Users need to assess how quickly data needs to be accessed
Information lifecycle management has to be automated to be really usable
The industry and users need to work on XML tagging to define the value of data.