Deduplication in itself is easy to understand – optimising storage capacity usage by eliminating duplicated data. However, the devil is in understanding the different technologies, techniques and implementations on the market and relating these to customers' specific needs.
Instead of storing the same data multiple times, deduplication stores it once and uses that single instance as a reference. The techniques used to do this vary, chiefly in whether they work at the file level or the byte level, and in where in the infrastructure they run.
If we look at the first of these – working at a file level rather than a byte level – there are well-established techniques such as CAS (Content Addressable Storage). With this approach the contents of the file are put through a mathematical mincer and the end product is a unique identifier which is attached to the file. If exactly the same file exists somewhere else in the system, the mincer produces exactly the same identifier, indicating a duplicate file which can be reduced to a single instance. The problem is that where files are constantly changing, the saving in storage capacity that CAS can achieve is fairly minimal. So why have it at all? The answer is – for archives, where files are written once and rarely modified.
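The "mathematical mincer" in CAS is typically a cryptographic hash function. Here is a minimal sketch in Python of the idea; `ContentStore` and its methods are illustrative names, not any vendor's actual product, and real CAS systems add metadata, persistence and integrity checking on top.

```python
import hashlib


def content_address(data: bytes) -> str:
    """The 'mincer': identical bytes always yield the same identifier."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """Toy single-instance store keyed by content address."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        addr = content_address(data)
        # A duplicate file produces an address we already hold,
        # so the bytes are stored only once.
        self._blobs.setdefault(addr, data)
        return addr

    def get(self, addr: str) -> bytes:
        return self._blobs[addr]
```

Storing the same file twice returns the same address and consumes no extra capacity – but change a single byte and the whole file hashes to a new address, which is why CAS saves little on volatile data.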
With the second technique, byte-level deduplication, the mathematical mincer changes: this time it looks for differences between files at the byte level. This is an effective approach to minimising the storage capacity consumed, but it does not provide the change tracking that CAS delivers.
On the face of it, this would go a long way towards saving expensive primary storage capacity. However, the reality is that in most primary storage environments the emphasis is on performance rather than saving disk capacity, and any processing overhead is seen as an inhibitor to speed of delivery. Additionally, the lifecycle of primary data can be fleeting – minutes or even seconds – so putting it through deduplication may be an unnecessary step. As a result, today, with a few evolving exceptions, byte-level deduplication is aimed at the backup environment.
Another key question to consider is where in the data centre to implement deduplication. This doesn't sound too important, but it is a raging argument among the vendors in this part of the industry.
Some approaches implement deduplication for backup with a software 'agent' loaded onto each application server that undertakes backup. This spreads the load of the deduplication processing across the processing power of all the servers involved – but the agent must interact correctly and effectively with the existing backup software packages loaded onto those servers. The upside of this deduplication at source is that the process is completed before any data is sent to the storage devices, minimising the data transferred between server and storage. The downside – one encountered by any agent-based strategy – is that the agent must stay compatible with the server software.
The alternative approach is to have a dedicated platform in the backup path which handles deduplication 'on the fly'. This effectively centralises the process. The benefits here are that the platform, not the servers, delivers the processing power for deduplication, and because it requires no changes to the server software, it is effectively transparent to the user.
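The practical difference between the two placements comes down to what crosses the wire. In source deduplication the agent asks the target which chunks it already holds and sends only the new ones; in the inline approach everything is sent and the platform deduplicates on arrival. A toy version of the source-side exchange, with hypothetical names and fixed-size chunks assumed:

```python
import hashlib


def chunk_hashes(data: bytes, size: int = 4096) -> list:
    """Agent side: fingerprint each chunk of the backup stream."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


class BackupTarget:
    """Toy target that tells the agent which chunks it still needs."""

    def __init__(self):
        self.known = {}  # hash -> chunk bytes

    def missing(self, hashes: list) -> list:
        # The agent only transmits chunks the target does not hold,
        # which is what minimises server-to-storage traffic.
        return [h for h in hashes if h not in self.known]

    def upload(self, chunks: dict) -> None:
        self.known.update(chunks)
```

Backing up the same data a second time yields an empty `missing` list, so nothing is transferred – at the cost of the hashing work and the compatibility burden running on every server.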
Whichever approach eventually becomes the dominant implementation, as the data deluge continues to accelerate, deduplication will rapidly become a core element of any data centre’s storage strategy.
David Galton-Fenzi is group sales director at Zycko