Data deduplication: The real benefits
With data volumes exploding, technologies that offer better utilisation of resources are always attractive. Data deduplication is one such technology, enabling better utilisation of both storage devices and network bandwidth. Its benefits fall into two broad areas:
- Equipment savings: fewer disk purchases, reduced floor space and energy requirements, and improved performance across the network.
- Application savings: in backup, email and other data management applications.
Deduplication ratios vary widely, depending on data streams, data volumes and data lifecycles. The size of the metadata relative to the data must also be taken into account, since it affects the achievable ratio, which can range from 4:1 to 200:1 or more. Performance is another consideration: the system must calculate hashes, update the metadata and store or locate the data chunks. Finally, thought needs to be given to the scalability of the system and the time taken to reconstruct the data.
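To make the hashing and metadata steps concrete, here is a minimal Python sketch of the hash-and-lookup cycle that underlies most deduplication engines. The fixed 4 KB chunk size and SHA-256 fingerprints are illustrative assumptions, not any particular product's design.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed-size chunking; many products use variable-size chunks

class ChunkStore:
    """Minimal content-addressed store: each unique chunk is kept exactly once."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes (the deduplicated data pool)

    def put(self, data: bytes) -> list:
        """Chunk the data, store unseen chunks, return the fingerprint list (the metadata)."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()  # the hashing step
            if fp not in self.chunks:               # metadata lookup: store or skip
                self.chunks[fp] = chunk
            recipe.append(fp)
        return recipe

    def get(self, recipe: list) -> bytes:
        """Reconstruct the original data from its fingerprint list."""
        return b"".join(self.chunks[fp] for fp in recipe)

store = ChunkStore()
data = b"A" * CHUNK_SIZE * 50              # 50 identical chunks of highly repetitive data
recipe = store.put(data)
stored = sum(len(c) for c in store.chunks.values())
assert store.get(recipe) == data           # reconstruction round-trips correctly
print(f"{len(data)} logical bytes held in {stored} physical bytes: {len(data) // stored}:1")
```

Note that the fingerprint list is exactly the metadata the article refers to: without it, the stored chunks cannot be reassembled into the original data.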
Applied to backup systems, data deduplication typically yields compression of 3-5 times on the first full backup. Incremental file backups can achieve 5-8 times, and subsequent full backups 50-60 times. Aggregated over a backup cycle, this amounts to compression of 20 times or more (less than 5% of the original storage capacity) compared with traditional disk and tape storage.
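The aggregate figure can be sanity-checked by weighting each backup type by the volume it contributes. The retention schedule and job sizes in the sketch below are hypothetical assumptions chosen only to illustrate the arithmetic; the per-job ratios come from the ranges above.

```python
# Hypothetical one-year schedule: (logical GB, dedup ratio) per backup job.
jobs = (
    [(1000, 4)]             # initial full backup: ~3-5x
    + [(1000, 55)] * 51     # subsequent weekly fulls: ~50-60x
    + [(20, 6)] * 312       # daily incrementals, 6 per week: ~5-8x
)

logical = sum(size for size, _ in jobs)               # capacity a traditional store would need
physical = sum(size / ratio for size, ratio in jobs)  # capacity the deduplicated store needs

print(f"{logical} GB protected in {physical:.0f} GB: "
      f"aggregate ratio {logical / physical:.0f}:1")
```

With these assumed numbers the schedule protects 58,240 GB in roughly 2,217 GB, an aggregate ratio of about 26:1, consistent with the "20 times or more" figure.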
Data deduplication can be performed at different points in the system architecture.
- At the client: Deduplicating before data leaves the client reduces the load on the central server and cuts bandwidth requirements; a minimal sketch of this approach follows the list.
- At the server: Deduplication is usually performed in the data storage layer and operates transparently to the clients.
- Block storage array: Here it operates transparently to both clients and servers, and is usually applied to a selected set of volumes.
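As a rough illustration of the client-side option, the Python sketch below has the client exchange chunk fingerprints with the server so that only unseen chunks cross the network. The `Server` class and its `missing`/`store` calls are hypothetical stand-ins for a real backup protocol.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size, for illustration only

def fingerprints(data: bytes) -> dict:
    """Map each chunk's SHA-256 fingerprint to the chunk itself."""
    return {
        hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest(): data[i:i + CHUNK_SIZE]
        for i in range(0, len(data), CHUNK_SIZE)
    }

class Server:
    """Hypothetical stand-in for the central backup server."""
    def __init__(self):
        self.chunks = {}

    def missing(self, fps: list) -> set:
        """Report which fingerprints the server has never seen."""
        return {fp for fp in fps if fp not in self.chunks}

    def store(self, new_chunks: dict):
        self.chunks.update(new_chunks)

def client_backup(server: Server, data: bytes):
    chunks = fingerprints(data)
    needed = server.missing(list(chunks))             # one small round trip of hashes
    server.store({fp: chunks[fp] for fp in needed})   # only unseen chunks cross the network
    print(f"{len(chunks)} chunks, {len(needed)} sent")

server = Server()
client_backup(server, b"report " * 2000)   # first backup: all chunks are new
client_backup(server, b"report " * 2000)   # unchanged data: nothing is sent
```

The bandwidth saving comes from the second call: the client learns from the fingerprint exchange that the server already holds every chunk, so no data travels at all.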
Deduplication to tape is impractical, since data chunks must be retrievable quickly and at random, which tape cannot provide. Keeping final, fully reconstructed copies on tape is still appropriate for disaster recovery planning, however. Such a plan must also cover the metadata, since the data chunks are of little value without the underlying structure the metadata describes.
Consideration should also be given to how frequently data will need to be reconstructed from the deduplicated store. Applying the technology to frequently accessed data, for example, creates reconstruction overheads that could impact service levels.
Data deduplication is appearing in many forms. Virtual tape libraries, archive storage, disk storage systems, and applications such as email systems, content managers and backup systems are all examples of where it can be applied, enabling more data to be stored on disk or other fast-access devices.
Whether the need is to improve return on investment or to address environmental concerns, data deduplication has a significant role to play.
About the author: Hamish E. Macarthur is the founder of Macarthur Stroud International, a UK-based IT consultancy. He is also co-author of "How to Market Computer and Office Systems", published by Macmillans, and has been a regular contributor to industry publications and a speaker at major conferences.