Data deduplication: The real benefits

With data volumes exploding, technologies that offer better utilisation of resources are always attractive. Data deduplication is one such technology that enables better utilisation of both storage devices and network bandwidth.

The savings can be achieved in various ways:
  • Equipment savings: fewer disk purchases, reduced floor space and energy requirements, and improved performance across the network
  • Application savings: backup, email and other data management applications transfer and store less data.

Data deduplication works by breaking data objects or data streams into "chunks" so that duplicate data can be identified. A unique identifier is calculated for each chunk using a hashing function such as MD5 or SHA-1. Each chunk's identifier is then compared against an index of identifiers to determine whether the chunk already exists in the data store. All chunks of data must be managed within the system; consequently, most systems have an underlying file system to store the data. Metadata associated with each chunk must also be maintained, including how often the chunk is referenced, which disk blocks hold it, the file name and so on.
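The chunk-hash-lookup cycle described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming fixed-size chunks, SHA-256 and in-memory dictionaries; real systems typically use variable-size (content-defined) chunking and a persistent index, and the function names here are invented for the sketch.

```python
import hashlib

def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split a byte stream into fixed-size chunks and store each
    unique chunk once, keyed by its SHA-256 digest."""
    store = {}      # digest -> chunk bytes (the data store)
    refcount = {}   # digest -> reference count (part of the metadata)
    recipe = []     # ordered digests needed to reconstruct the stream
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:       # index lookup for an existing chunk
            store[digest] = chunk     # new chunk: store it once
        refcount[digest] = refcount.get(digest, 0) + 1
        recipe.append(digest)
    return store, refcount, recipe

def reconstruct(store, recipe) -> bytes:
    """Rebuild the original stream from its chunk recipe."""
    return b"".join(store[d] for d in recipe)
```

A stream containing three identical 4 KB chunks and one distinct chunk would be stored as just two chunks plus the recipe, which is where the capacity saving comes from.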

Deduplication ratios can vary widely, depending on data streams, data volumes and data lifecycles. The size of the metadata in relation to the data must also be taken into consideration, as it directly affects the achievable ratios, which can range from 4:1 to 200:1 and more. Performance is another consideration: time is needed to calculate the hashes, update the metadata and store or locate the data. In all, consideration needs to be given to the scalability of the system and the time taken to reconstruct the data.
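As a rough illustration of how per-chunk metadata eats into the headline ratio, consider a back-of-the-envelope calculation. All figures here, including the 64-byte metadata entry per chunk, are hypothetical assumptions chosen for the sketch:

```python
def effective_ratio(logical_gb, unique_gb, chunk_kb, meta_bytes_per_chunk=64):
    """Deduplication ratio after adding per-chunk metadata overhead.
    The 64-byte index/reference-count entry is an illustrative assumption."""
    n_chunks = unique_gb * 1024 ** 2 / chunk_kb              # stored chunks
    meta_gb = n_chunks * meta_bytes_per_chunk / 1024 ** 3    # metadata footprint
    return logical_gb / (unique_gb + meta_gb)

# 10,000 GB of logical data reduced to 400 GB of unique chunks (raw 25:1):
print(round(effective_ratio(10_000, 400, chunk_kb=8), 1))
print(round(effective_ratio(10_000, 400, chunk_kb=1), 1))
```

Smaller chunks tend to find more duplicates, but the metadata grows in proportion to the chunk count, so the effective ratio drops even when the unique data is unchanged.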

Considering data deduplication with backup systems, the typical gain on the first full backup is 3-5 times compression. Incremental file backups can achieve compression of 5-8 times, and subsequent full backups can achieve ratios of 50-60 times. Aggregated over a backup cycle, compression of 20 times or more (less than 5% of the original storage capacity) can be achieved. These gains are compared with traditional disk and tape storage.
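The aggregation works out as simple arithmetic. The retention schedule and per-backup ratios below are hypothetical, chosen only to show how the aggregate ratio climbs as retention grows:

```python
# Hypothetical retention: one first full, 12 weekly fulls, 60 daily
# incrementals (all figures are assumptions for illustration).
full_gb, incr_gb = 1000, 50

logical = full_gb + 12 * full_gb + 60 * incr_gb
stored = (full_gb / 4           # first full at ~4:1
          + 12 * full_gb / 50   # subsequent fulls at ~50:1
          + 60 * incr_gb / 6)   # incrementals at ~6:1

print(f"{logical} GB logical, {stored:.0f} GB stored, "
      f"aggregate {logical / stored:.1f}:1")
```

Each additional full backup is stored almost entirely as references to existing chunks, so extending the retention period pushes the aggregate ratio higher still.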

Data deduplication can be completed at different points in the system architecture.

  • At the client: This reduces the load on the central server and reduces bandwidth requirements.
  • At the server: Data deduplication is usually completed in the data storage layer, and it operates transparently to the clients.
  • Block storage array: Here, it operates transparently to the clients and servers and is usually done for a selected set of volumes.

But there are other considerations when looking to reduce data storage capacities. Data compression can be used, but to be effective it must be applied after any deduplication, since compressing first alters the byte patterns that chunk matching relies on. Encryption must also be considered to secure large volumes of data, and if all three processes are to be used, it must come last: encrypted data appears random, so it can be neither deduplicated nor compressed effectively.

Deduplication with tape is also impractical since the location of data chunks must be available quickly and randomly. Yet, keeping all final data copies on tape is still appropriate for disaster recovery planning. This must also consider the metadata, since the data chunks are of little value without the underlying structure described by the metadata.

Consideration should also be given to how frequently data will need to be reconstructed from its deduplicated form. Applying this technology to frequently accessed data, for example, creates reconstruction overheads that could impact service levels.

Data deduplication is appearing in many forms. Virtual tape libraries, archive storage, disk storage systems, and applications such as email systems, content managers, backup systems and more, are examples of where data deduplication can be applied. It enables more data to be stored on disk or fast access devices.

Whether the need is to improve return on investment on systems or to address the environmental issues, data deduplication has a significant role to play.

About the author: Hamish E. Macarthur is the founder of Macarthur Stroud International, a UK-based IT consultancy. He is also co-author of "How to Market Computer and Office Systems", published by Macmillans, and has been a regular contributor to industry publications and a speaker at major conferences.
