Data deduplication technology review

Most users know what data deduplication is. But they're asking lots of questions about how to best deploy the technology in their secondary storage environments.

What is data deduplication? 

Data deduplication reduces the amount of data that needs to be physically stored by eliminating redundant information and replacing subsequent iterations of it with a pointer to the original.

Data deduplication products inspect data down to block- and bit-level and, after the initial occurrence, only the changed data they find is saved. The rest is discarded and replaced with a pointer to the previously saved information. Block- and bit-level deduplication methods are able to achieve compression ratios of 20x to 60x, or even higher, under the right conditions.

There is also file-level deduplication, called single instance storage. In file-level deduplication, if two files are identical, one copy of the file is kept while subsequent copies are not stored. File-level deduplication is not as efficient as block- and bit-level deduplication because even a single changed bit results in a new copy of the whole file being stored. For the purposes of this Special Report, data deduplication is defined as operating at block and bit level.
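The gap between the two approaches can be illustrated with a minimal sketch. It assumes fixed-size blocks and uses SHA-1 only as a convenient way to recognise identical content; the 8-byte `BLOCK_SIZE` and both function names are hypothetical and chosen for readability, not taken from any product.

```python
import hashlib

BLOCK_SIZE = 8  # hypothetical, tiny block size for illustration only


def stored_bytes_file_level(files):
    """Bytes kept when only whole, identical files are deduplicated."""
    seen = {}
    for content in files:
        seen.setdefault(hashlib.sha1(content).digest(), len(content))
    return sum(seen.values())


def stored_bytes_block_level(files):
    """Bytes kept when identical fixed-size blocks are deduplicated."""
    seen = {}
    for content in files:
        for offset in range(0, len(content), BLOCK_SIZE):
            block = content[offset:offset + BLOCK_SIZE]
            seen.setdefault(hashlib.sha1(block).digest(), len(block))
    return sum(seen.values())


original = bytes(range(64))   # 64 bytes made up of 8 distinct blocks
modified = bytearray(original)
modified[10] ^= 0x01          # flip a single bit inside the second block
files = [original, bytes(modified)]

print(stored_bytes_file_level(files))   # both files are kept in full
print(stored_bytes_block_level(files))  # only one extra block is kept
```

With these two 64-byte files, the file-level approach keeps 128 bytes and the block-level approach keeps 72, because only the one changed block is new.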

What practical benefits does data deduplication have? 

Data deduplication's killer app is in backup. It demands too much processor power to be used in primary storage applications.

Data deduplication reduces the amount of data that has to be stored. This means that less media has to be bought and it takes longer to fill up disk and tape. Data can be backed up more quickly to disk, which means shorter backup windows and quicker restores. A reduction in the amount of space taken up in disk systems, VTLs for example, means longer retention periods are possible, bringing quicker restores to end users direct from disk and reducing dependence on tape and its management. Less data also means less bandwidth taken up, which means data deduplication can also speed up remote backup, replication and disaster recovery processes.

What deduplication ratios can be achieved? 

Deduplication ratios vary greatly, according to the type of data being processed and over what period. Data that contains lots of repeated information, such as databases or email, will bring the highest levels of deduplication, with ratios in excess of 30x, 40x or 50x possible in the right circumstances. By the same token, data that contains lots of unique information, such as image files or financial ticker tape, will not contain a great deal of redundancy that can be eliminated.
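To see why the period matters, it helps to work the arithmetic. The sketch below treats the deduplication ratio as data sent for backup divided by data actually kept on disk. The workload figures (30 nightly full backups of 1,000GB with 1% daily change) are purely hypothetical, and the model ignores index and metadata overhead.

```python
def dedup_ratio(logical_bytes: float, physical_bytes: float) -> float:
    """Ratio of data sent for backup to data actually kept on disk."""
    return logical_bytes / physical_bytes


def space_saved_percent(ratio: float) -> float:
    """Share of the logical data that never had to be stored."""
    return (1 - 1 / ratio) * 100


# Hypothetical workload: 30 nightly full backups of a 1,000GB data set,
# with 1% of the data changing between one backup and the next.
full_backup_gb = 1000
nightly_backups = 30
change_rate = 0.01

logical_gb = full_backup_gb * nightly_backups
# The first backup is stored in full; each later one adds only its changes.
physical_gb = full_backup_gb + (nightly_backups - 1) * full_backup_gb * change_rate

ratio = dedup_ratio(logical_gb, physical_gb)
print(f"{ratio:.1f}x, {space_saved_percent(ratio):.1f}% less disk")
```

Under these assumptions the ratio comes out at roughly 23x. The same data set measured after a single backup would show a ratio of 1x, which is why quoted ratios only mean something alongside the retention period and change rate behind them.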

What are the advantages of hardware-based deduplication versus software dedupe? 

Purpose-built deduplication appliances relieve the processing burden associated with software-based data deduplication products. The hardware-based deduplication offerings can also incorporate deduplication into other types of data protection hardware, such as backup appliances, VTLs and NAS.

While software-based deduplication usually eliminates redundancy in data at its source, hardware-based deduplication emphasises data reduction at the storage subsystem. For this reason, hardware-based deduplication may not bring the bandwidth savings that might be gained by deduplicating at source, but compression levels are generally better.

Hardware-based data deduplication brings high performance, scalability and relatively nondisruptive deployment. It is best suited to enterprise-class deployments rather than SME or remote office applications.

Software-based deduplication is typically less expensive to deploy than dedicated hardware and should require no significant changes to the physical network. But software-based deduplication can be more disruptive to install and more difficult to maintain. Lightweight agents are sometimes required on each host system to be backed up, allowing it to communicate with a backup server running the same software. The software will need updating as new versions become available or as each host's operating environment changes over time. Deduplication at the source is also processing-intensive so the host backup server must be configured for the job.

How does inline differ from post-process? 

Data deduplication can be carried out inline or post-process. Inline (or in-band) data deduplication removes redundant data as it is being written to media. The advantage of the inline method is that data passes through only once, being taken in and deduplicated in a single step. But the additional processing this requires can slow throughput and may extend the backup window.

Post-process (or out-of-band) data deduplication is carried out after data has been written to disk. This method does not affect the backup window and sidesteps CPU processing that might create a bottleneck between the backup server and the storage. Post-process deduplication uses more disk space during the data deduplication process because data is ingested in full and only then deduplicated. Disk contention is another possible issue, with disk performance potentially affected as users attempt to access storage during the deduplication process.
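The disk-space difference between the two methods can be shown with a simplified sketch. It models a backup as a list of chunks and reports how many bytes reach disk before any reduction has taken place. The function names are hypothetical, and timing, CPU load and disk contention are not modelled at all.

```python
import hashlib


def chunk_id(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()


def inline_ingest(chunks):
    """Deduplicate each chunk before writing it.

    Returns (store, bytes_written): only unique chunks ever reach disk.
    """
    store = {}
    bytes_written = 0
    for chunk in chunks:
        key = chunk_id(chunk)
        if key not in store:
            store[key] = chunk
            bytes_written += len(chunk)
    return store, bytes_written


def post_process_ingest(chunks):
    """Write everything to a landing area first, then deduplicate it.

    Returns (store, bytes_written): every chunk reaches disk before reduction.
    """
    landing_area = list(chunks)                      # the full backup lands on disk
    bytes_written = sum(len(c) for c in landing_area)
    store = {}
    for chunk in landing_area:                       # a later, separate pass
        store.setdefault(chunk_id(chunk), chunk)
    return store, bytes_written


backup = [b"aaaa", b"bbbb", b"aaaa", b"aaaa"]
inline_store, inline_written = inline_ingest(backup)
post_store, post_written = post_process_ingest(backup)
print(inline_written, post_written)
```

Both paths end with the same deduplicated store. What differs is the amount of disk touched on the way there, which is the extra capacity a post-process system has to set aside.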

It is recommended that you not only test the different deduplication methods to determine how they work in your environment, but also test them against backups of differing size, data types and numbers of streams.

How do deduplication products eliminate redundant data?

Deduplication systems use a variety of methods to eliminate redundant data by inspecting data down to bit level and determining whether it has been stored before.

Hash-based algorithms

Hash-based methods of redundancy elimination process each piece of data using a hash algorithm, such as SHA-1 or MD5. This method generates a number that is, for practical purposes, unique to each piece of data, and this is compared to an index of existing hash numbers. If that hash number already exists on the index, the data need not be stored again. Otherwise, the new hash number is added to the index and the data stored.
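The index lookup described above can be sketched in a few lines using Python's standard hashlib module. The `HashIndex` class and its `write` method are hypothetical names for illustration. A real product would keep the index on disk and store references rather than raw data in memory.

```python
import hashlib


class HashIndex:
    """Toy in-memory index mapping hash numbers to stored pieces of data."""

    def __init__(self, algorithm: str = "sha1"):
        self.algorithm = algorithm
        self.index = {}  # hash number (hex string) -> stored data

    def write(self, piece: bytes) -> bool:
        """Store the piece if its hash is new. Returns True if it was stored."""
        digest = hashlib.new(self.algorithm, piece).hexdigest()
        if digest in self.index:
            return False          # already on the index: nothing is stored again
        self.index[digest] = piece
        return True


index = HashIndex()
results = [index.write(p) for p in (b"invoice", b"report", b"invoice")]
print(results, len(index.index))
```

The third write is recognised as a repeat of the first, so only two pieces of data end up stored.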

SHA-1 was originally devised to create cryptographic signatures for security applications. SHA-1 creates a 160-bit value that is statistically unique for each piece of data.

MD5 is a 128-bit hash that was also designed for cryptographic purposes.

Hash collisions occur when two different chunks produce the same hash. The chances of this are very slim indeed, but SHA-1 is considered the more secure of the two algorithms.
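How slim those chances are can be estimated with the standard birthday bound, sketched below. It assumes hash outputs are evenly distributed, so it covers accidental collisions only and says nothing about deliberately constructed ones. The chunk count of 2^40 is a hypothetical figure chosen for illustration.

```python
from fractions import Fraction


def collision_probability_bound(num_chunks: int, hash_bits: int) -> Fraction:
    """Upper bound on the chance that any two of num_chunks chunks collide.

    Birthday (union) bound: P <= n * (n - 1) / 2^(bits + 1),
    assuming evenly distributed hash values.
    """
    return Fraction(num_chunks * (num_chunks - 1), 2 ** (hash_bits + 1))


chunks = 2 ** 40  # hypothetical: about a trillion stored chunks

sha1_bound = collision_probability_bound(chunks, 160)
md5_bound = collision_probability_bound(chunks, 128)
print(f"SHA-1: {float(sha1_bound):.1e}  MD5: {float(md5_bound):.1e}")
```

Because SHA-1 values are 32 bits longer than MD5 values, the bound for SHA-1 is about 2^32 times smaller for the same number of chunks.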

Bit-level comparison

The best way to compare two chunks of data is to perform a bit-level comparison on the two blocks. The cost involved in doing this is the I/O required to read and compare them.
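A bit-level comparison is often paired with a cheaper first check so that only likely matches have to be read back. The sketch below uses a deliberately weak, hypothetical fingerprint (the sum of the bytes modulo 256) so that false matches are easy to produce, then lets a byte-for-byte comparison make the final decision.

```python
def weak_fingerprint(chunk: bytes) -> int:
    """Deliberately collision-prone fingerprint, for illustration only."""
    return sum(chunk) % 256


def store_chunk(chunk: bytes, index: dict) -> bool:
    """Return True if the chunk was stored as new data, False if it was a
    duplicate confirmed by direct comparison with a stored chunk."""
    candidates = index.setdefault(weak_fingerprint(chunk), [])
    for existing in candidates:
        if existing == chunk:      # direct comparison: the I/O cost lies here
            return False
    candidates.append(chunk)
    return True


index = {}
outcomes = [store_chunk(c, index) for c in (b"ab", b"ba", b"ab")]
print(outcomes)
```

Here `b"ab"` and `b"ba"` share a fingerprint but differ in content, so both are kept. Only the repeated `b"ab"` is treated as redundant.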

Custom methods

Some vendors use custom methods to identify duplicate data, such as their own hash algorithm combined with other methods. For instance, Diligent and Sepaton use a custom method to identify redundancy and follow that with bit-level comparison.

What is the difference between source deduplication and target deduplication? 

Data can be deduplicated at the target or source. Deduplicating at the target means you can use your current backup software and the backup system operates as usual. The target identifies and eliminates redundant data sent by the backup system.

Deduplication at the source involves installing backup client software from the deduplication vendor. The client communicates with a backup server running the same software and, if the client and server agree that data has already been stored, it is not sent, saving disk space and network bandwidth.
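The client-server exchange might look something like the following sketch. The `BackupServer` class, its `missing` and `upload` methods and the `backup_from_source` function are all hypothetical. The network is replaced by direct method calls, and the small cost of sending the hashes themselves is not counted.

```python
import hashlib


class BackupServer:
    """Stand-in for the backup server: keeps chunks keyed by their hash."""

    def __init__(self):
        self.chunks = {}

    def missing(self, digests):
        """Tell the client which of the offered hashes are not yet stored."""
        return [d for d in digests if d not in self.chunks]

    def upload(self, digest, chunk):
        self.chunks[digest] = chunk


def backup_from_source(chunks, server) -> int:
    """Send only the chunks the server lacks. Returns bytes of chunk data sent."""
    by_digest = {hashlib.sha1(c).hexdigest(): c for c in chunks}
    bytes_sent = 0
    for digest in server.missing(list(by_digest)):
        server.upload(digest, by_digest[digest])
        bytes_sent += len(by_digest[digest])
    return bytes_sent


server = BackupServer()
monday = backup_from_source([b"mon-data", b"shared"], server)
tuesday = backup_from_source([b"tue-data", b"shared"], server)
print(monday, tuesday)
```

On the second run the `b"shared"` chunk is never transmitted, which is where the bandwidth saving comes from.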

How does a deduplication device record the existence of redundant data?

Once a deduplication device has identified a redundant piece of data, it has to decide how to record its existence. There are two ways it can do so.

  1. Reverse referencing, which creates a pointer to the original instance of the data when additional identical pieces of data occur.
  2. Forward referencing, which writes the latest version of the piece of data to the system, then makes the previous occurrence a pointer to the most recent.
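The two approaches can be contrasted with a simplified model in which storage is a list of slots, each holding either data or a pointer to another slot. All names here are hypothetical, and the sketch makes no claim about how any vendor implements referencing.

```python
def record_reverse(slots: list, chunk: bytes) -> None:
    """Reverse referencing: a repeat becomes a pointer to the original copy."""
    for i, (kind, value) in enumerate(slots):
        if kind == "data" and value == chunk:
            slots.append(("ptr", i))
            return
    slots.append(("data", chunk))


def record_forward(slots: list, chunk: bytes) -> None:
    """Forward referencing: the newest copy is written as data and the
    previous occurrence is turned into a pointer to it."""
    new_index = len(slots)
    for i, (kind, value) in enumerate(slots):
        if kind == "data" and value == chunk:
            slots[i] = ("ptr", new_index)
            break
    slots.append(("data", chunk))


def resolve(slots: list, i: int) -> bytes:
    """Follow pointers from slot i until actual data is reached."""
    kind, value = slots[i]
    while kind == "ptr":
        kind, value = slots[value]
    return value


reverse_slots, forward_slots = [], []
for chunk in (b"x", b"y", b"x"):
    record_reverse(reverse_slots, chunk)
    record_forward(forward_slots, chunk)

print(reverse_slots)
print(forward_slots)
```

In this model the most recent write is always stored as data under forward referencing, whereas under reverse referencing it may be a pointer that has to be followed back to an older location.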

Vendors argue that restore times differ between the two methods. For example, Sepaton claims its forward referencing method provides quicker restores.

How does encryption affect data deduplication?

Deduplication works by eliminating redundant files, blocks or bits, and encryption turns data into a data stream that is random by its nature. Therefore, if you encrypt data first -- that is, effectively randomise it and remove similar patterns -- it may be impossible to deduplicate it. So you may find that data should be deduplicated first and then encrypted.
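The effect is easy to demonstrate. The sketch below uses a toy stream cipher built from SHA-256 with a random nonce per chunk. It is a hypothetical stand-in for a real encryption product, is not secure and should not be used to protect data. It is there only to show that identical chunks stop looking identical once encrypted.

```python
import hashlib
import os


def toy_encrypt(chunk: bytes, key: bytes) -> bytes:
    """Toy stream cipher for illustration only; NOT secure."""
    nonce = os.urandom(16)                 # fresh random value for every chunk
    stream = b""
    counter = 0
    while len(stream) < len(chunk):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    # XOR the chunk with the keystream and prefix the nonce
    return nonce + bytes(a ^ b for a, b in zip(chunk, stream))


def unique_chunks(chunks) -> int:
    """Number of distinct chunks a hash-based deduplicator would have to store."""
    return len({hashlib.sha1(c).digest() for c in chunks})


plain = [b"same block of data"] * 100
encrypted = [toy_encrypt(c, b"secret") for c in plain]

print(unique_chunks(plain))      # redundancy is visible before encryption
print(unique_chunks(encrypted))  # and hidden after it
```

Deduplicating the plain chunks leaves a single copy to store, while the encrypted chunks all look different, so nothing can be eliminated.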

Table: Data deduplication product review

Columns: Vendor; H/w or s/w?; VTL, NAS etc?; Algorithm used?; Inline or post-process?; Source or target?
Data Domain
See Exagrid
- - - -
VTL, NAS, SAN attached
SHA-1 and MD5
EMC/Avamar S/w
- SHA-1 and MD5
- Post-process
SHA-1 with optional MD5
See Avamar
- - - -
Hitachi Data Systems (HDS)
See Diligent and Exagrid - - - -
S/w (in OS)
Overland Storage
Pillar Data Systems
See Data Domain, Diligent, Falconstor, Symantec - - - -
Both VTL and NAS MD5 Both Target
S/w VTL Custom Post-process Target
Spectra Logic
See Falconstor - - - -
See Falconstor - - - -
Symantec S/w - SHA-1 Inline Source

