Test data deduplication tools with these five guidelines

Here are five critical aspects to consider when testing data deduplication tools for real-world results after your rollout.

Get insights into the top considerations while implementing data deduplication technology in the second part of this multi-part series on data deduplication.

Almost every system integrator is pushing data deduplication tools into customer environments, and customers are willingly adopting these solutions for their various benefits. Once deployed, data deduplication tools can be tested using the following pointers:


  • The deduplication ratio in an environment depends primarily on the data retention period, the data change rate, and the backup policy. The lower the data change rate, the higher the deduplication ratio. Full backup policies also yield higher dedupe ratios than incremental ones; hence, to maximize deduplication, modify backup policies to schedule full backups instead of incrementals.
  • The data source affects the performance of data deduplication tools, because backup applications have different design structures that add overhead to the deduplication process. You can use utilities such as dd or Xwrite to generate data to feed to the dedupe tools when testing their performance. However, keep in mind that the numbers obtained with these utilities are inflated, often as much as three times higher than with real backup applications.
  • Data set size also affects performance: small data sets are easier to process and add little overhead to the system. Hence, when testing data deduplication tools, use data sets of varying sizes to obtain reasonably accurate performance results. Backup server hardware often limits the deduplication tool to sub-optimal rates; upgrade the hardware if you need higher dedupe throughput.
  • Different vendors use different calculations for displaying dedupe ratios. Ask your vendor how its dedupe ratios are calculated: a vendor advertising a 300:1 ratio may not actually offer better deduplication than one advertising 30:1. The vendor claiming 300:1 may have measured the ratio over a short period on data that changed very little during that window. For instance, if vendor A backs up 500 GB of data on the first day and backs up the same data on the second day with only a small change (say 2 GB), only 2 GB of new data is stored. Hence, the inflated dedupe ratios shown by vendors are practically unattainable in real environments.
  • You can also use the following generic method to calculate dedupe ratios for data deduplication tools: divide the total amount of data sent to the storage system by backup (or any other) applications by the total raw physical storage on the system, including user data, metadata, and spare capacity.
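The vendor-ratio caveat above can be illustrated with a quick calculation. This is a minimal sketch of the 500 GB scenario; the function name and the assumption that only changed data consumes new physical storage are illustrative, not from any vendor's actual method:

```python
def dedupe_ratio(data_sent_gb, data_stored_gb):
    """Ratio of logical data sent to physical data actually stored."""
    return data_sent_gb / data_stored_gb

# Day 1: 500 GB full backup, all of it new.
# Day 2: another 500 GB full backup, but only 2 GB changed.
day2_only = dedupe_ratio(500, 2)               # 250.0 -- the "headline" number
cumulative = dedupe_ratio(500 + 500, 500 + 2)  # ~1.99 -- the realistic number
print(day2_only, round(cumulative, 2))
```

Measuring only the second day yields a spectacular 250:1, while the cumulative ratio over both days is barely 2:1 — which is why a short measurement window can make almost any product look remarkable.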


Dedupe ratio = Total data sent by the application to the storage system ÷ Total raw storage capacity of the storage system


For instance, if a system has 80 TB of physical storage (inclusive of metadata, user data and spare capacity) and has been sent 800 TB of data by backup or other applications, the deduplication ratio for the system is 800 TB divided by 80 TB, or 10:1. Thus, dedupe technology gives the user the advantage of storing 10 times more information on the same physical storage system.
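The system-wide calculation above can be sketched in a few lines. The function name is hypothetical; the arithmetic is exactly the formula from the text:

```python
def system_dedupe_ratio(data_sent_tb, raw_capacity_tb):
    """System-wide dedupe ratio: total data sent by applications divided by
    total raw physical capacity (user data + metadata + spare capacity)."""
    return data_sent_tb / raw_capacity_tb

# The worked example: 800 TB of data sent to 80 TB of raw storage.
print(system_dedupe_ratio(800, 80))  # 10.0, i.e. a 10:1 ratio
```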

To wrap up, the most important aspect to consider when testing data deduplication tools for actual results in your environment is that dedupe ratios depend directly on the data being backed up. If you are backing up file server data, user data, email data or virtual server environments, you may get very high dedupe ratios, in the range of 90:1. For database backups, the dedupe ratios are comparatively lower, as the data is less redundant.


About the author: Anuj Sharma is an EMC Certified and NetApp accredited professional. Sharma has experience handling implementation projects related to SAN, NAS and BURA. One of his articles, titled Best of EMC Networker, was published globally by EMC during last year's EMC World held in Orlando, US.
