13 data deduplication optimization guidelines

Effective data deduplication is important for reducing storage costs. Follow these 13 tips for data deduplication optimization in your organization.

Almost every vendor providing data backup solutions now also offers data deduplication products, which may be either hardware- or software-based. At the same time, almost every organization is evaluating or deploying data deduplication, hoping to save considerably on storage. We have put together 13 tips to help you get the most out of data deduplication solutions in your organization.

To begin with, be aware that data deduplication ratios depend heavily on the data type involved (file data, database files and so on). Dedupe ratios are typically higher for file data than for database files.
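The effect of data type on dedupe ratios can be seen with a toy fixed-chunk deduplicator. This is a simplified sketch, not how any particular product works: real systems typically use variable-length chunking and more sophisticated fingerprinting, and the "database-style" data here is simulated as random pages.

```python
import hashlib
import os

def dedupe_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Ratio of total chunks to unique chunks under fixed-size chunking."""
    fingerprints = set()
    total = 0
    for i in range(0, len(data), chunk_size):
        # Fingerprint each chunk; duplicates collapse into one set entry
        fingerprints.add(hashlib.sha256(data[i:i + chunk_size]).hexdigest())
        total += 1
    return total / len(fingerprints)

# File-style data: many identical blocks (e.g. copies of the same document)
file_like = (b"quarterly report boilerplate " * 200)[:4096] * 50

# Database-style data: simulated here as unique pages with no repetition
db_like = os.urandom(4096 * 50)

print(f"file-like: {dedupe_ratio(file_like):.0f}:1")    # 50:1
print(f"database-like: {dedupe_ratio(db_like):.0f}:1")  # 1:1
```

The toy numbers are extreme, but the direction matches practice: user file data contains many repeated blocks across files and backup runs, while database pages tend to be unique.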

  1. For database files, use target-based deduplication, since deduplicating on the host could consume its CPU cycles and impact application performance.
  2. Alter your backup schedules to perform full backups more frequently in place of incremental backups. This increases dedupe ratios and also results in faster recoveries.
  3. Almost every backup vendor offers an OpenStorage Technology (OST) plug-in designed for dedupe storage. Use it to achieve high backup throughput as well as high dedupe ratios.
  4. If you have deployed disk-to-tape staging in your environment, increase the duration for which data resides on disk. This increases dedupe ratios over time.
  5. If you are using virtual tape libraries, disable multiplexing, as it adversely affects dedupe ratios; create additional virtual tape drives for better throughput instead. Disable multiplexing at both the software and the hardware level.
  6. For backups that require encryption, deduplication is not recommended, as it yields little benefit: it is difficult to find repeated patterns within encrypted data.
  7. Disable software compression. Deduplication storage systems compress data after deduplication anyway; compressing data before deduplication obscures the redundant patterns the system looks for and gives it additional work to do during the deduplication process.
  8. File servers, remote servers, and user desktops and laptops, where bandwidth is a limitation, are the best candidates for source-based deduplication. For database servers, use target-based deduplication solutions.
  9. To achieve higher dedupe ratios, divide backup clients into batches and start the backups one after another. This ensures that redundant data does not travel across the network, leading to optimal utilization of the LAN/WAN link. If the backups are triggered all at once, the data first travels across the LAN to the backup destination and only then gets deduplicated.
  10. If backup windows are tight, disable deduplication during the backup itself and run it after the backup completes (post-process deduplication).
  11. Every backup vendor has specific settings for achieving higher dedupe ratios, so read the vendor-specific operational documents to gain insight into the solution.
  12. Many NAS vendors provide deduplication on the primary storage itself. Enabling deduplication on NAS storage systems can prove very beneficial, because there is a high chance of redundant data among many users, and performance is not as critical a factor for user shares. The deduplication process can be scheduled during non-production hours to minimize performance degradation.
  13. For source-based deduplication solutions, size the host cache appropriately. The default cache size suits most applications, but it may need fine-tuning if performance issues arise on the host.
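The reasoning behind tip 6 can be illustrated with a toy stream cipher. This is an illustrative sketch only (the cipher here is a made-up XOR construction, not suitable for real data): two identical plaintext chunks fingerprint identically and are trivially dedupable, but once each backup run encrypts them under a fresh nonce, the ciphertexts no longer match.

```python
import hashlib
import os

def toy_encrypt(chunk: bytes, key: bytes, nonce: bytes) -> bytes:
    """Toy XOR stream cipher for illustration only; do not use in practice."""
    keystream = b""
    counter = 0
    while len(keystream) < len(chunk):
        # Derive keystream blocks from key + nonce + counter
        keystream += hashlib.sha256(
            key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(chunk, keystream))

key = os.urandom(32)
chunk = b"identical backup block " * 100  # same logical data in two backups

# Unencrypted, both copies fingerprint identically -> fully dedupable
print(hashlib.sha256(chunk).digest() == hashlib.sha256(chunk).digest())  # True

# Each backup run encrypts with a fresh nonce -> ciphertexts differ
c1 = toy_encrypt(chunk, key, os.urandom(12))
c2 = toy_encrypt(chunk, key, os.urandom(12))
print(c1 == c2)  # False: no common patterns for the dedupe engine to find
```

Real encryption modes behave the same way by design: any scheme that produced identical ciphertext for identical plaintext would leak information, so encrypted streams look effectively random to a dedupe engine.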
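Tip 7's effect can be demonstrated with the same fixed-chunk assumption (a sketch, not a model of any specific product): two nearly identical backup streams share most of their raw chunks, but after each stream is compressed independently with zlib, the chunk fingerprints no longer line up and the redundancy across backups disappears.

```python
import hashlib
import zlib

def unique_chunks(data: bytes, size: int = 4096) -> set:
    """Set of fingerprints for the fixed-size chunks of the data."""
    return {hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)}

# Two full backups that differ only in a small trailing catalog entry
shared = b"corporate file share contents " * 3000  # ~90 KB identical payload
backup1 = shared + b"backup catalog entry A"
backup2 = shared + b"backup catalog entry B"

raw_overlap = unique_chunks(backup1) & unique_chunks(backup2)
compressed_overlap = (unique_chunks(zlib.compress(backup1)) &
                      unique_chunks(zlib.compress(backup2)))

print(len(raw_overlap))         # plenty of shared chunks to deduplicate
print(len(compressed_overlap))  # 0: compression has hidden the redundancy
```

In other words, compressing on the client squeezes each stream individually but destroys the cross-stream redundancy that deduplication exploits, which is why the dedupe appliance prefers to compress only after deduplication.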

About the author: Anuj Sharma is an EMC Certified and NetApp accredited professional. Sharma has experience in handling implementation projects related to SAN, NAS and BURA. He also has to his credit several research papers published globally on SAN and BURA technologies.
