The effectiveness of data deduplication is often expressed as a deduplication or reduction ratio, denoting the ratio of protected capacity to the actual physical capacity stored. A 10:1 ratio means that 10 times more data is protected than the physical space required to store it, and a 20:1 ratio means that 20 times more data can be protected. Factoring in data growth and retention, and assuming deduplication ratios in the 20:1 range, 2 TB of storage capacity could protect up to 40 TB of retained backup data.
How are these data deduplication ratios determined? The ratio is calculated by taking the total capacity of data to back up (i.e., the data that will be examined for duplicates) and dividing it by the actual capacity used (i.e., the deduplicated amount of data).
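The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a vendor formula: the function names `dedup_ratio` and `protectable_capacity` are hypothetical, and the figures are the 2 TB and 40 TB values from the example above.

```python
def dedup_ratio(protected_tb: float, stored_tb: float) -> float:
    """Return the deduplication ratio N (read as N:1).

    protected_tb: total capacity of data backed up (examined for duplicates).
    stored_tb:    physical capacity actually used after deduplication.
    """
    if stored_tb <= 0:
        raise ValueError("stored capacity must be positive")
    return protected_tb / stored_tb


def protectable_capacity(physical_tb: float, ratio: float) -> float:
    """Return how much backup data fits on physical_tb at a given ratio."""
    return physical_tb * ratio


# 40 TB of retained backup data held in 2 TB of physical capacity is 20:1.
print(f"{dedup_ratio(40, 2):.0f}:1")

# Conversely, 2 TB of physical capacity at 20:1 protects 40 TB.
print(f"{protectable_capacity(2, 20):.0f} TB")
```

The two functions are inverses of each other for a fixed physical capacity, which is why a quoted ratio only means something once you know which capacity figure it was measured against.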
What's a realistic data dedupe ratio?
Enterprise Strategy Group (ESG) research found that, of respondents currently using data deduplication technology, approximately one-third (33%) report a less than 10 times reduction in capacity requirements; 48% report a 10 times to 20 times reduction; and 18% report reductions ranging from 21 times to more than 100 times.
Several factors influence deduplication ratios, including:
- Data backup policies: the greater the frequency of "full" backups (versus "incremental" or "differential" backups), the higher the deduplication potential since data will be redundant from day to day.
- Data retention settings: the longer data is retained on disk, the greater the opportunity for the deduplication engine to find redundancy.
- Data type: some data types are inherently more prone to duplication than others. Higher deduplication ratios are more likely if the environment contains primarily Windows servers with similar files, or VMware virtual machines.
- Rate of change: the smaller the rate of change, the higher the likelihood of finding duplicate data.
- Deduplication domain: the wider the scope of the inspection and comparison process, the higher the likelihood of detecting duplicates. Local deduplication refers to the examination of redundancy at the local resource, while global deduplication refers to inspecting data across multiple sources to locate and eliminate duplicates.
To see how backup policy and retention interact, consider a daily full backup of data changing at a rate of 1% or less that is retained for 30 backups: 99% or more of each backup after the first duplicates data already stored. After 30 days, the ratio could approach 30:1. If, on the other hand, weekly full backups were retained for a month, only four backups would be on disk and the ratio would reach 4:1 at most.
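The daily versus weekly full backup example can be worked through with a simple model. This is only a sketch under stated assumptions, not how any particular product computes its ratio: the data set stays the same size, the first full backup is stored whole, and each later backup adds only its changed fraction as new unique data. The function name `retained_full_backup_ratio` is hypothetical.

```python
def retained_full_backup_ratio(backups: int, change_rate: float) -> float:
    """Estimate the dedupe ratio for `backups` retained full backups.

    Assumed model (illustrative only): data set size is constant, the
    first backup is stored in full, and every later backup contributes
    `change_rate` (0.01 == 1%) of the data set as new unique data.
    """
    if backups < 1:
        raise ValueError("at least one backup must be retained")
    if not 0.0 <= change_rate <= 1.0:
        raise ValueError("change_rate must be between 0 and 1")
    protected = float(backups)                    # in units of one full backup
    stored = 1.0 + (backups - 1) * change_rate    # first full + changed data
    return protected / stored


# 30 daily fulls: the ceiling (no change at all) versus a 1% change rate.
print(f"{retained_full_backup_ratio(30, 0.0):.1f}:1")
print(f"{retained_full_backup_ratio(30, 0.01):.1f}:1")

# 4 weekly fulls retained for a month: the ceiling is the backup count.
print(f"{retained_full_backup_ratio(4, 0.0):.1f}:1")
```

Under this model the ratio can never exceed the number of retained full backups, so 30:1 and 4:1 are ceilings reached as the change rate approaches zero. At exactly 1% change between daily backups, the model gives roughly 23:1 rather than the full 30:1.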
Deduplication figures can be confusing because some vendors express reduction as a percentage of savings instead of a ratio. If a vendor cites a 50% capacity savings, it's equivalent to a 2:1 deduplication ratio. A ratio of 10:1 is the same as 90% savings, which means that 10 TB of data can be backed up to 1 TB of physical storage capacity. Doubling the ratio to 20:1 increases the savings by only 5 percentage points (to 95%).
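The conversion between the two ways of quoting reduction is simple arithmetic: savings = 1 - 1/ratio. A minimal sketch follows; the helper names `ratio_to_savings_pct` and `savings_pct_to_ratio` are hypothetical and are here only to make vendor figures comparable.

```python
def ratio_to_savings_pct(ratio: float) -> float:
    """Convert an N:1 deduplication ratio to percent capacity savings."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1 (1:1 means no savings)")
    return (1.0 - 1.0 / ratio) * 100.0


def savings_pct_to_ratio(savings_pct: float) -> float:
    """Convert percent capacity savings to the N of an N:1 ratio."""
    if not 0.0 <= savings_pct < 100.0:
        raise ValueError("savings must be in the range [0, 100)")
    return 1.0 / (1.0 - savings_pct / 100.0)


# Print the savings for a few ratios to show the diminishing returns.
for ratio in (2, 10, 20, 100):
    print(f"{ratio}:1 -> {ratio_to_savings_pct(ratio):.0f}% savings")
```

Because savings approach 100% asymptotically, large jumps in the ratio translate into small gains in physical capacity: going from 10:1 to 20:1 moves savings from 90% to 95%, and even 100:1 yields only 99%.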
Evaluating a dedupe product
When evaluating data deduplication, it's important to trial vendors' products in your environment with your own data over several backup cycles to determine a product's impact on your backup/recovery environment. Reduction ratios should carry less weight as a decision factor than they often do. ESG research (ESG Research Report, "Data Protection Market Trends," January 2008) found that, not surprisingly, the cost of the deduplication solution was the most frequently cited factor (although savings garnered from capacity reduction often overcome financial objections to deploying deduplication). Otherwise, the survey data suggests that ease of deployment and ease of use, as well as the impact on backup/recovery performance, were important considerations -- more so than technical details of the implementation, such as the deduplication ratio.