Online data deduplication calculators deliver unrealistic results

Free online calculators that promise to show what data deduplication will do for you produce some dodgy numbers.

If you've considered data deduplication products, you may be familiar with the online calculators on the websites of EMC's Data Domain, CommVault Systems, NetApp and others that purport to show dedupe ratios. Analysts say that while these calculators are more than mere marketing gimmicks, they're useful only to a point and fail to show which vendor's deduplication will give you the most bang for the buck.

The problem with the calculators, analysts say, is that they're too simplistic to take into account all variables involved with assessing data reduction ratios from a deduplication system.

"Any kind of calculator is going to attempt to take an abstraction," said Dave Russell, vice president for Gartner Research, with a specialty in storage and servers. "It's going to take a limited set of information, make some assumptions and then come up with a result, probably a result that's favorable to the technology that's being positioned. I'm not trying to suggest that deduplication doesn't offer a lot of benefits, but … it's not a simple model."

"Most of the [deduplication calculators] out there [give you] gross approximations," said Taneja Group senior analyst and consultant Brett Cooper.

Because the vendors take different variables into account with their calculators, it's impossible to make apples-to-apples comparisons between products. For instance, some calculators ask for values around data backup methodology, backup window, retention period and offsite storage while other calculators ask for a single value such as weekly backup size.

What factors determine data deduplication ratio?

So what factors actually should play into determining a data reduction ratio? The type of data you want to dedupe, redundancy level, size of the backup and retention periods all play critical roles.

Russell said different data types lend themselves to deduplication at different rates. "Is it file data? Is it email? Or is it application data?" he asked.

The redundancy level of data is another key factor. Russell said redundancy doesn't simply correlate to file types. Organizational habits also have an impact.

"[Say that] we've done a very, very good job, perhaps, of sharing files, not creating multiple copies, multiple versions. The level of redundancy that's inherent in the organization -- usually that comes down to how much file sharing [a company is] doing, [whether they're using] check-in/check-out types of procedures," he said. A company with that kind of profile would probably see lower deduplication ratios than a company that, all other things being equal, doesn't do a lot of file sharing.

Next comes the size of the backup set, data growth rate and backup methodology. "You [need to] understand what the size of the backup set is that people are working with and how often they're doing their backups, whether [it's] full/incremental," Enterprise Strategy Group analyst Lauren Whitehouse said. "For example, if I have a full/incremental strategy, I've already taken off some percentage of deduplication [away] just by the virtue of not doing a full every day." And low data growth rates correlate to a lower level of duplicate data.
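A rough sketch can make the methodology effect concrete. The Python below is a toy model, not any vendor's formula: the function name `dedupe_ratio`, the 1% daily change rate and the assumption that the data set does not grow are all illustrative. It compares the data sent to the backup target with the unique data a deduplicating store would plausibly keep.

```python
def dedupe_ratio(primary_tb, days, change_rate, full_every_n_days):
    """Estimate the logical-to-stored ratio for a backup schedule.

    Toy model (an assumption, not a vendor formula): a full backup sends the
    whole data set, an incremental sends only that day's changed data, and
    the deduplicated store keeps one baseline copy plus each day's changes.
    """
    logical = 0.0
    for day in range(days):
        if day % full_every_n_days == 0:
            logical += primary_tb                # full backup
        else:
            logical += primary_tb * change_rate  # incremental backup
    # Unique data: the first full, plus the changes from every later day.
    stored = primary_tb + primary_tb * change_rate * (days - 1)
    return logical / stored


# Hypothetical 10 TB data set, 1% daily change, four weeks of backups.
daily_fulls = dedupe_ratio(10, 28, 0.01, full_every_n_days=1)
weekly_fulls = dedupe_ratio(10, 28, 0.01, full_every_n_days=7)
print(f"daily fulls:  {daily_fulls:.1f}x")
print(f"weekly fulls: {weekly_fulls:.1f}x")
```

Under these assumptions, daily fulls come out at about 22 to 1 and a weekly full with daily incrementals at about 3 to 1. The same data set yields very different ratios purely because of how it is backed up.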

"The real level of duplication comes from the backup methodology itself," Russell said. "The more fulls, the more the chance for data reduction."

Retention periods are important in dedupe ratios because the longer the data is repeatedly backed up, the greater the amount of duplicate data. Russell suggested that most vendors' deduplication claims are based on an assumption of a three- to four-month retention period.

"If you happen to be an organization that says, 'We only retain that for 30 days, not 16 weeks,' right off the bat, that's at least 12 full backups that won't contribute to part of the dedupe calculation," he said.

Whitehouse said a company would need to retain data for at least 30 days to get the benefits of deduplication.
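Russell's 30-day versus 16-week arithmetic can be reproduced in a few lines. The sketch assumes one full backup per week, which his figures imply but the quote does not state; `retained_fulls` is a hypothetical helper written for this illustration.

```python
def retained_fulls(retention_days, days_between_fulls=7):
    """Count the full backups that fit inside a retention window.

    Assumes one full backup every `days_between_fulls` days (weekly by
    default); partial weeks do not count as a retained full.
    """
    return retention_days // days_between_fulls


vendor_assumption = retained_fulls(16 * 7)  # 16-week retention
short_retention = retained_fulls(30)        # 30-day retention
print(vendor_assumption - short_retention)  # fulls missing from the dedupe pool
```

With weekly fulls, the 16-week window holds 16 full backups and the 30-day window holds 4, so the shorter policy gives up 12 full backups that would otherwise deduplicate against one another.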

So determining dedupe ratios is a complex process, and each product takes a different approach to deduplication. But most vendor calculators we looked at asked for just a few pieces of data, rather than the full list above. And except for Data Domain, the vendors offered no explanation of the assumptions that go into the calculation. How, then, should an IT organization proceed in its evaluation of deduplication systems, short of hiring a consultant to analyze its data and determine what ratio it will get?

"Use 10x," Whitehouse said. Instead of putting a lot of thought into the exact ratio an organization will get from a particular product, she recommends using the same 10x ratio for all products being considered.

But doesn't the one-size-fits-all ratio assumption mean that whatever differences in ratio exist among competing products would have to be ignored? Yes, said Whitehouse. "These calculators [are] not a way to compare one vendor versus another," she said. "[They're] basically just going to show you how much of a solution you're going to have to purchase."
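As a minimal sketch of that sizing exercise, the Python below applies a flat 10x ratio to the backup data a retention policy would hold. Everything except the 10x figure is an assumption made for illustration: the function name `required_capacity_tb`, the weekly-full/daily-incremental schedule, the sample sizes and the 25% growth headroom.

```python
def required_capacity_tb(weekly_full_tb, daily_incremental_tb, retention_days,
                         dedupe_ratio=10.0, growth_headroom=0.25):
    """Ballpark the deduplicated capacity needed for a retention window.

    Assumes one full backup per week and an incremental on every other day,
    then divides the retained logical data by a flat dedupe ratio and adds
    headroom for growth. All defaults are illustrative, not vendor guidance.
    """
    fulls = retention_days // 7
    incrementals = retention_days - fulls
    logical_tb = fulls * weekly_full_tb + incrementals * daily_incremental_tb
    return logical_tb / dedupe_ratio * (1 + growth_headroom)


# Hypothetical shop: 10 TB weekly full, 0.5 TB daily incremental, 60 days kept.
print(f"{required_capacity_tb(10, 0.5, 60):.2f} TB")
```

For this hypothetical shop the model lands at a little over 13 TB. Running the same numbers for every product on the shortlist keeps the comparison about cost and manageability rather than about each vendor's claimed ratio.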

Whitehouse added that ESG research has shown that cost, ease of use and ease of management are key factors in an organization's choice of deduplication system, more important than deduplication ratio.

"Now, the calculator is supposedly a tool to help people figure out what the cost of that solution might be. So it's important to understand some ballpark figure on the deduplication so you can say, 'With a 10 TB system, we think you can house all of [the] backups for 30 to 60 days and have room for additional growth.' Until you get a solution in house and you try it on real data with real policies around your data, you're really not going to understand the type of reduction ratio you're getting."
