Data deduplication rockets ahead

Now that every vendor has jumped on the data deduplication bandwagon, users must carefully figure out which product suits their environment.

Nearly halfway through 2007, storage managers have made up their minds on the merits of data deduplication technology.

"I wouldn't buy a secondary storage device today that doesn't have it," said Michael Thomas, storage architect with the Federal Reserve Bank, at a recent Storage Decisions conference.

It's easy to see why. The latest virtual tape libraries (VTL), which include data deduplication as a feature, claim to offer users as much as a 50-to-1 reduction in storage footprint by deduplicating redundant backup data. The savings in cost per gigabyte stored can be huge.

"With deduplication turned on, the economics of today's VTLs are comparable to tape," according to Robert Amatruda, analyst with IDC. Curtis Preston, vice president of data protection services at GlassHouse Technologies Inc., estimates the cost of a midrange tape library at roughly $4 to $11 per gigabyte, with disk prices hovering around $3 to $11 per gigabyte before compression or deduplication.

VTL providers estimate that with a retention period of one year for weekly full backups and 10 days for daily incremental backups, a single terabyte (TB) of data requires 53 TB of capacity for data protection over its life. With backup capacity multiplying at this rate, users are clamoring for any way to contain these costs.
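The 53 TB figure can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is illustrative only: the article does not say how vendors arrive at the number, so the assumption that each daily incremental is about 10% the size of a full backup is ours, as are the function and parameter names. The 50-to-1 ratio is the vendors' best-case claim quoted above, not a measured result.

```python
def protected_capacity_tb(primary_tb, weekly_fulls=52, daily_incrementals=10,
                          incremental_fraction=0.10):
    """Raw backup capacity needed to protect primary_tb of data.

    weekly_fulls: full backups retained (one year of weekly fulls).
    daily_incrementals: incremental backups retained (10 days).
    incremental_fraction: ASSUMPTION, size of one incremental relative
        to a full backup; not a figure from the article.
    """
    fulls = weekly_fulls * primary_tb
    incrementals = daily_incrementals * incremental_fraction * primary_tb
    return fulls + incrementals


def deduplicated_capacity_tb(raw_tb, ratio):
    """Capacity left after deduplication at a claimed ratio (e.g. 50 for 50-to-1)."""
    return raw_tb / ratio


raw = protected_capacity_tb(1.0)
print(f"Raw capacity for 1 TB of primary data: {raw:.1f} TB")
print(f"At a claimed 50-to-1 ratio: {deduplicated_capacity_tb(raw, 50):.2f} TB")
```

Under these assumptions, the weekly fulls account for 52 TB of the total and the incrementals for the remaining 1 TB, which is why repeated full backups are the main target for deduplication.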

Deduplication products have stepped in to help users curb this growth. The key suppliers include Data Domain Inc., Diligent Technologies Corp., ExaGrid, FalconStor Software Inc., Network Appliance Inc. (NetApp), NEC, Quantum Corp., Sepaton and Symantec Corp. EMC Corp. acquired Avamar Technologies and plans to incorporate its dedupe technology across its backup portfolio later this year. Hitachi Data Systems (HDS) has partnered with Diligent, and IBM with NetApp.

"The merits of data deduplication are abundantly clear," said Arun Taneja, founder and consulting analyst with Taneja Group. However, he says the different methods of deduping data and the resulting reduction ratios are very fuzzy. Users should test the products thoroughly and with their own data sets, he warns, as vendors have found skillful ways to spin the numbers, none of which should be taken at face value.

Guna Shankar Selvaraj, IT infrastructure architect at Motorola Inc., says his company is evaluating Data Domain, but that they're in the "very early stages."

Similarly, the Federal Reserve Bank's Thomas says that he will test all the data deduplication products for six to eight months before committing to buying anything. "I want to know how many copies of the index [the product] will hold, and what happens if it gets corrupted … the integrity of that is very important," he said.

Another user concerned with recovering data after deduping is Richard Dearmon, enterprise storage architect with UIC Medical Center. "I want it, but it's not clear to me what happens to secondary and tertiary copies," he said. Across the board, users are eager to evaluate the technology, but still have lots of questions.

A few have already taken the plunge. CitiStreet, which keeps 50 TB of backup data on Sepaton's VTL, has seen a 56-to-1 reduction in its backup set using that product's deduplication technology. The firm has had the product in test for a couple of months and plans to move it into production by the end of July. There were some initial performance problems that CitiStreet was able to iron out with Sepaton's help. "Their deduplication product is like a black box to the user -- they came in and flipped some switches, compressed some small files," and now it's working as advertised, according to Jeff Machols, vice president of global infrastructure at CitiStreet. With the reduction in data, CitiStreet is able to keep more long-term retention online instead of worrying about tape storage. "We can keep at least a year's worth of data online now for backup and recovery," Machols said. "We don't have to worry about rotating to other storage."

Smoking guns

There are a couple of smoking guns that could slow down the adoption of deduplication products. Users are concerned about how deduplication, encryption and compression can all work together in a coordinated manner.

"Sometimes these features can be at cross purposes … it's important to figure out the profile of your data, as not all of it will deduplicate well," Motorola's Selvaraj said.
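Selvaraj's point about profiling data can be illustrated with a rough estimator. The sketch below is a simplification, not how any product named in this article works: it assumes fixed-size 4 KB chunks identified by SHA-256 hashes, and the function name and parameters are hypothetical. It shows why highly repetitive data deduplicates well while random-looking data, which is what encrypted output resembles, barely deduplicates at all.

```python
import hashlib
import os


def dedupe_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Estimate a deduplication ratio using fixed-size chunk hashing.

    Returns total bytes divided by the bytes in unique chunks.
    ASSUMPTION: fixed-size chunks and SHA-256 fingerprints; commercial
    products may chunk and index data differently.
    """
    if not data:
        return 1.0
    unique = {}  # chunk fingerprint -> chunk length in bytes
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        unique[hashlib.sha256(chunk).digest()] = len(chunk)
    return len(data) / sum(unique.values())


# Fifty copies of the same 4 KB block, like repeated full backups.
block = bytes(range(256)) * 16
repetitive = block * 50

# Random bytes stand in for encrypted data.
random_like = os.urandom(len(repetitive))

print(f"Repetitive data: {dedupe_ratio(repetitive):.1f}-to-1")
print(f"Random-looking data: {dedupe_ratio(random_like):.1f}-to-1")
```

Running an estimator like this over a sample of real backup data gives only a coarse indication, which is one reason analysts recommend testing the actual products against your own data sets.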

Another outlying concern is power consumption as more and more storage goes online. We talked with one user who was recently forced to turn off several Data Domain boxes because of power consumption issues. He requested anonymity because of the sensitivity of the topic.

"The product was working great … and then our facilities guy came in and said either you figure out what to turn off, or I'll have to start pulling plugs … we're out of power," the user said. The Data Domain gear was the last product in and first out of the data center. "We're back to tape for energy efficiency."

It's unclear at this stage how severely storage managers will be affected by the recent energy crunch, but the problem appears to be filtering through to all departments in IT. According to a recent Gartner report, "By 2008, 50% of current data centers will have insufficient power and cooling capacity to meet the demands of high-density equipment." Through 2009, Gartner says energy costs will emerge as the second highest operating cost in 70% of worldwide data center facilities.
