News Analysis

Users discuss data deduplication doubts

Beth Pariseau, News Writer
LAS VEGAS -- Data deduplication is the hot technology of the moment, and several users at the West Coast Storage Decisions conference said Wednesday that they are investigating deduplication but, so far, aren't buying it.

"We've passed by the last two or three storage fads, including SRM and virtualization," said Brian Peterson, a storage architect for ConAgra Foods, Inc. in Omaha, Neb. "It's always interesting, but we prefer to wait out the 'curve of disillusionment'."

Among the sources of disillusionment with data deduplication products in general, according to experts, is performance. Data Domain's boxes have a throughput of 100 MBps, and though they can be aggregated under one management console, the individual boxes don't share data on the back end. Diligent advertises a throughput of 220 MBps on its ProtecTier virtual tape library (VTL).

Compared with tape libraries and VTLs, which top out around 500 MBps, that's dismal performance, according to Curtis Preston, vice president of data protection services for GlassHouse Technologies Inc. "100 megabytes per second is not up to par for an enterprise backup target, though it can be fine for a remote office," Preston said. He added that single-stream throughput on boxes like Diligent's and Data Domain's, which perform deduplication "inline" -- before data is written to disk -- can be as low as 40 MBps.

"Our goal for backup throughput is 1 terabyte per hour," said Peterson. He said ConAgra does full daily backups of critical databases to the tune of 20 to 30 TB per day, divided up against several VTLs (Peterson declined to name the vendor). "We do about 450 MBps in each stripe across the VTLs. Anything that wants to fit into our backup environment has to perform at SAN speeds."

Another user, David Silvestri, who asked that his company not be named, said he was still trying to figure out where data deduplication would fit into his environment from the opposite angle -- in some cases, his remote offices are too small for it to be practical. "For 500 GB of data, there's no point to having a Data Domain and backup software installation at my remote site to back it up," he said. "There's a possibility that at least 15 of my remote sites are large enough to exist on their own, but not to support a Data Domain box." For those sites, Silvestri said, he was evaluating wide area file services (WAFS) from Riverbed Technology Inc.

Another concern until recently, Silvestri said, was Data Domain's product features. "We looked at them last year, when they had just come out, and we weren't ready to jump on board right away -- but now they have a lot of installations and built-in replication, and that's a lot more appealing."

Concerns over data integrity

Users are also apprehensive about introducing another product into their environment that touches the data and could cause corruption. "Most of our growth is in databases, critical databases where we'd be concerned about anything in the background manipulating the data," said Jim Norris, technical specialist with Worldspan LP, a provider of travel technology services for travel suppliers, travel agencies, and e-commerce sites.

According to Norris, the company backs up 2 TB per day, 1.5 TB of it generated by databases. "It's our impression that any data deduplication product should be aware of the backup software and vice versa," Norris said. "For example, what if Tivoli Storage Manager (TSM) sees the stub and backs up a file twice?"

Norris added, "DBAs want to keep multiple copies of the data -- in fact, there are policies and regulations that dictate that there have to be full copies of the database, not even snapshots, for data protection purposes."

"I have two fears [about deduplication]," ConAgra's Peterson said. "One is that there's software keeping you from your data -- and if that software breaks, you lose all your data with it."

The other, he said, is that reducing the number of copies of data on the back end makes the remaining single instance that much more crucial -- and, in his mind, that much more vulnerable. "It's like an anvil with all its weight on one small point," he said. "If the dedupe product goes down, you don't just lose one backup -- you lose them all."

A lack of evidence

According to Preston, there is another school of thought around data deduplication, represented by Sepaton and FalconStor, whose products deduplicate data "post-process," or after it has been written to disk. This can solve the performance issue during writes to disk, but it can also, at least theoretically, pose a problem when deduplication runs in the background while other backups are still being written.
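
To make the inline/post-process distinction concrete, here is a minimal sketch of hash-based block deduplication in Python. It is a simplified illustration only; the fixed block size and SHA-256 fingerprinting are assumptions for the example, not a description of any vendor's implementation.

    import hashlib

    BLOCK_SIZE = 4096  # arbitrary block size, chosen only for illustration

    def dedupe_stream(data, store):
        """Split a backup stream into fixed-size blocks and keep each unique
        block only once, keyed by its fingerprint. Returns the 'recipe' of
        keys needed to reconstruct the stream on restore."""
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            key = hashlib.sha256(block).hexdigest()
            if key not in store:
                store[key] = block   # new block: store the data once
            recipe.append(key)       # either way, the recipe just records the key
        return recipe

    # An "inline" product does this comparison in the write path, which is why
    # single-stream throughput suffers; a "post-process" product writes the raw
    # stream to disk first and runs the same comparison later in the background,
    # where it can compete with backups still being written.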

"It all comes down to the implementation," Preston said -- and publicly discussed implementations of these products are still relatively rare. "It's not even that the jury's out -- basically, we haven't even seen the evidence yet."

Another point being raised in the dedupe wars is the issue of forward vs. reverse referencing. Most of the dedupe players, Preston said, do reverse referencing, which looks at a block up front, compares it against data already in the backup repository and either keeps it or doesn't, depending on whether it matches. But forward referencing -- which currently is done only by Sepaton -- saves the most recent blocks and compares them against older data, a process Sepaton says improves restore times, since the most recent backup version is kept freshest in its VTL.
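
In the same simplified terms as the sketch above, and again only as an illustration of the concept rather than Sepaton's or anyone else's actual design, the two approaches differ in which copy keeps the physical data:

    def add_backup_reverse(blocks, store):
        """Reverse referencing: the oldest copy of a block keeps the data;
        later backups store only pointers back to it, so the newest backup
        is the most fragmented one to restore."""
        recipe = []
        for key, block in blocks:
            if key in store:
                recipe.append(("ref", key))   # point back at the older copy
            else:
                store[key] = block            # oldest copy holds the data
                recipe.append(("data", key))
        return recipe

    def add_backup_forward(blocks, store, older_recipes):
        """Forward referencing: the newest backup always holds the data;
        older backups that contained the same block are rewritten to point
        forward at the new copy, keeping the latest restore contiguous."""
        recipe = []
        for key, block in blocks:
            store[key] = block                # newest copy holds the data
            recipe.append(("data", key))
            for old in older_recipes:         # demote older copies to references
                for i, (kind, k) in enumerate(old):
                    if kind == "data" and k == key:
                        old[i] = ("ref", key)
        return recipe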

But again, Preston said, "It remains to be seen if that is true."

"I haven't really done enough research yet on Avamar," ConAgra's Peterson said. "I find the idea of not sending duplicate data over the wire appealing, but I haven't really dug into it yet."

"I'm not sure how EMC is going to use [Avamar]," said Mk Mokel, an EMC user who asked that his company, a bank based on the East Coast, not be named for legal reasons. "Whether or not we implement it depends on if they leave it as a standalone product or as a modular feature we can just add in."

Data Domain officials confirmed the 100 MBps performance figure and said performance updates were on the way in the next year; Diligent reps said adding clustered servers to the ProtecTier VTL to boost performance is on the product's roadmap.

