Large firms clamour for data deduplication

Users from large companies at Storage Decisions say they need to rein in backup data that's out of control, but have had trouble finding products that fit their requirements.

NEW YORK, NY -- Data deduplication is a hot topic at this year's Storage Decisions conference, with users saying they're gung-ho about deploying the technology. However, those with large storage environments say they've had trouble finding a product that fits their requirements.

Brian Greenberg, director of data protection services for a large financial company based in Chicago, called data deduplication the "Holy Grail" of disk-based backup during a presentation Wednesday.

Still, Greenberg's company, which he declined to name, is sticking to tape for backup for now while waiting for deduplication to become more useful for disaster recovery.

A cost analysis model Greenberg performed using systems-analysis software called iThink from Isee Systems showed that with a three-year retention scheme, the cost of media for about 68,000 tapes over the next five years would amount to $3.4 million. The cost of disk capacity for the same amount of data, not including power and cooling, comes out to $103 million -- and twice that amount for replication. However, he said, data deduplication at a ratio of 30:1 brought the disk costs down to about $3.2 million. "Data deduplication is the key to being able to do disk-based backup in our environment," he said.
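Greenberg's figures can be checked with simple back-of-the-envelope arithmetic. The sketch below uses only the numbers from his presentation; the helper function and per-ratio division are illustrative assumptions, not his iThink model.

```python
# Figures cited in Greenberg's presentation (five-year, three-year retention).
TAPE_MEDIA_COST = 3.4e6   # media for ~68,000 tapes
RAW_DISK_COST = 103e6     # same data on disk, excluding power and cooling
DEDUPE_RATIO = 30         # 30:1 reduction he cited

def deduped_disk_cost(raw_disk_cost, ratio):
    """Disk cost after deduplication shrinks stored data by `ratio` (assumes
    cost scales linearly with capacity -- an illustrative simplification)."""
    return raw_disk_cost / ratio

cost = deduped_disk_cost(RAW_DISK_COST, DEDUPE_RATIO)
print(f"Deduplicated disk: ${cost / 1e6:.1f}M vs. tape: ${TAPE_MEDIA_COST / 1e6:.1f}M")
# $103M / 30 is roughly $3.4M, in line with the ~$3.2M figure from the talk;
# replication would double the disk-side numbers.
```

The point of the exercise is that deduplication, not cheaper disk, is what closes the roughly 30x cost gap between disk and tape in his environment.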

So why isn't he using it? Greenberg said he will not deploy a data deduplication appliance until he finds one that can copy its deduped data store and its index to tape for disaster recovery purposes. He could copy data from most data deduplication systems to tape by "rehydrating" the data and backing it up separately, but Greenberg said he wants to save space on tape, too. "Being able to back up the catalog is a standard feature of a tape backup environment," he said. "Many of the vendors have asked me why I'd want to do tape backup when I can replicate between systems, but what if there's a rolling disaster that corrupts both?"

Pete Fischer, storage administrator for a large paper and packaging manufacturing company based in the South, said his company is desperate to find a product that can reduce the 400 TB of data it must protect every 24 hours. The company uses IBM's Tivoli Storage Manager (TSM) to send data from EMC Clariion CX500, 600 and 700 systems with a total of 27 TB usable capacity to Clariion Disk Library (CDL) virtual tape library (VTL) systems.

"We have barely enough room to keep our incremental backup data in the disk pool," Fischer said. Any overflow gets sent directly to the CDLs, which are also trying to back up data from the disk pool, causing bottlenecks. Fischer also said he's running out of capacity in his tape libraries, estimating that a fully populated Sun StorageTek SL8500 has about 30 percent of the drives he needs.

Fischer's company has brought in a Data Domain box for testing. He's also evaluating Diligent Technologies, but favors Data Domain because Diligent is strictly a VTL. "We're leery of VTL and tape in general at this point," Fischer said. His firm is putting Data Domain DD560 systems through rigorous performance testing, and Fischer said he's not satisfied with the product's scalability. The DD560s hold just over 1 TB of disk apiece, so he will need to deploy at least eight boxes and silo his data according to application. "What I want is to have the boxes be aware of each other, and to be able to get even more data reduction across applications," he said.

Mark Glazerman, storage and backup admin for a plastics manufacturing company, is happily running Data Domain DD560 and DD430 boxes to back up 25 TB. Glazerman said his most recent monitoring reports from his Data Domain systems show an average throughput of 10 MBps over 24 hours. That satisfies Glazerman, but won't work for everybody. [Update: Following publication of this article, Glazerman contacted us to clarify that the 10 MBps throughput rate reported by the system is per drive, rather than for his entire system. At 15 drives, the entire system is getting an average throughput of 130 MBps, Glazerman said.]

Jannes Kleveberg, solution area manager for ATEA, a consulting firm that manages storage at a large automobile manufacturer's facilities in Europe, has considered deduplication for his client's 600 TB shop. He heard Glazerman's per-drive performance numbers with Data Domain and said "that kind of performance won't do in a large environment."

Kleveberg said he's concerned about post-process systems causing contention with the servers they draw data from after the backup window is over. "For us it always comes back to the performance issue," Kleveberg said.

Data Domain's director of product management, Ed Reidenbach, said users may point fingers at deduplication when they have poor performance because it's an unfamiliar technology. "We spend a lot of time debugging customer networks to resolve the issue, but since we're the new player in the environment [users] think we're the problem," he said. According to vice president of marketing Beth White, Data Domain is working on letting individual boxes connect through a global namespace to scale better. "We're still pushing the upper limits of our product," she said. "All of us [vendors] in this market are still working our way up the food chain to those megascale data center environments."
