When and where should you dedupe?

Data deduplication has clear benefits, but to realise them you need to plan when and where to dedupe, and whether hardware or software is best suited to the job.

Data growth is stretching backup windows, which is driving more organizations to implement disk-to-disk backup. This has created tremendous interest in data deduplication: because deduplication reduces the capacity that backups consume, data can be retained longer on disk, which increases the likelihood of a disk-based recovery vs. a slower, manual, tape-based recovery.

While deduplication has been a feature of several backup offerings for years, the technology has been most widely adopted in backup hardware, such as virtual tape libraries (VTLs) and network-attached storage (NAS)-based disk targets. Until recently, adopting deduplication in backup software meant switching out the legacy backup product, which, as the hardware-based deduplication vendors have made sure to point out, isn't always a desirable path. Now that mainstream backup software vendors such as CommVault, EMC Corp., IBM Corp. and Symantec Corp. are incorporating data deduplication into their backup products (reducing the amount of disruption caused by implementing deduplication), the question is being asked again: Where does deduplication belong in backup?

Software-based deduplication

Software-based approaches are differentiated in a few ways. First, they have knowledge about the data in the backup stream; they can look at patterns in the data stream (the bytes that make up a file) and determine the optimal segment boundaries, which maximizes the likelihood of identifying duplicates. In short, backup software understands the content, whereas target-side deduplication solutions typically don't. Targets simply receive a "blob" of data from the backup application. Those target-side deduplication devices that are content-aware typically have to extract the metadata associated with the backup and "reverse engineer" the backup stream to understand its contents.
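To make the idea of segment boundaries concrete, here is a minimal Python sketch, not any vendor's algorithm. It assumes a toy shift-and-add hash to pick boundaries from the bytes themselves and SHA-256 digests to spot duplicate segments; the names split_chunks and ChunkStore, and the size parameters, are hypothetical and chosen only for illustration.

```python
import hashlib


def split_chunks(data, mask=0x3F, min_size=16, max_size=256):
    """Cut data into variable-size segments at content-defined points.

    A boundary is declared when the low bits of a toy hash over the
    recent bytes are all zero, so boundaries follow the content rather
    than fixed offsets. All parameters are illustrative assumptions.
    """
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # toy hash, not production grade
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial segment
    return chunks


class ChunkStore:
    """Keeps one copy of each unique segment, keyed by its SHA-256 digest."""

    def __init__(self):
        self.chunks = {}  # digest -> segment bytes

    def write(self, data):
        """Store data and return the list of digests needed to rebuild it."""
        recipe = []
        for chunk in split_chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # duplicates are not stored again
            recipe.append(digest)
        return recipe

    def read(self, recipe):
        """Rebuild the original bytes from a recipe of digests."""
        return b"".join(self.chunks[d] for d in recipe)

    def stored_bytes(self):
        """Total bytes actually held after deduplication."""
        return sum(len(c) for c in self.chunks.values())
```

Writing the same backup twice to one ChunkStore leaves stored_bytes() unchanged on the second pass, because every segment is already present. A real product would use a far stronger boundary function and a persistent index.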

Second, integration with the backup software allows for policy-based deduplication. Deduplication can be disabled for selected data sets where it doesn't make sense to turn it on (such as an MRI image) or for other data types (like databases) where you don't want to interfere with performance.
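A policy of this kind can be pictured as an ordered list of rules consulted before each data set is backed up. The Python sketch below is an illustration only: the DedupeRule type, the should_dedupe function and the file-name patterns (*.dcm for medical images, *.mdf for a database file) are assumptions, not settings from any particular backup product.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class DedupeRule:
    pattern: str  # glob matched against the lower-cased file name
    dedupe: bool  # whether deduplication is enabled for matches
    reason: str   # human-readable note for the backup administrator


# Hypothetical policy; patterns and reasons are illustrative assumptions.
RULES = [
    DedupeRule("*.dcm", False, "medical images: deduplication disabled"),
    DedupeRule("*.mdf", False, "database files: avoid interfering with performance"),
    DedupeRule("*", True, "default: deduplicate"),
]


def should_dedupe(filename, rules=RULES):
    """Return the decision of the first rule whose pattern matches."""
    for rule in rules:
        if fnmatch(filename.lower(), rule.pattern):
            return rule.dedupe
    return True  # no rule matched: fall back to deduplicating
```

For example, should_dedupe("scan_001.DCM") returns False, while should_dedupe("report.docx") falls through to the default rule and returns True. First-match ordering keeps the exceptions explicit and the default at the bottom.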

One of the drawbacks of a software-based approach is that adopting a deduplication feature could require an upgrade of the backup application and/or client agents. Another factor is that deduplication may be processor-intensive and, when performed at the source application server, it may compete with and slow down apps. The scalability and performance of the media server performing deduplication could also be limiting factors. It will be important to investigate the upper limits of deduplication "pools" and performance capabilities for large volumes of data.

Hardware-based deduplication

Hardware-based deduplication is less disruptive; that is, it's seamless to deploy because it's compatible with any backup software and can be implemented quickly and easily. It typically leverages powerful, purpose-built storage appliances to accommodate processing of the entire (non-deduplicated) backup load either pre- or post-ingestion. Hardware-based solutions also have the advantage of processing data streams from multiple backup applications.

There are a few trade-offs to consider. Because deduplication happens at the end of the data path, more data than necessary may traverse the network between the source system and target device, creating avoidable congestion. Depending on the solution, scalability could be another drawback. Some vendors are limited to single-node systems, which can result in multiple islands of deduplication and points of management, as well as underutilization of capacity in each silo. Data streamed to a single-node system is only compared with other data directed to that node.

The goal of many target-side deduplication vendors is to deduplicate across clustered nodes. Global dedupe allows backup data to be deduplicated against all other backup data, regardless of which head actually receives the data. This capability is seen more often in software-based and grid architecture approaches, but may also be supported for target deduplication systems that replicate in a hub-and-spoke fashion (with global deduplication occurring at the hub). Global deduplication can result in higher deduplication ratios -- as data is deduplicated within and across backup sources -- and greater economies of scale with respect to operational overhead and capital costs.
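The difference between per-node "islands" and a global index can be shown with a small Python sketch. It assumes segments are identified by SHA-256 digest and that each node receives a list of already-cut segments; the function names and the sample data are hypothetical.

```python
import hashlib


def digest(chunk):
    """Identify a segment by its SHA-256 digest."""
    return hashlib.sha256(chunk).hexdigest()


def stored_per_node(streams):
    """Count segments kept when every node has its own private index.

    streams maps a node name to the list of segments sent to that node.
    """
    return sum(len({digest(c) for c in chunks}) for chunks in streams.values())


def stored_global(streams):
    """Count segments kept when all nodes share one index."""
    index = set()
    for chunks in streams.values():
        index.update(digest(c) for c in chunks)
    return len(index)


# Illustrative data only: both nodes receive the same "os-image" segment.
streams = {
    "node_a": [b"os-image", b"mail-1", b"os-image"],
    "node_b": [b"os-image", b"mail-2"],
}

print(stored_per_node(streams))  # 4: "os-image" is kept once on each node
print(stored_global(streams))    # 3: "os-image" is kept once overall
```

The gap between the two counts grows with the amount of data that is common across nodes, which is why global deduplication can yield higher ratios.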

Cost is a factor

Enterprise Strategy Group research has found that organizations are just as likely to purchase and implement data deduplication technology from backup software vendors as they are from disk/appliance hardware vendors. The top considerations when evaluating and selecting a data deduplication provider are cost, ease of integration, performance, ease of use and scalability, with cost clearly outranking the others. Now that deduplication is becoming a mainstream feature integrated in backup software, it will be interesting to see if "bolt on" deduplication systems can maintain their premium price.

As with any new technology, it will be important for IT organizations to evaluate software- and hardware-based approaches vs. the requirements of the environment. Having a clear understandin


