You should consider the following three questions when choosing a data deduplication product. Once you know the critical success factors that come out of these questions, you can decide on a data deduplication methodology and eliminate certain products.
Question 1. Do you want to rearchitect your backup and recovery infrastructure or integrate a dedupe product into your existing environment?
If you don't want to rearchitect your backup environment, you need to think about what deduplication functionality your data protection software has built in, or which products will integrate well with your existing environment. However, you may have to compromise on factors such as functionality, manageability or scalability to find a product that integrates with your current environment. Stay focused on the product that will best suit your requirements, and don't get too hung up on the underlying technology.
On the other hand, if your backup and recovery environment is due for a technical refresh, you can instead think about the end result you want to achieve and select the overall environment which best meets your requirements.
Question 2. Are you more concerned with saving bandwidth during data backup or with saving storage?
The answer to this question will most likely determine whether source deduplication or target deduplication best suits your needs.
Source deduplication reduces data on the backup client, while target deduplication reduces data inside the appliance used to store backups. Source deduplication requires backup software on the client and backup server. Target deduplication requires a disk repository, usually a virtual tape library (VTL) or network-attached storage (NAS) device.
Source deduplication is often used in the backup of remote offices where wide-area network (WAN) bandwidth is a constraint. Other areas of significant gain are in file system backups and virtual environments.
Target deduplication is found in most deduplication hardware products and is aimed primarily at reducing the amount of backup data stored, by eliminating redundant data once it reaches the backup target. Target deduplication sees the greatest gains when applied against storage-area network (SAN) environments, local-area network (LAN)-attached backups or large databases where bandwidth isn't such an issue.
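The difference between the two approaches can be illustrated with a minimal Python sketch. It is not how any particular product works: the `Repository` class, the tiny fixed 4-byte chunk size and both backup functions are hypothetical, and the small amount of traffic needed to exchange chunk hashes in the source case is ignored. It only shows that both approaches store the same amount of data, while source deduplication also sends less over the network.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, purely for illustration


def chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class Repository:
    """Hypothetical backup target that keeps each unique chunk once."""

    def __init__(self):
        self.store = {}  # chunk digest -> chunk bytes

    def has(self, digest: str) -> bool:
        return digest in self.store

    def put(self, chunk: bytes) -> None:
        self.store[hashlib.sha256(chunk).hexdigest()] = chunk

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.store.values())


def target_dedupe_backup(data: bytes, repo: Repository) -> int:
    """Client sends every chunk; the target discards duplicates."""
    sent = 0
    for chunk in chunks(data):
        sent += len(chunk)
        repo.put(chunk)
    return sent


def source_dedupe_backup(data: bytes, repo: Repository) -> int:
    """Client hashes each chunk and sends only those the target lacks."""
    sent = 0
    for chunk in chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        if not repo.has(digest):
            sent += len(chunk)
            repo.put(chunk)
    return sent


data = b"AAAABBBBAAAACCCCAAAA"  # 20 bytes, 3 unique chunks
target_repo, source_repo = Repository(), Repository()
print(target_dedupe_backup(data, target_repo))  # 20 bytes sent
print(source_dedupe_backup(data, source_repo))  # 12 bytes sent
print(target_repo.stored_bytes(), source_repo.stored_bytes())  # 12 12
```

In this toy case the storage saving is identical, which is why the bandwidth question is usually the deciding factor between the two.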
Question 3. Are you interested in tape reduction or tape elimination?
Given the obvious difference in cost between tape and disk, if you have a requirement to keep data for long retention periods, for instance for legal or compliance reasons, then for the time being you're probably not going to get away from tape altogether. Therefore, you need to consider how the data deduplication product put in place will integrate with auxiliary tape copies.
Alternatively, you could look at some products on the market which integrate with tape devices and can facilitate tape copies directly from the deduplication unit. Often, the products which integrate best with tape are those which present themselves to the backup infrastructure as a VTL, as these already have data logically arranged onto virtual tape cartridges.
Reduction of tape comes from efficiencies in disk storage which allow you to retain a longer period of backup data before having to make an auxiliary copy of the data to tape. Rather than creating a tape copy every day, you would instead only create an auxiliary copy of backup data to meet long retention requirements.
Another consideration when thinking about tape in the backup environment is that, while some dedupe vendors let you make deduplicated tape copies of backup data, this approach should be taken with caution.
Data deduplication relies on removing recurring blocks of data and providing pointers back to the first instance of that data. That first instance could have been captured some time ago, and therefore be located on an older tape. This means that to restore from tape and piece together all the common data blocks, you may require parts of backup images from across a large number of tapes. This, in turn, means that tapes which contain blocks of data for active backup images can't be recycled, as these tapes would be required in the event of a recovery.
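The tape dependency described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not a vendor implementation: the `DedupedTapeSet` class, the tape labels and the 4-byte chunk size are all hypothetical, and each backup is assumed to write only its new blocks to its own tape.

```python
import hashlib


def chunk(data: bytes, size: int = 4):
    """Split data into fixed-size blocks (tiny size, for illustration)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupedTapeSet:
    """Hypothetical model of deduplicated backup images spread over tapes."""

    def __init__(self):
        self.location = {}  # block digest -> tape holding its first instance
        self.images = {}    # backup name -> ordered list of block digests

    def backup(self, name: str, data: bytes, tape: str) -> None:
        digests = []
        for block in chunk(data):
            digest = hashlib.sha256(block).hexdigest()
            # Only the first instance of a block is written to tape;
            # later backups just point back to it.
            self.location.setdefault(digest, tape)
            digests.append(digest)
        self.images[name] = digests

    def tapes_needed(self, name: str) -> set:
        """Tapes required to restore the named backup image."""
        return {self.location[d] for d in self.images[name]}

    def recyclable(self, tape: str, active: list) -> bool:
        """A tape can be recycled only if no active image points to it."""
        return all(tape not in self.tapes_needed(n) for n in active)


tapes = DedupedTapeSet()
tapes.backup("mon", b"AAAABBBB", "tape-1")
tapes.backup("tue", b"AAAACCCC", "tape-2")
tapes.backup("wed", b"AAAACCCCDDDD", "tape-3")

print(sorted(tapes.tapes_needed("wed")))  # ['tape-1', 'tape-2', 'tape-3']
print(tapes.recyclable("tape-1", ["tue", "wed"]))  # False
```

Even though Monday's backup has expired in this example, its tape still holds a block that Tuesday's and Wednesday's images point to, so it has to be kept.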
If you're looking at tape elimination, then the replication functionality of the deduplication product is a key factor. Whether you're keeping data for one week or one year, you have to be able to get that data offsite so it's safeguarded in the event of a disaster.
Once you understand the impact of the above questions, you can begin to eliminate various products. When you have a shortlist of suitable ones, you can compare the functionality of inline vs. post-process data deduplication, hash-based vs. content-aware algorithms, software vs. hardware, and so on.
This was first published in March 2010