How to avoid a data deduplication fiasco

If a storage administrator doesn't have a solid understanding of data deduplication, there's a good chance the deployment will be wasted effort.

According to the Gartner Hype Cycle for hardware storage technologies, data deduplication is beginning its descent into the trough of disillusionment. But was it ever really at the peak?

Coupled with virtual tape library (VTL) technology, deduplication was once billed as the answer to every backup capacity problem. Unfortunately, the market for deduplication hasn't taken off as expected, mainly due to two things:

  1. how its benefits have been communicated
  2. a lack of basic understanding of the technology itself

To successfully implement data deduplication, a storage administrator must understand the advantages as well as the disadvantages of the technology. There are three methods: host level, backup stream and post backup stream.
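All three methods rest on the same core mechanism: split the incoming data into chunks, fingerprint each chunk, and store only the chunks whose fingerprints haven't been seen before. The sketch below is a minimal illustration of that idea in Python; the fixed 4 KB chunk size and SHA-256 fingerprints are assumptions for clarity, not any vendor's actual implementation (real products typically use variable-size chunking).

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking for illustration; real products often chunk variably

def dedupe_store(stream: bytes, store: dict) -> list:
    """Split a byte stream into chunks, keep one copy of each unique chunk,
    and return the list of fingerprints needed to rebuild the stream."""
    recipe = []
    for i in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in store:   # lookup-table miss: genuinely new data
            store[fp] = chunk
        recipe.append(fp)     # hit or miss, the recipe stores only the reference
    return recipe

# Two "full backups" that differ in just one chunk
store = {}
full1 = b"A" * (4096 * 99) + b"B" * 4096
full2 = b"A" * (4096 * 99) + b"C" * 4096
dedupe_store(full1, store)
dedupe_store(full2, store)
print(len(store))  # 3 unique chunks stored instead of 200
```

The three methods differ only in where and when this loop runs: on the application server (host level), in the data path (backup stream), or against staging disk after the backup finishes (post backup stream).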

Host-level dedupe significantly reduces the amount of data pushed over the backup network, but it requires the dedupe processing to run on the servers themselves. Most application servers are configured with significant processing power, but that capacity exists to serve their applications, not to back them up.

Dedupe in the backup stream is the most prominent method of data reduction. Processing takes place at the dedupe agent after the data has traversed the backup network, which can significantly reduce throughput between the network and the storage destination, usually disk.

The third method is post backup stream. This places the least load on the backup environment, as it processes the data only after the backup is complete. On the other hand, a significant amount of primary disk is required to stage the entire backup before the data can be deduped to another storage destination. A 10 TB full backup, for example, needs at least 10 TB of staging disk before post-process dedupe can reclaim any space. In some cases, that's a lot of disk.

The type of data being moved to that disk, and the decision on how long it's kept, play major roles in deciding on support technology. With the increasing risk of data loss, regulatory fines and even corporate espionage, encryption is becoming a hot technology.

Unfortunately, all encryption technologies that process data before or during the backup stream are incompatible with dedupe. A data stream that is never the same twice because it has been scrambled can never be matched against a previous copy of the same data, and so cannot be deduped. Some VTL vendors are making forays into post-dedupe encryption, but as it stands the most mature solution is to encrypt at the tape drive level.
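The conflict is easy to demonstrate. The snippet below uses a toy XOR keystream as a stand-in for real encryption (it is not a secure cipher, just an illustration of the property that matters here): any sound cipher guarantees that the same plaintext encrypted under different keys produces different ciphertext, so the fingerprints a dedupe engine relies on never match.

```python
import hashlib
import os

def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Toy stand-in for encryption: XOR with a SHA-256-derived keystream.
    Illustrative only; the point is that identical plaintext yields
    different ciphertext under different keys."""
    keystream = bytearray()
    counter = 0
    while len(keystream) < len(data):
        keystream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, keystream))

plaintext = b"identical backup data" * 100

# Unencrypted: two copies of the same data share one fingerprint -> dedupe works
assert hashlib.sha256(plaintext).digest() == hashlib.sha256(plaintext).digest()

# Encrypted with a fresh key per backup session: fingerprints never match -> nothing dedupes
ct1 = toy_encrypt(plaintext, os.urandom(32))
ct2 = toy_encrypt(plaintext, os.urandom(32))
assert hashlib.sha256(ct1).digest() != hashlib.sha256(ct2).digest()
```

This is why encrypting after dedupe (or at the tape drive) preserves the space savings, while encrypting before or during the backup stream destroys them.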

Another instance where dedupe becomes a wasted technology is with very short backup retention periods. Any retention period that does not cover at least two full backups means the data being examined against the lookup tables is always changing. This in turn means there's rarely a match for the same bit stream, and therefore very little to dedupe. Holding enough backup data requires more disk to begin with, but with deduplication that disk is at least used efficiently.

Deduplication is a technology that requires a better understanding of both the technology itself and how it will affect what's already deployed. There are several options for implementation, each with its pros and cons. Which of those pros and cons represent genuine benefits or risks for a particular backup environment needs to be determined before any kind of technology procurement.

Without foresight into how deduplication will be used in a given environment, it is no real mystery why that environment isn't seeing the astronomical benefits advertised by dedupe vendors.

About the author: Brian Sakovitch, senior consultant at GlassHouse Technologies (UK), has followed a six-year path in backup technologies, ranging from hands-on installation and implementation to design and theory. Three of those years were with GlassHouse US, where he focused predominantly on backup-related engagements for companies of all sizes. Prior to joining GlassHouse, Brian was a backup operations lead in a network operations centre. Brian has a B.S. in Computer Science from Rensselaer Polytechnic Institute in Troy, N.Y.
