How to avoid a data deduplication fiasco

If a storage administrator doesn't have a solid understanding of data deduplication, there's a good chance the deployment will be wasted effort.

According to the Gartner Hype Cycle for hardware storage technologies, data deduplication is beginning its descent into the trough of disillusionment. But was it ever really at the peak?

Coupled with virtual tape library (VTL) technology, deduplication was once billed as the answer to every backup capacity problem. Unfortunately, the market for deduplication hasn't taken off as expected, mainly due to two things:

  1. how its benefits have been communicated
  2. a lack of basic understanding of the technology itself

To successfully implement data deduplication, a storage administrator must understand the advantages as well as the disadvantages of the technology. There are three methods: host level, backup stream and post backup stream.
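All three methods rest on the same core mechanism: split the incoming data into chunks, fingerprint each chunk, and store only the chunks whose fingerprints haven't been seen before. The sketch below is a minimal illustration of that idea in Python; the fixed 4 KB chunk size and SHA-256 fingerprints are assumptions for clarity, not any vendor's actual implementation (real products typically use variable-size chunking).

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking for illustration; real products often chunk variably

def dedupe_store(stream: bytes, store: dict) -> list:
    """Split a byte stream into chunks, keep one copy of each unique chunk,
    and return the list of fingerprints needed to rebuild the stream."""
    recipe = []
    for i in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in store:   # lookup-table miss: genuinely new data
            store[fp] = chunk
        recipe.append(fp)     # hit or miss, the recipe stores only the reference
    return recipe

# Two "full backups" that differ in just one chunk
store = {}
full1 = b"A" * (4096 * 99) + b"B" * 4096
full2 = b"A" * (4096 * 99) + b"C" * 4096
dedupe_store(full1, store)
dedupe_store(full2, store)
print(len(store))  # 3 unique chunks stored instead of 200
```

The three methods differ only in where and when this loop runs: on the application server (host level), in the data path (backup stream), or against staging disk after the backup finishes (post backup stream).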

Host-level dedupe significantly reduces the amount of data pushed over the backup network, but it requires the dedupe processing to run on the servers themselves. Most application servers are configured with significant processing power, but that capacity exists to serve their applications, not to back them up.

Dedupe in the backup stream is the most prominent method of data reduction. Processing takes place at the dedupe agent after the data has traversed the backup network, which can significantly reduce throughput between the network and the storage destination, usually disk.

The third method is post backup stream. This places the least load on the backup environment, as it processes the data only after the backup is complete. On the other hand, a significant amount of primary disk is required to stage the entire backup before the data can be deduped to another storage destination. A 10 TB full backup, for example, needs at least 10 TB of staging disk before post-process dedupe can reclaim any space. In some cases, that's a lot of disk.

The type of data being moved to that disk, and the decision on how long it's kept, play major roles in deciding on support technology. With the increasing risk of data loss, regulatory fines and even corporate espionage, encryption is becoming a hot technology.

Unfortunately, all encryption technologies that process data before or during the backup stream are incompatible with dedupe. A data stream that is never the same twice because it has been scrambled can never be matched against a previous copy of the same data, and so cannot be deduped. Some VTL vendors are making forays into post-dedupe encryption, but as it stands the most mature solution is to encrypt at the tape drive level.
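The conflict is easy to demonstrate. The snippet below uses a toy XOR keystream as a stand-in for real encryption (it is not a secure cipher, just an illustration of the property that matters here): any sound cipher guarantees that the same plaintext encrypted under different keys produces different ciphertext, so the fingerprints a dedupe engine relies on never match.

```python
import hashlib
import os

def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Toy stand-in for encryption: XOR with a SHA-256-derived keystream.
    Illustrative only; the point is that identical plaintext yields
    different ciphertext under different keys."""
    keystream = bytearray()
    counter = 0
    while len(keystream) < len(data):
        keystream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, keystream))

plaintext = b"identical backup data" * 100

# Unencrypted: two copies of the same data share one fingerprint -> dedupe works
assert hashlib.sha256(plaintext).digest() == hashlib.sha256(plaintext).digest()

# Encrypted with a fresh key per backup session: fingerprints never match -> nothing dedupes
ct1 = toy_encrypt(plaintext, os.urandom(32))
ct2 = toy_encrypt(plaintext, os.urandom(32))
assert hashlib.sha256(ct1).digest() != hashlib.sha256(ct2).digest()
```

This is why encrypting after dedupe (or at the tape drive) preserves the space savings, while encrypting before or during the backup stream destroys them.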

Another instance where dedupe becomes a wasted technology is with very short backup retention periods. Any retention period that does not cover at least two full backups means the data being examined against the lookup tables is always changing. This in turn means there's rarely a match for the same bit stream, and therefore very little to dedupe. Holding enough backup data requires more disk to begin with, but with deduplication that disk is at least used efficiently.

Deduplication is a technology that requires a better understanding of both the technology itself and how it will affect what's already deployed. There are several options for implementation, each with its pros and cons. Which of those pros and cons represent genuine benefits or risks for a particular backup environment needs to be determined before any kind of technology procurement.

Without foresight into how deduplication will be used in a given environment, it is no real mystery why that environment isn't seeing the astronomical benefits advertised by dedupe vendors.

About the author: Brian Sakovitch, senior consultant at GlassHouse Technologies (UK), has followed a six-year path in backup technologies, ranging from hands-on installation and implementation to design and theory. Three of those years were with GlassHouse US, where he focused predominantly on backup-related engagements for companies of all sizes. Prior to joining GlassHouse, Brian was a backup operations lead in a network operations centre. Brian has a B.S. in Computer Science from Rensselaer Polytechnic Institute in Troy, N.Y.
