As the volume of data in an organization grows, duplicated data
consumes a growing share of the available storage capacity.
For example, a 10 MB PowerPoint presentation copied to 100 users
will require 1 GB of storage for the attachments on an Exchange
server. The problem gets worse when that 1 GB of duplicated storage
is backed up in full every week. After a year, that 1 GB of mostly
redundant data can consume 52 GB of tape or other backup storage.
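
The arithmetic behind this example can be checked with a short
sketch. It assumes decimal units (1 GB = 1,000 MB) and one full
backup per week that is retained for the whole year; the constant
and function names are illustrative only.

```python
# Worked arithmetic for the attachment example.
# Assumptions: decimal units (1 GB = 1000 MB), weekly full backups kept all year.
ATTACHMENT_MB = 10     # size of the presentation
RECIPIENTS = 100       # mailboxes that each hold a copy
WEEKLY_BACKUPS = 52    # full backups taken in one year


def storage_mb(copies: int, size_mb: int = ATTACHMENT_MB) -> int:
    """Space consumed by `copies` identical attachments."""
    return copies * size_mb


primary_mb = storage_mb(RECIPIENTS)              # space on the mail server
backup_mb = primary_mb * WEEKLY_BACKUPS          # space across a year of backups
single_instance_mb = storage_mb(1)               # space if only one copy is kept
redundant_mb = primary_mb - single_instance_mb   # space spent on repeats

print(f"primary storage:  {primary_mb / 1000:.2f} GB")
print(f"after 52 backups: {backup_mb / 1000:.2f} GB")
print(f"redundant share:  {redundant_mb} MB of {primary_mb} MB")
```

Strictly speaking, 990 MB of the 1 GB is redundant, since one 10 MB
copy has to be kept in any case.
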
Data deduplication technology has emerged to
combat the problem of redundant data. With data deduplication,
only one instance of a file, block or byte sequence is saved to the
actual storage media.
Data deduplication offers several benefits. It can achieve data
reduction ratios ranging from 10:1 to 50:1.
With less storage needed, storage costs fall, because fewer disks
are required and disk purchases are less frequent. Less data also
means smaller backups, which translates into shorter backup windows
and shorter recovery time objectives (RTOs). The smaller backups
also allow for longer retention times on virtual tape libraries
(VTLs) or archives.
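
To put the reduction ratios in more familiar terms, the sketch
below converts a ratio into the percentage of space saved. The
function names and the 10 TB figure are illustrative assumptions,
not figures from any vendor.

```python
def space_saved_percent(ratio: float) -> float:
    """Percentage of space saved for a reduction ratio of `ratio`:1."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return (1 - 1 / ratio) * 100


def stored_tb(logical_tb: float, ratio: float) -> float:
    """Physical capacity needed to hold `logical_tb` of backup data."""
    return logical_tb / ratio


for ratio in (10, 50):
    saved = space_saved_percent(ratio)
    physical = stored_tb(10, ratio)  # hypothetical 10 TB of backup data
    print(f"{ratio}:1 saves about {saved:.0f}% "
          f"(10 TB shrinks to {physical:.1f} TB)")
```

Going from 10:1 to 50:1 raises the savings from roughly 90% to
roughly 98%, so the higher ratios bring diminishing returns in
absolute capacity.
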
For deduplication to be effective, however, data must be retained
long enough for a comprehensive index to develop against which new
data can be compared. Deduplication offers little benefit for data
that is kept for only a week.
Deduplication essentials
Data deduplication (also called intelligent compression or
single-instance storage) scans data for redundant content. At the
simplest level, this means locating multiple copies of the same
file. But file-level deduplication only matches identical data, so
two files that differ by just a few bits will still be considered
different.
Today's data deduplication can go much deeper to locate
repeated instances of blocks or bytes, thereby yielding greater
storage savings. When the data is actually moved to a backup,
archive or replication platform, only the first instance of that
data is committed to disk. Subsequent instances are simply replaced
by a small stub that references the saved instance.
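
The difference between file-level and block-level matching can be
illustrated with a minimal sketch. It assumes fixed 4 KB blocks and
SHA-1 fingerprints; actual products may use other block sizes or
variable-length segments, and the helper names here are invented
for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary


def file_fingerprint(data: bytes) -> str:
    """Fingerprint of the whole file (file-level comparison)."""
    return hashlib.sha1(data).hexdigest()


def block_fingerprints(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """One fingerprint per fixed-size block (block-level comparison)."""
    return [hashlib.sha1(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


# A 40 KB "file" made of ten distinct blocks, and a copy with one byte changed.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
changed = bytearray(original)
changed[5000] ^= 0xFF          # alter a single byte inside the second block
edited = bytes(changed)

same_file = file_fingerprint(original) == file_fingerprint(edited)
orig_blocks = block_fingerprints(original)
edit_blocks = block_fingerprints(edited)
shared = sum(a == b for a, b in zip(orig_blocks, edit_blocks))

print("identical at file level:", same_file)
print(f"identical blocks: {shared} of {len(orig_blocks)}")
```

At the file level the two versions count as entirely different,
while at the block level only the one altered block would need to
be stored again.
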
Each piece of data to be deduplicated is processed using a "hash
algorithm" such as MD5 or SHA-1, or sometimes a combination of the
two. The hash algorithm returns a value that is, for practical
purposes, unique to each piece of data, and that hash is stored in
an index. When another piece of data is processed, its hash result
is compared to the results already in the index. If the current
result already exists in the index, that piece of data is treated
as a duplicate, so the new data is not saved. Instead, only a
"stub" that points to the existing data is inserted.
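
The index-and-stub process described here can be sketched in a few
lines. This is a toy in-memory model, not any vendor's
implementation: the `DedupStore` class and its method names are
hypothetical, SHA-1 is used only because the text mentions it, and
a production system might also compare the bytes themselves to
guard against hash collisions.

```python
import hashlib


class DedupStore:
    """Toy store: one saved copy per unique piece, stubs for the rest."""

    def __init__(self) -> None:
        self.index = {}   # hash -> the single saved copy of that piece
        self.stubs = []   # the logical stream, recorded as a list of hashes

    def write(self, piece: bytes) -> bool:
        """Record a piece; return True if it was new, False if a duplicate."""
        digest = hashlib.sha1(piece).hexdigest()
        is_new = digest not in self.index
        if is_new:
            self.index[digest] = piece   # first instance is committed
        self.stubs.append(digest)        # every instance gets a stub
        return is_new

    def read_all(self) -> bytes:
        """Rebuild the original stream by following the stubs."""
        return b"".join(self.index[d] for d in self.stubs)

    def stored_bytes(self) -> int:
        """Bytes held for unique pieces (stub overhead is ignored)."""
        return sum(len(p) for p in self.index.values())


store = DedupStore()
pieces = [b"alpha" * 100, b"beta" * 100, b"alpha" * 100, b"alpha" * 100]
results = [store.write(p) for p in pieces]

print("new piece?", results)
print("logical bytes:", sum(len(p) for p in pieces))
print("stored bytes: ", store.stored_bytes())
```

Reading the data back follows each stub to the saved copy, which is
why the index has to remain intact for as long as any stub refers
to it.
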
Deploying deduplication
Data deduplication can be implemented as hardware appliances or
software products. Either implementation can take on various forms,
as vendors try to differentiate themselves in this emerging
marketplace.
Deduplication can be performed in-band, deduplicating data while
it's being written to storage. It can also be performed out-of-band
as a separate or secondary process. The in-band process can be more
space-efficient, but the additional processing required at write
time can slow the backup and lengthen the backup window. The
out-of-band process won't slow the initial write, but it needs
slightly more disk space to hold data awaiting deduplication and
may cause some disk contention while deduplication runs. Storage
administrators should test several deduplication approaches to
determine how each works in their particular environment.
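
The tradeoff between the two approaches can be made concrete with a
simplified sketch. Both functions below end up with the same set of
unique pieces; they differ in when hashing happens and in how many
bytes are initially written. The function names and the sample
backup are assumptions made for illustration, and real products
handle staging and cleanup in their own ways.

```python
import hashlib


def digest(piece: bytes) -> str:
    return hashlib.sha1(piece).hexdigest()


def in_band(pieces):
    """Deduplicate on the write path; return (store, bytes written)."""
    store = {}
    for piece in pieces:
        # Hashing happens before the write, so it adds to backup time.
        store.setdefault(digest(piece), piece)
    written = sum(len(p) for p in store.values())
    return store, written


def out_of_band(pieces):
    """Write everything first, deduplicate later; return (store, bytes written)."""
    staging = list(pieces)                   # raw write, no hashing yet
    written = sum(len(p) for p in staging)   # every copy lands on disk
    store = {}
    for piece in staging:                    # secondary pass after the backup
        store.setdefault(digest(piece), piece)
    return store, written


backup = [b"A" * 1000, b"B" * 1000, b"A" * 1000, b"A" * 1000]
inline_store, inline_written = in_band(backup)
post_store, post_written = out_of_band(backup)

print("in-band bytes written:    ", inline_written)
print("out-of-band bytes written:", post_written)
print("same final contents:", inline_store == post_store)
```

The extra space in the out-of-band case is temporary: it can be
reclaimed once the secondary pass has finished.
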
Hardware-based implementations tend to be more expensive, but
typically perform better and are easier to deploy. Data Domain Inc.
offers a DD410 hardware appliance for branch offices and a DDX
series array. Quantum Corp. offers its DXi3500 and DXi550
appliances. When selecting a hardware appliance, be sure it's
compatible with your current backup software and that it can
support your current storage volume (e.g., up to 20
petabytes [PB]).
Deduplication is also built into several storage products,
including the ProtecTier VTL from Diligent Technologies Inc., the
network attached storage (NAS) backup appliance from ExaGrid
Systems Inc., the HydraStor grid backup appliance from NEC Corp. of
America, the NearStore R200 and FAS storage systems from Network
Appliance Inc. (NetApp) and the S2100-ES2 VTL from Sepaton Inc.
When deduplication is software-based, it is generally
performed at the backup server (the source) rather than the backup
target (the storage system). This eases network congestion between
the backup server and storage system and can be handy when backing
up across a WAN. EMC Corp.'s Avamar product and Symantec Corp.'s
NetBackup offer software-based deduplication. Software-based
deduplication is often less expensive than hardware appliances but
involves the use of agents on each system to be backed up, which
can increase management/maintenance overhead for IT.
When considering deploying deduplication, scalability should be
a concern. You should understand how storage performance changes as
the data deduplication system grows. For example, very large hash
index tables may hurt performance. All deduplication vendors are
taking steps to address scaling performance issues.
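
To see why index size matters, a rough estimate helps. The sketch
below assumes an average piece size of 8 KB, a 20-byte SHA-1 hash
and an 8-byte location pointer per index entry; all three figures
are assumptions chosen for illustration, and actual products will
differ.

```python
import hashlib

KIB, GIB, TIB = 2**10, 2**30, 2**40

CHUNK_BYTES = 8 * KIB                       # assumed average piece size
HASH_BYTES = hashlib.sha1().digest_size     # 20 bytes for SHA-1
POINTER_BYTES = 8                           # assumed size of a location pointer


def index_entries(unique_bytes: int, chunk_bytes: int = CHUNK_BYTES) -> int:
    """Number of index entries needed (one per unique piece, rounded up)."""
    return -(-unique_bytes // chunk_bytes)


def index_bytes(unique_bytes: int) -> int:
    """Approximate index size, ignoring any data-structure overhead."""
    return index_entries(unique_bytes) * (HASH_BYTES + POINTER_BYTES)


for tib in (1, 20):
    size = index_bytes(tib * TIB)
    print(f"{tib} TiB of unique data -> index of about {size / GIB:.1f} GiB")
```

Under these assumptions, 20 TiB of unique data needs an index of
about 70 GiB, which is more than many servers can hold in memory.
Lookups that spill to disk are one reason very large hash tables
can hurt performance.
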