Simply put, data deduplication is the process of eliminating
redundant bits in a storage system. But because the market is still
very much in its growth stage, the multitude of different approaches
from different vendors and their products can make investigating data
deduplication anything but simple.
Among the vendors there are two essential categories: those that
perform data deduplication "in-line" and those that perform it
"post-process." In-line data deduplication is performed as data
flows into the secondary storage system; post-process deduplication
is performed once data is already stored.
The advantage of in-line deduplication is that the process is
performed only once. Some in-line vendors argue that, at high enough
capacities, post-process deduplication can exceed backup
windows. The advantage of post-process deduplication, however, is
that there are no worries about the CPU-intensive deduplication
process creating a bottleneck between the backup server and the
secondary storage target.
In both cases, experts warn that users shouldn't be too cavalier
with disk purchases, especially not in the beginning. "A common
misunderstanding is that users will hear that they only need, say,
a terabyte to store 10 terabytes (TB) of backups," said W. Curtis
Preston, vice president of data protection services at GlassHouse
Technologies Inc. "Then they go out and buy a terabyte of disk,
only to realize that by definition they need 10 TB for the initial
backup," since it's only after that initial backup that bit-level
comparisons can be made.
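Preston's point can be put into rough numbers. The sketch below is illustrative arithmetic only: the function name, the ten weekly fulls and the 2% rate of new data per backup are assumptions chosen for the example, not figures from Preston or any vendor. It follows the quote's simplification that the first backup is stored in full.

```python
def disk_needed_tb(full_backup_tb, num_backups, unique_fraction):
    """Estimate disk consumed after num_backups full backups.

    Assumption for illustration: the first backup is stored in full, and
    each later backup adds only unique_fraction of its size as new data.
    """
    if num_backups < 1:
        return 0.0
    first = full_backup_tb
    later = (num_backups - 1) * full_backup_tb * unique_fraction
    return first + later


# Hypothetical scenario: ten full backups of a 10 TB data set.
logical_tb = 10 * 10
stored_tb = disk_needed_tb(10, 10, 0.02)
print(f"logical: {logical_tb} TB, stored: {stored_tb:.1f} TB, "
      f"ratio: {logical_tb / stored_tb:.1f}:1")
```

Under these assumed numbers the overall ratio looks attractive, but the disk on hand can never be smaller than the first 10 TB backup.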
The vendors
Beyond the in-line vs. post-process debate, there's no shortage
of differences -- and further debates to be had -- among different
vendors and their approaches to deduplication.
Data Domain Inc. has been shipping product the longest and
has the largest install base, at just over 750 customers. Its
appliances, which can be accessed through either a virtual tape
library (VTL) or network-attached storage (NAS) interface, range
from the branch office-sized DD410 model to the multipetabyte DDX
series array. Data Domain performs in-line deduplication and uses
the SHA-1 algorithm and a proprietary algorithm as a secondary
check. It keeps the comparison index cached in nonvolatile RAM.
With Data Domain, an individual data stream is limited to 110
megabytes per second (MBps). The company says it's working on
moving to a clustered architecture to aggregate performance, which
should be out next year.
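The general idea of hash-based in-line deduplication with a secondary check can be sketched in a few lines. This is a minimal illustration and not Data Domain's implementation: the fixed 4 KB chunk size, the class name and the in-memory dictionaries are assumptions. Because the vendor's secondary algorithm is proprietary, a plain byte-for-byte comparison stands in for it here.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size, chosen for the example


class InlineDedupStore:
    """Toy in-line deduplicating store: chunks are checked as they arrive."""

    def __init__(self):
        self.chunks = {}   # SHA-1 digest -> chunk bytes (each stored once)
        self.streams = {}  # stream name -> ordered list of digests

    def write(self, name, data):
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha1(chunk).digest()
            existing = self.chunks.get(digest)
            if existing is None:
                self.chunks[digest] = chunk  # first time this chunk is seen
            elif existing != chunk:
                # Secondary check (stand-in): same hash, different bytes.
                raise ValueError("hash collision detected")
            recipe.append(digest)
        self.streams[name] = recipe

    def read(self, name):
        return b"".join(self.chunks[d] for d in self.streams[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = InlineDedupStore()
backup = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE + b"A" * CHUNK_SIZE
store.write("monday", backup)
store.write("tuesday", backup)
print(store.stored_bytes())  # 8192: only two distinct chunks are kept
```

In a real appliance the digest index is the performance-critical piece, which is why where it is kept (nonvolatile RAM, disk cache) features so heavily in vendor claims.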
Diligent Technologies Corp. offers data deduplication
within its ProtecTier VTL product, which is also resold by
Hitachi Data Systems (HDS). Diligent performs in-line
deduplication by keeping the comparison index in cache on Fibre
Channel disk, which it claims makes the process go faster, but
could also get expensive. Also in contrast to Data Domain, Diligent
uses a proprietary hashing algorithm throughout its deduplication
process. Diligent claims better performance numbers than Data
Domain, at 400 MBps throughput. Diligent and Data Domain largely
target different market segments -- Diligent at the high end and
Data Domain in the midrange. Diligent claims 150 customers.
Avamar, founded in 1999, was picked up by EMC
Corp. last year for $165 million. It was the first data
deduplication company to be acquired by a major vendor. Avamar also
performs data deduplication in-band using SHA-1, but does so at the
source (the backup server), rather than at the backup target. It
uses a central management node to keep track of data for comparison
over the whole environment, but does the deduplication in small
chunks at each server before it's sent over the network to the
backup target. As such, Avamar's deduplication can also reduce
network congestion in addition to reducing data at the secondary
storage target. Avamar's deduplication product requires the
replacement of the backup environment. EMC has stated plans to
incorporate it into its Legato portfolio and its VTL by next
year.
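The network savings from source-side deduplication come from sending only the chunks the target has not already seen. The sketch below illustrates that general pattern and is not Avamar's protocol: the `BackupTarget` class, its `missing` and `put` methods, and the fixed chunk size are all hypothetical, and the target's digest lookup stands in for the central management node.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size


class BackupTarget:
    """Toy backup target that tracks which chunk digests it already holds."""

    def __init__(self):
        self.chunks = {}  # SHA-1 digest -> chunk bytes

    def missing(self, digests):
        # Report which of the offered digests are not yet stored.
        return {d for d in digests if d not in self.chunks}

    def put(self, digest, chunk):
        self.chunks[digest] = chunk


def source_side_backup(target, data):
    """Hash chunks at the source and send only unseen ones.

    Returns the number of payload bytes that crossed the "network".
    """
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    by_digest = {hashlib.sha1(c).digest(): c for c in chunks}
    sent = 0
    for digest in target.missing(by_digest.keys()):
        target.put(digest, by_digest[digest])
        sent += len(by_digest[digest])
    return sent


target = BackupTarget()
day1 = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
day2 = b"A" * CHUNK_SIZE + b"C" * CHUNK_SIZE
print(source_side_backup(target, day1))  # 8192: everything is new
print(source_side_backup(target, day2))  # 4096: only the changed chunk is sent
```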
ExaGrid Systems Inc.'s post-process data deduplication
comes as part of its NAS backup appliance. Unlike other data
deduplication products, ExaGrid does comparisons at the byte level
rather than the bit level, claiming this makes for simpler hash
tables and better scalability, and leaves less room for bit-level
fragmentation errors. ExaGrid's product is also "content aware,"
which means it understands the common data patterns in major backup
software products and can find duplicates accordingly.
FalconStor Software Corp.'s Single-Instance Repository
(SIR) feature on its VTL and IPStor product lines has yet to make a
full-fledged appearance on the market. The post-process product
uses the IPStor virtualization engine and the SHA-1 algorithm (with
a secondary check using the MD5 algorithm) to create a separate
deduplicated repository for long-term archive data after it is
backed up to the VTL. IBM and Sun Microsystems Inc.
both OEM the VTL product, though IBM does not offer SIR, and Sun
will not offer it until later this year.
Quantum Corp. folded intellectual property (IP) acquired with Advanced
Digital Information Corp. (ADIC) last year into the DXi3500
and DXi5500 appliances in December. The in-line VTL-based
deduplication product uses a patented algorithm belonging to ADIC
subsidiary RockSoft. That deduplication has also recently been
added as a feature within Quantum's StorNext filesystem, also from
the ADIC acquisition, which is billed as an all-in-one data
migration and management engine.
NEC Corp. of America, a subsidiary of NEC Corp., Japan,
offers data deduplication as a feature within its HydraStor grid
backup appliance, released in March. HydraStor's proprietary
deduplication technology, dubbed DataRedux, eliminates data
duplication at the subfile level across and within incoming data
streams. With HydraStor's grid architecture, controllers are added
as capacity is added and every node is aware of every other node,
easing performance and management issues sometimes associated with
in-line products. NEC claims it reduces required storage capacity by up to
75% without degrading performance.
Network Appliance Inc. (NetApp) announced general
availability of block-level data deduplication within its NearStore
R200 and FAS storage systems on May 15 after beta testing it in
customer environments for the first quarter of this year. The data
deduplication development is based on NetApp's Advanced Single
Instance Storage (A-SIS), from its SnapLock product. NetApp used a
feature of its Write Anywhere File Layout (WAFL) to add A-SIS to
its filers. WAFL already calculates a 16-bit checksum for each
block of data it stores. For data deduplication, those checksums are
pulled into a database and "redundancy candidates" that look
similar are identified. Those blocks are then compared bit by bit,
and if they are identical, the new block is discarded. The license
key is free for NearStore users, and the feature deduplicates data at the
block level on primary storage, which makes it unique among data
deduplication schemes. However, NetApp has yet to add the
capability to its VTL, citing performance concerns.
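The two-step approach described for A-SIS, a cheap checksum to find candidates followed by a full comparison to confirm them, can be sketched as follows. This is an illustration of the general technique only: `checksum16` is a hypothetical stand-in (a simple byte sum), not the checksum WAFL actually computes, and the in-memory lists are assumptions.

```python
from collections import defaultdict


def checksum16(block):
    """Hypothetical 16-bit checksum; a stand-in, not WAFL's algorithm."""
    return sum(block) & 0xFFFF


def dedupe_blocks(blocks):
    """Return (unique_blocks, refs), where refs[i] indexes unique_blocks.

    Blocks with equal checksums are only "redundancy candidates"; a full
    comparison decides whether the new block can be discarded.
    """
    candidates = defaultdict(list)  # checksum -> indexes into unique
    unique = []
    refs = []
    for block in blocks:
        key = checksum16(block)
        for idx in candidates[key]:
            if unique[idx] == block:  # full comparison confirms the match
                refs.append(idx)
                break
        else:
            # No confirmed match: keep the block and index it as a candidate.
            unique.append(block)
            candidates[key].append(len(unique) - 1)
            refs.append(len(unique) - 1)
    return unique, refs


# b"ab" and b"ba" share a checksum but differ, so both are kept.
unique, refs = dedupe_blocks([b"ab", b"ba", b"ab"])
print(unique, refs)
```

The full comparison matters because a 16-bit checksum has only 65,536 possible values, so unrelated blocks will frequently share one.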
Sepaton Inc. offers data deduplication on its S2100-ES2
VTL through a software option called DeltaStor. The post-process
deduplication uses a proprietary "content-aware" algorithm.
Sepaton's claim to fame so far in the data deduplication world is
the fact that it uses a process called forward referencing, while
other products use reverse referencing. Reverse referencing keeps
the original data and replaces any later occurrences with a pointer
to it; forward referencing writes the latest version of the
data and makes the previous occurrences a pointer to the most
recent version. Sepaton claims this method makes restores quicker
by keeping the most recent backups intact, since more recent
backups are the ones most likely to be restored as a general
rule.
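The difference between the two referencing schemes shows up in how many pointers must be followed to restore a given backup. The sketch below is a simplified model and not Sepaton's DeltaStor: it treats each backup as a single unit rather than working at the sub-file level, and the function names and tuple layout are assumptions made for the example.

```python
def reverse_reference(backups):
    """Keep the first occurrence; later duplicates point back to it."""
    layout = []
    first_seen = {}
    for i, data in enumerate(backups):
        if data in first_seen:
            layout.append(("pointer", first_seen[data]))
        else:
            first_seen[data] = i
            layout.append(("data", data))
    return layout


def forward_reference(backups):
    """Keep the latest occurrence; earlier duplicates point forward to it."""
    layout = []
    last_seen = {}
    for i, data in enumerate(backups):
        if data in last_seen:
            # Turn the previous copy into a pointer to the new one.
            layout[last_seen[data]] = ("pointer", i)
        last_seen[data] = i
        layout.append(("data", data))
    return layout


def restore(layout, i):
    """Return (data, hops): the content of backup i and pointers followed."""
    kind, value = layout[i]
    hops = 0
    while kind == "pointer":
        kind, value = layout[value]
        hops += 1
    return value, hops


nightly = [b"v1", b"v1", b"v1"]  # three identical backups
print(restore(reverse_reference(nightly), 2))  # latest needs one hop
print(restore(forward_reference(nightly), 2))  # latest is intact, no hops
```

In this model the trade-off is visible in both directions: forward referencing restores the newest backup without following any pointers, while the oldest backup becomes the one that has to chase them.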
Symantec Corp. has the product most comparable to Avamar: PureDisk,
a software feature it is currently integrating with
its NetBackup software. Like Avamar, the product uses a proprietary
algorithm to deduplicate data in-line and at the source. The latest
version of NetBackup, 6.2, supports PureDisk to tape targets and
integrates PureDisk into the Backup Reporter backup monitoring
tool. Version 6.2 also supports failover between multiple PureDisk
servers. The next big release for NetBackup, version 6.5, slated
for announcement in June, will offer even more integration between
NetBackup and PureDisk, according to early reports.