Data deduplication, also called intelligent compression or
single-instance storage, is a means of reducing the amount of data
that needs to be stored. The data deduplication process works by
eliminating redundant data and ensuring that only the first unique
instance of any data is actually retained. Subsequent iterations of
the data are replaced with a pointer to the original.
Data deduplication can operate at the file, block or bit level.
In file-level deduplication, if two files are exactly alike, one
copy of the file is stored and subsequent iterations receive
pointers to the saved file. However, file deduplication is not
highly efficient because the change of even a single bit results in
a totally different copy of the entire file being stored.
In block deduplication and bit deduplication, the software looks
within a file and saves unique iterations of each block. If a file
is updated, only the changed data is saved. This is a far more
efficient process than file-level deduplication. Block
deduplication and bit deduplication can achieve compression ratios
ranging from 10:1 to 50:1.
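The difference between file-level and block-level granularity can be
illustrated with a short sketch. It is only an illustration: the 4 KiB
fixed block size, the use of SHA-1 and the function names
file_fingerprint and block_fingerprints are assumptions made here, not
a description of any particular product.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4096  # assumed fixed block size, chosen for illustration only


def file_fingerprint(data: bytes) -> str:
    """Hash the whole file, as file-level deduplication would."""
    return hashlib.sha1(data).hexdigest()


def block_fingerprints(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each fixed-size block, as block-level deduplication would."""
    return [
        hashlib.sha1(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


original = bytes(range(256)) * 64      # 16 KiB of sample data, i.e. 4 blocks
edited = bytearray(original)
edited[5000] ^= 0x01                   # flip a single bit inside block 1
modified = bytes(edited)

# File level: the two versions no longer match, so the whole file is stored again.
files_match = file_fingerprint(original) == file_fingerprint(modified)

# Block level: only the block containing the flipped bit differs.
changed_blocks = [
    i
    for i, (old, new) in enumerate(
        zip(block_fingerprints(original), block_fingerprints(modified))
    )
    if old != new
]
```

In this sketch the one-bit change makes the file-level fingerprints
differ, while only one of the four block fingerprints changes, so a
block-level system would need to store just that block.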
Under the data deduplication hood
Each "chunk" of data (e.g., a file, block or bits) is processed
using a hash algorithm, such as MD5 or SHA-1, generating a number
that is, for practical purposes, unique to each piece. The resulting
hash number is then compared
to an index of other existing hash numbers. If that hash number is
already in the index, the data does not need to be stored again.
Otherwise, the new hash number is added to the index and the new
data is stored.
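The lookup described above can be sketched as a small in-memory store.
This is a minimal sketch under stated assumptions: the DedupStore
class, its method names and the tiny 4-byte chunk size are
hypothetical, and a real system would keep the index and the chunks on
disk rather than in a Python dictionary.

```python
import hashlib
from typing import Dict, List


class DedupStore:
    """Toy single-instance store: one copy per unique chunk, pointers for the rest."""

    def __init__(self) -> None:
        self.index: Dict[str, bytes] = {}  # hash number -> stored chunk

    def put(self, chunk: bytes) -> str:
        """Store the chunk only if its hash is not yet indexed; return the hash."""
        digest = hashlib.sha1(chunk).hexdigest()
        if digest not in self.index:
            self.index[digest] = chunk
        return digest

    def write(self, data: bytes, chunk_size: int = 4) -> List[str]:
        """Split data into fixed-size chunks and return one pointer per chunk."""
        return [
            self.put(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)
        ]

    def read(self, pointers: List[str]) -> bytes:
        """Rebuild the original data by following the pointers."""
        return b"".join(self.index[p] for p in pointers)


store = DedupStore()
data = b"AAAABBBBAAAACCCC"
pointers = store.write(data)
```

Here the 16 bytes of input produce four pointers but only three stored
chunks, because the repeated "AAAA" chunk is kept once and referenced
twice.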
The more granular a deduplication platform is, the larger an
index will become. For example, file-based deduplication may handle
an index of millions, or even tens of millions, of unique hash
numbers. Block-based deduplication will involve many more unique
pieces of data, often numbering into the billions. Such granular
deduplication demands more processing power to accommodate the
larger index. This can impair performance as the index scales
unless the hardware is designed to accommodate the index
properly.
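A rough calculation shows how quickly the index grows with finer
granularity. The figures below are assumptions picked for the
arithmetic (10 TiB of unique data, 8 KiB blocks, 20-byte SHA-1
digests); they count raw digests only and ignore pointers and other
per-entry overhead.

```python
TIB = 1024 ** 4
KIB = 1024
SHA1_DIGEST_BYTES = 20  # a SHA-1 digest is 160 bits


def index_entries(unique_bytes: int, chunk_size: int) -> int:
    """Number of index entries needed, rounding up for a partial last chunk."""
    return -(-unique_bytes // chunk_size)


def raw_index_bytes(entries: int, digest_bytes: int = SHA1_DIGEST_BYTES) -> int:
    """Space taken by the digests alone, without any per-entry overhead."""
    return entries * digest_bytes


entries = index_entries(10 * TIB, 8 * KIB)         # 1,342,177,280 entries
index_gib = raw_index_bytes(entries) / 1024 ** 3   # 25.0 GiB of digests
```

Under these assumptions the index already holds more than a billion
entries, which is why index lookups can become the bottleneck at block
granularity.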
In rare cases, the hash algorithm may produce the same hash
number for two different chunks of data. When such a
hash collision occurs, the system does not store the new data
because its hash number is already in the index. Such a "false positive"
can result in data loss. Some vendors combine hash algorithms to
reduce the possibility of a hash collision. Other vendors are
examining metadata to identify data, thereby preventing hash
collisions.
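The effect of a collision, and of combining hash algorithms, can be
demonstrated with a deliberately weak hash. Everything here is
illustrative: weak_hash is a toy function invented so that collisions
are easy to trigger, and combined_key and store_all are hypothetical
helpers, not any vendor's scheme.

```python
import hashlib
from typing import Callable, Dict, Hashable, List, Tuple


def weak_hash(chunk: bytes) -> int:
    """Toy hash (byte sum modulo 256) that collides easily, for demonstration."""
    return sum(chunk) % 256


def combined_key(chunk: bytes) -> Tuple[int, str]:
    """Combine two hash algorithms; both must collide for a false positive."""
    return (weak_hash(chunk), hashlib.sha1(chunk).hexdigest())


def store_all(
    chunks: List[bytes], key_fn: Callable[[bytes], Hashable]
) -> Dict[Hashable, bytes]:
    """Index chunks by key, skipping any chunk whose key is already present."""
    index: Dict[Hashable, bytes] = {}
    for chunk in chunks:
        index.setdefault(key_fn(chunk), chunk)
    return index


chunks = [b"ab", b"ba"]                           # different data, same byte sum
weak_index = store_all(chunks, weak_hash)         # b"ba" is silently dropped
combined_index = store_all(chunks, combined_key)  # both chunks are kept
```

With the weak hash alone, the second chunk is treated as a duplicate
and lost. With the combined key, the two chunks are distinguished
because the second algorithm disagrees.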
Other forms of data reduction
Data deduplication is typically used in conjunction with other
forms of data reduction, such as compression and delta
differencing. In data compression technology, which has
existed for about three decades, algorithms are applied to data in
order to simplify large or repetitious parts of a file.
Delta differencing reduces the total volume of stored data by
saving only the changes to a file since its initial backup. For
example, a file set may contain 200 GB of data, but if only 50 MB
of data has changed since the previous backup, then only that 50 MB
is saved. Delta differencing is frequently used in WAN-based
backups to make the most of available bandwidth in order to
minimize the backup window.
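Delta differencing can be sketched as a block-by-block comparison
against the previous backup. This is one simple way to do it, assuming
fixed-size blocks at fixed offsets; the function names changed_blocks
and apply_delta are hypothetical, and real products may detect changes
quite differently.

```python
from typing import Dict


def changed_blocks(previous: bytes, current: bytes, block_size: int) -> Dict[int, bytes]:
    """Return only the blocks of `current` that differ from `previous`, keyed by offset."""
    delta: Dict[int, bytes] = {}
    for offset in range(0, max(len(previous), len(current)), block_size):
        new_block = current[offset:offset + block_size]
        if new_block != previous[offset:offset + block_size]:
            delta[offset] = new_block
    return delta


def apply_delta(previous: bytes, delta: Dict[int, bytes], new_length: int) -> bytes:
    """Rebuild the current version from the previous backup plus the saved changes."""
    # Start from the previous data, truncated or zero-padded to the new length.
    rebuilt = bytearray(previous[:new_length].ljust(new_length, b"\0"))
    for offset, block in delta.items():
        rebuilt[offset:offset + len(block)] = block
    return bytes(rebuilt)


previous = b"aaaabbbbcccc"
current = b"aaaaXbbbcccc"
delta = changed_blocks(previous, current, block_size=4)
restored = apply_delta(previous, delta, len(current))
```

Only the one changed 4-byte block ends up in the delta, and that block
plus the previous backup is enough to restore the current version.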
Faster backups and recovery times
With data deduplication, at an effective compression ratio of
30:1, 300 GB could be stored on 10 GB of disk space. It's easy to
see how this can lead to big savings, since not only do fewer disks
need to be purchased, but disks also take longer to fill.
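The arithmetic behind the 30:1 example is a single division, shown
here as a small sketch; the function names physical_gb and space_saved
are made up for illustration.

```python
def physical_gb(logical_gb: float, ratio: float) -> float:
    """Disk space actually consumed at a given effective compression ratio."""
    return logical_gb / ratio


def space_saved(ratio: float) -> float:
    """Fraction of disk space saved at a given ratio (0.9 means 90 percent)."""
    return 1 - 1 / ratio


needed = physical_gb(300, 30)   # 10.0 GB of disk for 300 GB of backups
saved = space_saved(30)         # roughly 0.967, i.e. about 96.7 percent
```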
Data deduplication also has collateral benefits. Less data can
be backed up faster, resulting in smaller backup windows, smaller
(more recent) recovery point objectives (RPOs) and shorter recovery
time objectives (RTOs). Disk archive platforms are able to store
considerably more files. If tape is the ultimate backup target,
smaller backups also use fewer tapes, resulting in lower media
costs and fewer tape library slots being used.
For a virtual tape library (VTL), the reduction in disk space
requirements translates into longer retention periods for backups
within the VTL itself. For example, an ordinary VTL might save
backups for 30 days, then offload the oldest backup to tape in
order to free up disk space for subsequent backups. Because data
deduplication can dramatically expand the effective disk space,
a VTL might be able to hold two years' worth of backups,
significantly reducing dependence on tape systems.
Data deduplication also speeds up remote backup, replication and
disaster recovery processes. Data transfers are accomplished
sooner, freeing the network for other tasks, allowing additional
data to be transferred or reducing costs through the use of slower,
less-expensive WANs.