How data deduplication works

Chapter one of our data deduplication handbook takes you under the hood to show how data deduplication works so you can maximise the efficiency of this data reduction technology.

Data deduplication, also called intelligent compression or single-instance storage, is a means of reducing the amount of data that needs to be stored. The data deduplication process works by eliminating redundant data and ensuring that only the first unique instance of any data is actually retained. Subsequent iterations of the data are replaced with a pointer to the original.

Data deduplication can operate at the file, block or bit level. In file-level deduplication, if two files are exactly alike, one copy of the file is stored and subsequent iterations receive pointers to the saved file. However, file deduplication is not highly efficient because the change of even a single bit results in a totally different copy of the entire file being stored.

In block deduplication and bit deduplication, the software looks within a file and saves unique iterations of each block. If a file is updated, only the changed data is saved. This is a far more efficient process than file-level deduplication. Block deduplication and bit deduplication can achieve compression ratios ranging from 10:1 to 50:1.
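To make the block-level idea concrete, here is a minimal Python sketch (not any vendor's implementation) that splits data into fixed-size 4 KB blocks, hashes each one and reports how many blocks are actually unique. The block size, hash choice and sample data are illustrative assumptions.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this example

def dedupe_ratio(data: bytes) -> float:
    """Return total blocks divided by unique blocks for the given data."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    unique = {hashlib.sha1(b).hexdigest() for b in blocks}
    return len(blocks) / len(unique) if unique else 1.0

# Highly repetitive data dedupes well; random-looking data does not.
repetitive = b"A" * BLOCK_SIZE * 100          # 100 identical blocks
print(f"{dedupe_ratio(repetitive):.0f}:1")    # prints "100:1"
```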

Under the data deduplication hood

Each "chunk" of data (e.g., a file, block or bits) is processed using a hash algorithm, such as MD5 or SHA-1, generating a unique number for each piece. The resulting hash number is then compared to an index of other existing hash numbers. If that hash number is already in the index, the data does not need to be stored again. Otherwise, the new hash number is added to the index and the new data is stored.

The more granular a deduplication platform is, the larger its index becomes. For example, file-based deduplication may handle an index of millions, or even tens of millions, of unique hash numbers. Block-based deduplication involves many more unique pieces of data, often numbering into the billions. Such granular deduplication demands more processing power, and performance can suffer as the index scales unless the hardware is sized to handle it.

In rare cases, the hash algorithm may produce the same hash number for two different chunks of data. When such a hash collision occurs, the system fails to store the new data because it sees that hash number already. Such a "false positive" can result in data loss. Some vendors combine hash algorithms to reduce the possibility of a hash collision. Other vendors are examining metadata to identify data, thereby preventing hash collisions.
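As a rough illustration of the "combine hash algorithms" approach, the sketch below keys the index on an MD5 and a SHA-1 digest concatenated together; a chunk would have to collide on both algorithms at once, which is vanishingly unlikely. The function name is illustrative and not taken from any product.

```python
import hashlib

def combined_fingerprint(chunk: bytes) -> str:
    """Key the deduplication index on two independent digests instead of one."""
    return hashlib.md5(chunk).hexdigest() + hashlib.sha1(chunk).hexdigest()
```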

Other forms of data reduction

Data deduplication is typically used in conjunction with other forms of data reduction, such as compression and delta differencing. Data compression, which has existed for about three decades, applies algorithms that encode large or repetitious parts of a file more compactly.
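Conventional compression can be seen directly with Python's built-in zlib module; the repetitious sample data below is purely illustrative.

```python
import zlib

original = b"backup window backup window " * 1000   # deliberately repetitious
compressed = zlib.compress(original)
print(f"{len(original)} bytes -> {len(compressed)} bytes")
assert zlib.decompress(compressed) == original       # compression is lossless
```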

Delta differencing reduces the total volume of stored data by saving only the changes to a file since its initial backup. For example, a file set may contain 200 GB of data, but if only 50 MB of data has changed since the previous backup, then only that 50 MB is saved. Delta differencing is frequently used in WAN-based backups to make the most of available bandwidth and minimise the backup window.
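A simplified view of delta differencing, again assuming fixed-size blocks and illustrative function names: compare the current file's per-block hashes with those recorded at the last backup and keep only the blocks that changed.

```python
import hashlib

BLOCK_SIZE = 4096

def block_hashes(data: bytes) -> list[str]:
    """Hash each fixed-size block of the data."""
    return [hashlib.sha1(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

def delta(previous: bytes, current: bytes) -> dict[int, bytes]:
    """Return only the blocks of `current` that differ from the previous backup."""
    old = block_hashes(previous)
    changed = {}
    for i, digest in enumerate(block_hashes(current)):
        if i >= len(old) or digest != old[i]:
            changed[i] = current[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
    return changed
```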

Faster backups and recovery times

With data deduplication, at an effective compression ratio of 30:1, 300 GB could be stored on 10 GB of disk space. It's easy to see how this can lead to big savings, since not only do fewer disks need to be purchased, but disks also take longer to fill.

Data deduplication also has collateral benefits. Less data can be backed up faster, resulting in smaller backup windows, smaller (more recent) recovery point objectives (RPOs) and faster recovery time objectives (RTOs). Disk archive platforms are able to store considerably more files. If tape is the ultimate backup target, smaller backups also use fewer tapes, resulting in lower media costs and fewer tape library slots being used.

For a virtual tape library (VTL), the reduction in disk space requirements translates into longer retention periods for backups within the VTL itself. For example, an ordinary VTL might save backups for 30 days, then offload the oldest backup to tape in order to free up disk space for subsequent backups. Because data deduplication dramatically expands the effective disk space, a VTL might be able to hold two years' worth of backups, significantly reducing dependence on tape systems.

Data deduplication also speeds up remote backup, replication and disaster recovery processes. Data transfers are accomplished sooner, freeing the network for other tasks, allowing additional data to be transferred or reducing costs through the use of slower, less-expensive WANs.
