Data deduplication is now commonplace in backup, where it is a standard feature of backup software products and disk-based backup devices.
But data deduplication – or more commonly single instance storage – can also be used for primary data, that is, data in use for production purposes. Questions arise, however, over whether and when to use data deduplication in a primary data environment.
Data deduplication and single instance storage work by comparing data and discarding duplicate elements. Only the first instance is kept, with pointers replacing subsequent copies.
Data deduplication products and features operate at different levels. Data can be compared with granularity that ranges from file level – this is what is usually called single instance storage – right down to sub-file block level. The deduplication process can take place inline, as data is being written, or post-process following its ingestion.
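As a rough illustration of the mechanism, the Python sketch below keeps the first instance of each piece of data and records pointers for later copies, at either file level or block level. The `DedupStore` class, its method names, the use of SHA-256 and the 4 KB block size are assumptions made for the example, not a description of any supplier's implementation.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this example


class DedupStore:
    """Toy in-memory store: unique chunks are kept once, files are lists of pointers."""

    def __init__(self, block_size=None):
        # block_size=None means file-level comparison (single instance storage).
        self.block_size = block_size
        self.chunks = {}  # hash -> bytes (first instance only)
        self.files = {}   # file name -> list of hashes (the pointers)

    def _split(self, data):
        if self.block_size is None:
            return [data]
        return [data[i:i + self.block_size]
                for i in range(0, len(data), self.block_size)] or [b""]

    def write(self, name, data):
        pointers = []
        for chunk in self._split(data):
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # later copies are discarded
            pointers.append(digest)
        self.files[name] = pointers

    def read(self, name):
        # Rebuild the file by following its pointers.
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


# Two documents that share a header block, plus an exact copy of the first.
header = b"H" * BLOCK_SIZE
doc_a = header + b"A" * BLOCK_SIZE
doc_b = header + b"B" * BLOCK_SIZE

file_level = DedupStore()
block_level = DedupStore(block_size=BLOCK_SIZE)
for store in (file_level, block_level):
    store.write("a.doc", doc_a)
    store.write("a_copy.doc", doc_a)
    store.write("b.doc", doc_b)

print(file_level.stored_bytes(), block_level.stored_bytes())
```

In this toy case the file-level store eliminates only the exact copy, so it holds two full files, while the block-level store also shares the common header and holds just three unique blocks.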
The impact of deduplication on performance
One drawback to the use of deduplication with primary data is the impact the processing has on performance, especially if it takes place inline. Post-process deduplication can avoid that performance hit, as long as there are opportunities for the process to run when the data is not in demand.
Performance always matters, of course, but how much it matters in a primary data environment depends on a variety of factors. For many applications a slight performance hit may be of little importance. With Windows shares, for example, a few tens of milliseconds of latency make little difference. Deduplication may also confer benefits down the line by reducing the volume of data to be backed up.
In contrast, for a database server performing thousands of financial transactions per second, each millisecond lost to the deduplication process is added latency and potentially lost business.
There are other issues. If you encrypt primary data, you will gain little by deduplicating it at sub-file level because the data is effectively randomised, with few duplicate blocks remaining. Also, transient data written to primary storage for only a short time – such as message queue data or temporary log files – should be excluded, as it is usually not worth processing short-term data.
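The effect of encryption on block-level deduplication can be illustrated with a small sketch. It uses pseudo-random bytes as a stand-in for ciphertext (an assumption: no real encryption is performed), and the `unique_block_ratio` helper and 4 KB block size are hypothetical choices made for the example.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # assumed block size for this example


def unique_block_ratio(data, block_size=BLOCK_SIZE):
    """Fraction of fixed-size blocks that are unique (1.0 means nothing to deduplicate)."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    unique = {hashlib.sha256(b).digest() for b in blocks}
    return len(unique) / len(blocks)


# Highly repetitive plaintext: 64 blocks, but only two distinct block contents.
plaintext = (b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE) * 32

# Stand-in for the encrypted form: same length, pseudo-random bytes.
# This is NOT encryption; it only mimics the randomised look of ciphertext.
rng = random.Random(0)
ciphertext_like = bytes(rng.getrandbits(8) for _ in range(len(plaintext)))

plain_ratio = unique_block_ratio(plaintext)
cipher_ratio = unique_block_ratio(ciphertext_like)
print(plain_ratio, cipher_ratio)
```

The repetitive plaintext collapses to two unique blocks out of 64, whereas the randomised data leaves essentially every block unique, so a block-level process has nothing to remove.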
So, deduplication of primary data will usually result in lower capacity savings than deduplicating backup and archive data, because there’ll be fewer duplicate files and blocks. There’s also often going to be a performance hit. For these reasons, few large enterprises deduplicate primary data, said ESG analyst Mark Peters.
“The market for primary deduplication really starts at the mid-market where cost matters more than performance,” said Peters.
“And tier one storage doesn’t have it yet, with the exception of NetApp, which some might argue isn't tier one.”
Peters added that approaches to deduplication are driven as much by business as by technology.
"While suppliers will tell you about a ‘lack of customer pressure’, let’s not be blind to the fact that deduplication means a need for less actual disk space – which is not great news if a large chunk of your revenues and margins come from selling disk," he said.
Primary data deduplication in storage products
On mainstream suppliers' primary storage, data deduplication is mostly a feature of midrange NAS products and is often single instance storage.
Here is a rundown of what is available and the approaches to primary deduplication taken by array suppliers.
Dell's PowerVault series of NAS devices uses Windows Storage Server's single instance storage capability to deduplicate primary data. Operating at file level, the post-process technology scans storage volumes for duplicated files. When the service finds a duplicate, it copies the file into a central folder and replaces the duplicate with a link to the central copy.
EMC's deduplication is post-process single instance storage that operates on flash drives as well as spinning media. It runs on all midrange VNX unified storage platform devices except the entry-level VNX5100. The process compresses a file, gives it a hash value, then compares it to existing files. If a match is found, one of the copies is deleted.
HP, like Dell, offers single instance storage via Windows Storage Server on its ProLiant NAS storage servers.
IBM offers block-level, post-process deduplication as a no-cost option for its N-series midrange NAS systems. It works using a 4K data block hashing algorithm to filter out potential duplicates, then performs a byte-by-byte comparison to guarantee a perfect match before the duplicate block is deleted.
NetApp's FAS series of filers uses post-process, block-level deduplication that can be invoked automatically or scheduled through the command line interface or via the company's management GUI. It works by storing the hash signatures of each 4K block in a database, detecting identical signatures, then comparing that pair of blocks byte-by-byte. If identical, the duplicate is deleted and a change log entry is created.
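The hash-then-verify approach described for the IBM and NetApp systems can be sketched roughly as follows. This is an illustrative sketch only, not either supplier's code: the `post_process_dedup` function, the in-memory list standing in for a volume, the choice of SHA-256 as the signature and the shape of the change log are all assumptions.

```python
import hashlib

BLOCK_SIZE = 4096  # 4K blocks, as in the descriptions above


def post_process_dedup(blocks):
    """Scan already-written blocks and map each one to a canonical block index.

    blocks: list of bytes objects, each standing in for one on-disk block.
    Returns (block_map, change_log), where block_map[i] is the index of the
    block that holds block i's data after deduplication.
    """
    signatures = {}  # signature -> indexes of kept blocks with that signature
    block_map = []
    change_log = []

    for index, block in enumerate(blocks):
        sig = hashlib.sha256(block).digest()
        match = None
        for candidate in signatures.get(sig, []):
            # A matching signature only flags a potential duplicate; the
            # byte-by-byte comparison confirms it before anything is removed.
            if blocks[candidate] == block:
                match = candidate
                break
        if match is None:
            signatures.setdefault(sig, []).append(index)
            block_map.append(index)
        else:
            block_map.append(match)
            change_log.append((index, match))  # duplicate replaced by a pointer
    return block_map, change_log


# Hypothetical volume of four blocks, three of them identical.
volume = [b"x" * BLOCK_SIZE, b"y" * BLOCK_SIZE, b"x" * BLOCK_SIZE, b"x" * BLOCK_SIZE]
block_map, change_log = post_process_dedup(volume)
print(block_map)
print(change_log)
```

Because the pass runs over blocks that have already been written, it can be scheduled for quiet periods rather than adding latency to each write.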
Start-ups build in primary deduplication
In contrast to the mainstream suppliers, storage start-ups see primary deduplication as an opportunity, and what they offer is usually sub-file deduplication rather than single instance storage.
Analyst Greg Schulz of StorageIO said: “Start-ups generally have the advantage with clean sheet approaches, in that they don’t have the backwards compatibility or installed base to be concerned about.”
Data deduplication has been incorporated by start-ups into new architectures that use it explicitly for primary data use cases.
Tegile and Nimble, for example, sell appliances that contain mixed flash and spinning disk drives. They use data deduplication to store the bulk of less frequently used data in a deduplicated state on cost-effective SATA drives and rehydrate it when it is moved to fast flash storage as hot data. This approach allows such suppliers to claim big savings in disk capacity use.
Meanwhile, FreeNAS and Nexenta's open source storage operating systems are built on ZFS and use its inline, block-level, hash signature-based data deduplication, with the option to additionally verify matching blocks.
NexGen uses a different approach to deduplication. It uses metadata, not hashing, to locate duplicate data, then deduplicates only when resources become available. This approach is more capacity-efficient, according to the company.
Pure flash array suppliers are also implementing deduplication. Nimbus, for example, reduces capacity usage with inline deduplication allied to thin provisioning. SolidFire and Pure Storage offer similar benefits but add compression.