Disk space reduction is a key consideration for many organisations that want to reduce storage costs. With this aim in mind, data deduplication has been deployed widely on secondary systems, such as those used for data backup.
Primary storage deduplication, however, has yet to reach this level of adoption. In this article we will discuss how primary storage deduplication works, what it can achieve and how we expect to see it increase in prevalence as a core feature of storage arrays.
Data deduplication is the process of identifying and removing identical pieces of information in a batch of data. Compression removes redundant data to reduce the size of a file but doesn’t do anything to cut the number of files it encounters. Data deduplication, meanwhile, takes a broader view, comparing files or blocks within files across a much larger data set and removing redundancies from that.
So, rather than store two copies of the same piece of data, a deduplicating array retains metadata and pointers that indicate which further instances of the data map to the single instance already held. In cases such as backup operations, where the same static data may be backed up repeatedly, deduplication can reduce physical storage consumption by ratios as high as 10-to-1 or 20-to-1 (equivalent to savings of 90% and 95% of disk space, respectively).
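The single-instance idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions rather than any vendor's implementation: the `DedupStore` class, the fixed 4KB block size and the use of SHA-256 as the fingerprint are all choices made for this example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


class DedupStore:
    """Toy single-instance block store: one physical copy per unique block."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes (the single stored instance)
        self.files = {}   # file name -> ordered list of fingerprints (the pointers)

    def write(self, name, data):
        pointers = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).hexdigest()
            # Keep the block only if this content has not been stored before.
            self.blocks.setdefault(fingerprint, block)
            pointers.append(fingerprint)
        self.files[name] = pointers

    def read(self, name):
        # Follow the pointers to rebuild the original data.
        return b"".join(self.blocks[fp] for fp in self.files[name])

    def logical_bytes(self):
        # Capacity the files would use without deduplication.
        return sum(len(self.blocks[fp]) for ptrs in self.files.values() for fp in ptrs)

    def physical_bytes(self):
        # Capacity actually held as unique blocks (metadata not counted).
        return sum(len(block) for block in self.blocks.values())


store = DedupStore()
payload = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE
store.write("monday.bak", payload)
store.write("tuesday.bak", payload)  # identical backup: no new blocks stored
print(store.logical_bytes(), store.physical_bytes())
```

Backing up the same two blocks twice leaves four blocks of logical data held as two physical blocks, a 2-to-1 ratio. A real array would also have to account for the space taken by the metadata itself, which this sketch ignores.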
Clearly, the potential savings in physical storage are significant. If primary storage utilisation could be reduced by up to 90%, this would represent huge savings for organisations that deploy large numbers of storage arrays.
Unfortunately, reality is not that straightforward. The use case for deduplicated data fits well with backup but not always so well with primary storage. Compared with large backup streams, the working data sets in primary storage are much smaller and contain far fewer redundancies. Consequently, ratios for primary storage deduplication can be as low as 2-to-1, depending on the type of data the algorithm gets to work on.
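The relationship between a deduplication ratio and the space it saves is simple arithmetic: a ratio of r-to-1 means 1/r of the original capacity is still needed, so the saving is 1 - 1/r. The helper below is a hypothetical function written for this article; it reproduces the figures quoted above.

```python
def space_saving(ratio):
    """Fraction of capacity saved at a deduplication ratio of ratio-to-1."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return 1 - 1 / ratio


for ratio in (2, 10, 20):
    print(f"{ratio}-to-1 -> {space_saving(ratio):.0%} saved")
# 2-to-1 -> 50% saved
# 10-to-1 -> 90% saved
# 20-to-1 -> 95% saved
```

Note how quickly the returns diminish: moving from 2-to-1 to 10-to-1 saves a further 40% of capacity, whereas moving from 10-to-1 to 20-to-1 saves only another 5%.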
Having said that, as more organisations turn towards server and desktop virtual infrastructures, the benefits of primary storage deduplication reappear. Virtual servers and desktops are typically cloned from a small number of master images and a workgroup will often run from a relatively small set of spreadsheets and Word documents, resulting in highly efficient deduplication opportunities that can bring ratios of up to 100-to-1.
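To see why cloned images deduplicate so well, consider a rough model in which every clone shares the master image's blocks and adds a few blocks of its own. The function name and the block counts below are hypothetical inputs chosen for illustration, not measurements from any product.

```python
def clone_dedup_ratio(clones, shared_blocks, unique_blocks_per_clone):
    """Deduplication ratio for clones that share one master image."""
    # Without deduplication, every clone stores the full image plus its own changes.
    logical = clones * (shared_blocks + unique_blocks_per_clone)
    # With deduplication, the shared image is stored once.
    physical = shared_blocks + clones * unique_blocks_per_clone
    return logical / physical


print(clone_dedup_ratio(100, 1000, 0))   # 100.0
print(clone_dedup_ratio(100, 1000, 10))  # 50.5
```

In this model, 100 identical clones give exactly 100-to-1, and the ratio falls as each desktop accumulates unique data, which is why the achievable figure depends heavily on how far the clones diverge from their master image.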
The deduplication saving can even be used to justify the use of solid-state drives (SSDs), where their raw cost would previously have been unjustifiable, a subject we will discuss below.
Pros and cons of primary storage data deduplication
Of course, primary storage deduplication is no panacea for solving storage growth issues, and there are some disadvantages alongside the obvious capacity and cost savings that can be achieved.
Before diving into a more technical description, it is worth discussing the two key data deduplication techniques in use by vendors today. Identification of duplicate data can be achieved either inline in real time, or asynchronously at a later time, known as post-processing.
Inline deduplication requires more resources and can suffer from latency, because each piece of data is checked against metadata before it is either committed to disk or flagged as a duplicate. Increases in CPU processing power help to mitigate this issue. With efficient search algorithms, performance can actually improve when a large proportion of the incoming data is duplicated, because that data does not need to be written to disk and only the metadata has to be updated.
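A minimal sketch of the inline write path shows where both the latency and the saving come from. The function `inline_write`, the in-memory `index` dictionary and the list standing in for the disk are assumptions made for this example; real arrays use far more elaborate metadata structures.

```python
import hashlib


def inline_write(blocks, index, disk):
    """Write blocks through an inline deduplication check.

    index maps fingerprint -> position on disk; disk is a list of stored blocks.
    Returns the pointer list for the data and the number of disk writes avoided.
    """
    pointers, avoided = [], 0
    for block in blocks:
        fingerprint = hashlib.sha256(block).digest()
        location = index.get(fingerprint)  # metadata lookup precedes any disk write
        if location is None:
            disk.append(block)             # unique data: commit to disk
            location = len(disk) - 1
            index[fingerprint] = location
        else:
            avoided += 1                   # duplicate: only metadata is updated
        pointers.append(location)
    return pointers, avoided


index, disk = {}, []
pointers, avoided = inline_write([b"a", b"b", b"a", b"a"], index, disk)
print(pointers, avoided, len(disk))  # [0, 1, 0, 0] 2 2
```

Every block pays for a fingerprint calculation and an index lookup, which is the latency cost, but two of the four writes never reach the disk.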
Post-processing data deduplication requires a certain amount of storage to be used as an overhead until the deduplication process can be executed and the duplicates removed. In environments with high data growth rates, this overhead starts to cut into the potential savings.
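The overhead can be illustrated with a sketch of a post-processing pass. Everything here is hypothetical: `post_process` is a function written for this article, and the `landed` list stands in for blocks that have already been written to disk in full.

```python
import hashlib


def post_process(disk):
    """Deduplicate a list of blocks that have already been written.

    Returns (unique_blocks, pointers), where pointers[i] is the index in
    unique_blocks that holds the content originally written at disk[i].
    """
    seen, unique_blocks, pointers = {}, [], []
    for block in disk:
        fingerprint = hashlib.sha256(block).digest()
        if fingerprint not in seen:
            seen[fingerprint] = len(unique_blocks)
            unique_blocks.append(block)
        pointers.append(seen[fingerprint])
    return unique_blocks, pointers


landed = [b"os-image", b"os-image", b"os-image", b"user-data"]
peak_blocks = len(landed)  # capacity consumed before the pass runs
unique_blocks, pointers = post_process(landed)
print(peak_blocks, len(unique_blocks), pointers)  # 4 2 [0, 0, 0, 1]
```

The array must have room for all four blocks until the pass completes, even though only two survive it. The faster new data arrives between passes, the larger that temporary reserve has to be.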
For both implementations, deduplicated data produces random I/O for read requests, which can be an issue for some storage arrays. Storage array vendors have spent many years optimising their products to make use of sequential I/O and prefetch.
Deduplication can work counter to this because, over time, it pulls apart the “natural” sequence of blocks found in unreduced data, making gaps here, placing pointers there and spreading parts of a file across many spindles. Users can deal with this issue by adding flash as a top tier for working data, which provides access rapid enough to cope with the kind of randomisation that is a problem for spinning disk. Some of the vendors mentioned below, the SSD startups, have seen the boost that flash can give to primary data deduplication and designed it into their product architectures from the start.
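The effect on read patterns can be shown with a deliberately simplified model. The functions `layout` and `non_sequential_jumps` are hypothetical, block contents are used directly as index keys instead of fingerprints, and physical addresses are simply handed out in order of first appearance.

```python
def layout(blocks, index):
    """Return physical addresses for blocks, allocating only for unseen content."""
    addresses = []
    for block in blocks:
        if block not in index:
            index[block] = len(index)  # next free physical address
        addresses.append(index[block])
    return addresses


def non_sequential_jumps(addresses):
    """Count reads that do not follow on from the previous physical address."""
    return sum(1 for a, b in zip(addresses, addresses[1:]) if b != a + 1)


index = {}
file_a = layout(["a1", "a2", "a3", "a4"], index)
file_b = layout(["b1", "a2", "b2", "a4"], index)  # shares two blocks with file A
print(file_a, non_sequential_jumps(file_a))  # [0, 1, 2, 3] 0
print(file_b, non_sequential_jumps(file_b))  # [4, 1, 5, 3] 3
```

File A reads back sequentially, but file B, which reuses two of A's blocks, requires a jump on every read. On spinning disk each jump is a potential seek; on flash the penalty largely disappears.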
Now let’s discuss how vendors have implemented deduplication technology in their primary storage systems.
- NetApp. NetApp was the first vendor to offer primary storage deduplication in its filer products five years ago, in May 2007. Originally called A-SIS (Advanced Single-Instance Storage), the feature performs post-processing deduplication on NetApp volumes. Many restrictions were imposed on volumes configured with A-SIS because, as volume sizes increased, the effort required to find and eliminate duplicate blocks could have a significant impact on performance. These restrictions have been eased as newer filers have been released with faster hardware. A-SIS is a free add-on feature and has helped drive NetApp's success in the virtualisation market.
- EMC. Although EMC has had deduplication in its backup products for some time, the company’s only array platform that currently offers primary storage deduplication is the VNX. This capability is restricted to file-based deduplication and is inherited from the part of the product that was the old Celerra. EMC has talked about block-level primary storage deduplication for some time, and we expect to see that in a future release.
- Dell. In July 2010 Dell acquired Ocarina Networks. Ocarina offered a standalone deduplication appliance that sat in front of traditional storage arrays to provide inline deduplication functionality. Since acquisition, Dell has integrated Ocarina technology into the DR4000 for disk-to-disk backup and the DX6000G Storage Compression Node, providing deduplication functionality for object data. Dell is rumoured to be working on deploying primary storage deduplication within its Compellent products.
- Oracle and vendors that support ZFS. As the owner of ZFS, Oracle has had the ability to use data deduplication in its storage products since 2009. The Sun ZFS Storage Appliance supports inline deduplication and compression. The deduplication feature also appears in software from vendors that use ZFS within their storage platforms. These include Nexenta Systems, which incorporated data deduplication into NexentaStor 3.0 in 2010, and GreenBytes, a startup specialising in SSD-based storage arrays that also makes use of ZFS for inline data deduplication.
- SSD array startups. SSD-based arrays are well suited to coping with the impacts of deduplication, including the random I/O workloads already discussed. SSD array startups Pure Storage, Nimbus Data Systems and SolidFire all support inline primary data deduplication as a standard feature. In fact, on most of these platforms, deduplication cannot be disabled and is integral to these products.
- Vendors targeting virtualisation. Tintri and NexGen Storage both offer arrays optimised specifically for virtualisation environments, and both utilise data deduplication. NexGen has taken a different approach from some of the other recent startups and implements post-processing deduplication with its Phased Data Reduction feature.
Primary storage data deduplication offers the ability to reduce storage utilisation significantly for certain use cases and has specific benefits for virtual server and desktop environments. The major storage vendors have struggled to implement deduplication into their flagship products -- NetApp is the only obvious exception to this -- perhaps because it reduces their ability to maximise disk sales.
However, new storage startups, especially those whose arrays are all-SSD or rely heavily on SSD, have used the performance of flash to make data deduplication practical, and have in turn used deduplication to justify the much higher raw storage cost of their devices.
So, we can say primary storage deduplication is here to stay, albeit largely as a result of its incorporation into new forms of storage array.