Primary deduplication brings pros and cons to storage arrays

Primary deduplication brings reductions in disk capacity and can boost storage performance, but watch out for impacts such as higher I/O demands and data protection risks.

In recent years data deduplication has become a standard feature in backup software and disk-based backup products owing to the big cost savings it can enable, but it has yet to make the breakthrough to the mainstream of primary storage. However, most vendors offer primary storage deduplication or are set to introduce it, and a number of startup vendors now incorporate primary deduplication into the design of products aimed at supporting virtual environments.

In this interview, Bureau Chief Antony Adshead speaks with Chris Evans, an independent consultant with Langton Blue, about the pros and cons of primary deduplication and how to add primary deduplication to your existing environment. Read the transcript or listen to the podcast below.

What are the pros and cons of primary deduplication?

Evans: I’d like to … [set] the scene [by defining] what we mean by primary deduplication. … Deduplication is the process of looking for and then removing duplicate data on disk within a storage array. So, what we’re saying is we’re looking for repeated patterns of data that we could remove and then using metadata to point to that one single instance of it. Sometimes it’s referred to as single-instance storage.
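The single-instance storage Evans describes can be sketched as a toy content-addressed block store. This is a hypothetical simplification, assuming fixed-size blocks and SHA-256 fingerprints as the "repeated pattern" detector; real arrays use variable-length chunking and handle hash collisions:

```python
import hashlib

class DedupStore:
    """Toy single-instance block store: each unique block is kept once,
    and logical writes are recorded as metadata pointers into it."""

    def __init__(self):
        self.blocks = {}   # fingerprint -> block data (single instance)
        self.volume = []   # logical layout: ordered list of fingerprints

    def write(self, block: bytes) -> str:
        fp = hashlib.sha256(block).hexdigest()
        # Only store the block if this pattern hasn't been seen before
        self.blocks.setdefault(fp, block)
        self.volume.append(fp)
        return fp

    def read(self, index: int) -> bytes:
        # Follow the metadata pointer back to the single stored instance
        return self.blocks[self.volume[index]]

store = DedupStore()
for data in [b"AAAA", b"BBBB", b"AAAA", b"AAAA"]:
    store.write(data)

print(len(store.volume))   # 4 logical blocks written
print(len(store.blocks))   # only 2 unique blocks stored
```

Four logical writes land as two physical blocks; the metadata in `volume` is what lets reads reconstruct the original data.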

When we say primary storage dedupe, we’re talking about primary storage arrays that we [use to] deliver out Notes and databases and all those other things, and actually putting dedupe into that product. And that’s different to, say, deduplication in a backup environment.

Now … let’s talk about pros and cons. Clearly, one of the main reasons for doing dedupe is to remove cost. In the backup world, we see massive savings from deduplicating the backup data, and that comes because a lot of that is repeated data.

In the primary storage world, we see a similar situation where we can do storage reduction and therefore we can make some good savings. [That fits well] with virtualisation, where people are [creating] multiple copies of, say, desktops (as part of VDI) or multiple server instances. So clearly, cost saving is a big [pro].

There’s obviously the benefit to performance. If you’re only writing that small deduplicated subset of data to disk, you get increased performance because you’re only having to read and write a small amount of data. That also translates into deployment of hardware; if you can reduce your footprint, you’ll be making [savings] in terms of environmental [costs].

We also see a benefit in terms of replication. If we were taking some of this data to another array and we’ve already deduplicated it, we only have to replicate a small amount.
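That replication saving follows directly from the fingerprinting: only blocks the remote array hasn't already seen need to cross the wire. A minimal sketch, assuming the same fixed-size-block, SHA-256 model as above (hypothetical, not any vendor's protocol):

```python
import hashlib

def replicate(source_blocks, target_store: dict) -> int:
    """Send only blocks whose fingerprints the target doesn't hold.
    Returns the number of blocks actually transferred."""
    sent = 0
    for block in source_blocks:
        fp = hashlib.sha256(block).hexdigest()
        if fp not in target_store:
            target_store[fp] = block   # transfer the one new instance
            sent += 1
    return sent

target = {}
first = replicate([b"A", b"B", b"A"], target)    # 2 unique blocks sent
second = replicate([b"A", b"B", b"C"], target)   # only b"C" is new
```

On the second pass only one block is transferred, because the target already holds the fingerprints for the rest.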

So, those were the pros. Let’s talk about some of the cons.

Now clearly, by centralising a lot of data into a smaller footprint, we are potentially increasing our risk because we’ve now got a situation where multiple LUNs, or files perhaps, are all centralised and delivered off the same set of physical disks. So, we need to make sure we’re happy [to] manage that risk and take backups, but we have to be aware that that centralisation process does increase risk.

The other thing we need to look at is performance, and there are a number of issues [to do with performance and deduplication].

First of all, deduplication creates more of a random workload, and the reason for that is that as data is written to an array, we are now pulling out duplicate blocks of data and releasing them. By doing that, what’s left on disk is more random.

As we write data to that array, we either will be deduplicating inline, which means immediately, or we will do it as a post-process operation. If we do it inline, then we have a performance impact because we have to manage that matching of patterns before we write it to disk. If we do it as a post-processing task … then we have to keep a small amount of extra storage available until we can deduplicate that new data we’ve just written.
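The trade-off between the two modes can be illustrated with a toy sketch (hypothetical code, not any vendor's implementation): inline dedupe pays the pattern-matching cost on the write path itself, while post-process lands the data immediately and needs extra staging capacity until a background task folds it in.

```python
import hashlib

def fingerprint(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def write_inline(store: dict, block: bytes) -> str:
    """Inline: match the pattern before the write is acknowledged,
    so the fingerprinting cost sits on the write path."""
    fp = fingerprint(block)
    store.setdefault(fp, block)
    return fp

def write_then_dedupe(staging: list, block: bytes) -> None:
    """Post-process: land the data immediately; this staged copy
    consumes extra capacity until the background task runs."""
    staging.append(block)

def post_process(staging: list, store: dict) -> list:
    """Background task: fold staged blocks into single instances
    and reclaim the staging capacity."""
    refs = []
    for block in staging:
        fp = fingerprint(block)
        store.setdefault(fp, block)
        refs.append(fp)
    staging.clear()
    return refs
```

Either way the end state is the same single-instance store; what differs is whether the matching work delays the write or whether you temporarily hold duplicate data on disk.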

So, those are the cons in this environment.

What are the pros and cons of adding primary deduplication to existing storage versus introducing it in a separate array or product?

Evans: Let’s just think about the idea of introducing deduplication into a product we already have. A lot of vendors have started to do that, and a good example of one of the first to [do so] was NetApp. They brought in primary deduplication as a function called A-SIS, and that’s been very successful. That worked well because their architecture is very well tailored to it. We obviously still see some impact on performance there because that was a device that was built to do other things, and as capacity increases, we do see performance impacts from that.

Looking at other vendors, we’ve seen the idea of deduplication being brought into the newer products that are coming to the market. What we’re seeing is that there are benefits there if you are coming up with a new architecture because you can build deduplication into your product; it’s much more difficult to retrofit it into older, legacy products.

One of the benefits of having deduplication built into the product itself is it means that ROI for that product … can be improved. A good example of where that can happen is with solid-state drive (SSD) arrays. SSD is very expensive, but if we’re getting compression and deduplication on that primary data, then clearly we can demonstrate a better ROI and justify that device.

But clearly, those architectures need to be designed to take deduplication as part of their design, and retrofitting it into an existing array can cause problems, and we haven’t seen many vendors try that.

In terms of having [deduplication] as a separate array or product, we originally saw Ocarina, [which] came out with a device that was a separate product. You could put that in front of a standard storage array and then write your data to the underlying array.

Clearly, if you’ve got a separate device for that, the device can be optimised to do one task -- deduplication -- and the storage array continues to do its job. If you wanted to replace that storage in that environment, you could take that array out and put another one in, and the deduplication layer continues as it did before.

Obviously, if they’re separate products, one thing we have to be very careful of is that the two products might find it difficult to coexist with each other. The deduplication layer could now be producing I/O that’s very random, and, therefore, the array itself might not cope with that very well, and if the products don’t coexist together, that could be a problem. So there needs to be thought given to how that sort of technology is set up and configured.

Overall, it depends what you’re looking to achieve as to which route you decide to go for.
