How suitable are data deduplication and compression for use in primary storage scenarios? They’re quite different technologies, and their suitability as data reduction techniques for use with primary storage will vary depending on the use case. In some cases, where the type of data and application favour it, data deduplication and data compression could both be applied to the same data set.
In this interview, SearchStorage.co.UK Bureau Chief Antony Adshead speaks with Steve Pinder, practice lead for storage with GlassHouse Technologies (UK), about the difference between data deduplication and compression, their suitability for use with primary storage and how you can determine whether these data reduction techniques are suited to your primary storage environment.
You can either read the transcript below or listen to the podcast on data reduction techniques for primary storage.
SearchStorage.co.UK: What is the difference between data compression and data deduplication as data reduction techniques in primary data scenarios?
Pinder: Data compression and deduplication are not separate names for the same thing but are actually two different technologies. As they’re not related, they can be used independently or can even complement each other to further save space on primary storage.
First of all, compression is the process of using algorithms to reduce the amount of physical space that a file takes up. It usually involves changing the format of the file and therefore can be quite a labour-intensive process. A good example is “zipping” up documents to send in emails. … There are two main types of compression algorithm: lossless and lossy.
Lossless algorithms, as the name suggests, convert the data without any loss of the original file whatsoever. So, the algorithm compresses the amount of space the file takes up, but you can then uncompress the file and it returns to the exact original file.
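As a toy illustration of the lossless round trip described above, the following sketch uses Python’s standard zlib module; the sample data is invented for the example, and real appliances would use their own algorithms.

```python
import zlib

# Repetitive data, like a text file or spreadsheet, compresses well.
original = b"quarterly sales figures, region north\n" * 200

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossless: the round trip returns the exact original file.
assert restored == original
print(f"{len(original)} bytes -> {len(compressed)} bytes compressed")
```

The space saving comes entirely from redundancy in the data; random data would barely shrink at all.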
Lossy algorithms are more efficient than lossless algorithms, but if you uncompress them it’s impossible to get back to the … original file.
For lossless compression, good examples of where we use this technology are spreadsheets and text files where we need the data. Lossy compression can be used in scenarios such as video streaming and photographs, where what [generally happens] is you get slight loss of quality of the file, but it’s not discernible to the naked eye.
It should be noted, however, that if you use a lossy data compression algorithm over and over again, you’ll get to a stage where the original file is not of sufficient quality.
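To make the lossless/lossy distinction concrete, here is a deliberately crude toy lossy scheme (not a real codec): it simply discards every other sample and reconstructs by repetition. The function names and data are invented for illustration; the point is that, unlike the lossless case, the reconstruction can never match the original.

```python
def lossy_compress(samples, step=2):
    # Keep only every `step`-th sample; the rest are discarded for good.
    return samples[::step]

def lossy_decompress(kept, step=2):
    # Crudely reconstruct by repeating each kept sample `step` times.
    out = []
    for s in kept:
        out.extend([s] * step)
    return out

signal = [0, 1, 2, 3, 4, 5, 6, 7]
restored = lossy_decompress(lossy_compress(signal))

# Same length, but the detail between kept samples is lost for good.
assert len(restored) == len(signal)
assert restored != signal  # impossible to get back to the original
```

Real lossy codecs such as JPEG are far more sophisticated, but the principle is the same, which is why repeated re-encoding eventually degrades quality visibly.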
Deduplication, by contrast, reduces the size of large data sets by removing information that’s duplicated and leaving a pointer to the original data. So, you’ve got one copy of the data and pointers to that [from where other examples of the same] data used to be.
Data deduplication can work at the file or block level. As an example, if I were to send an email to 20 people with an attachment, there’d be 20 copies of the attachment in the email system. What a data deduplication appliance could do is keep the original copy of that attachment and then put pointers to it from where the other copies of the file would be. So, data deduplication can be very efficient where you have lots of copies of user file data or lots of pages of data that are the same.
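A minimal sketch of block-level deduplication, mirroring the 20-copy attachment example: each block is hashed, stored once, and duplicates are replaced by hash “pointers”. The block size, function names and data here are assumptions for illustration, not how any particular appliance works.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch

def deduplicate(data):
    """Store each unique block once; keep a hash 'pointer' per position."""
    store = {}     # digest -> block contents (one copy only)
    pointers = []  # one digest per block position in the original data
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # only stored the first time
        pointers.append(digest)
    return store, pointers

def rehydrate(store, pointers):
    # Follow the pointers back to the single stored copy of each block.
    return b"".join(store[d] for d in pointers)

# 20 identical "attachments": stored once, pointed to 20 times.
attachment = b"A" * BLOCK_SIZE
mailbox = attachment * 20
store, pointers = deduplicate(mailbox)
assert len(store) == 1 and len(pointers) == 20
assert rehydrate(store, pointers) == mailbox
```

The hashing and lookup on every write is exactly the I/O overhead that, as discussed below, has historically kept deduplication out of latency-sensitive primary storage.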
Deduplication has been used for a long time in backup applications but is becoming a more acceptable method for saving space in primary file systems as well.
SearchStorage.co.UK: What are the benefits and limitations of these two data reduction techniques when applied to primary data storage?
Pinder: Take … the example I gave previously of “zipping” a file up: it’s pretty impractical to use that type of compression for primary storage on a day-to-day basis. It’s not acceptable to expect the user to compress and uncompress data as they want to use it. For one-off operations like sending emails, it’s fine, but for day-to-day operations it’s not acceptable.
[But] appliance-based engines that can compress data as it’s written to primary storage [are becoming available]. These are becoming more prevalent in NAS-based workloads and can offer greatly reduced storage footprints with little or no performance degradation. Indeed, some of the vendors that offer these sorts of appliances and services claim that the performance of the file system is increased because the data that’s written to the storage is actually smaller because it is compressed.
Deduplication has traditionally been used only for nearline or archive file systems because there’s been quite a high I/O overhead to process the data, although improvements have made it more viable for many environments as well. Good candidates for data deduplication are virtual environments and user home directories.
In VM environments, if we have many VMs on the same physical machine, they use the same OS and application files, and that lends itself well to deduplication. With home directories, many people save copies of the same file to the data area, so deduplication can be very efficient there too.
One thing that needs to take place when you’re considering deduplication for a primary data area is to run a proof-of-concept, because not all data types lend themselves to data deduplication, for example, databases that are updated all the time. What can happen with these data sets is that the process of trying to deduplicate the data adds an overhead to the file system, and performance will be greatly reduced.
So, it’s important to run a proof-of-concept before considering data deduplication or compression on primary storage arrays.
So, just to recap, benefits [of compression and data deduplication for primary storage include getting] a large decrease in physical capacity used, but they’re not suitable for all data types.
This was first published in June 2011