"It just works," said Peter Raettig, IT operations manager at London-based newspaper publisher Trinity Mirror, which uses the data deduplication feature built into its NetApp filers. "We got it with the NAS boxes. We didn't enable it straight away, but when we did, it was very boring -- we just switched it on and it worked."
Trinity Mirror uses deduplication on the volumes that hold its operating systems, and it has "had 80% to 90% reduction there," Raettig said. "Another area where we've had some success is a 10% to 15% reduction so far on pictures, page layouts, stories and so on."
The real win with primary storage data dedupe is that it allows you to meet the increased demands for storage capacity without buying more hardware, said Phil Knowles, Unix systems manager at Hertfordshire-based Imagination Technologies, which develops multimedia processor hardware and software. The nature of its business means his company needs to retain project data on primary storage, but Knowles was finding this increasingly difficult to achieve.
"Our server rooms were at their maximum capacity, and disk storage uses the majority of that space," he explained. "We really couldn't afford an expansion in disk space and didn't want to throw away perfectly good equipment to replace it with higher-density storage."
So Imagination Technologies bought deduplication appliances from Ocarina Networks, which sit in front of the storage and offload the work involved. "It works fine, and we're getting reduction ratios of nearly 80%, much better than other file system compression tools we benchmarked," Knowles said.
How does primary storage data dedupe work?
Data deduplication looks for repeated patterns within data streams and replaces them with pointers to a shared copy. It does this by using a cryptographic algorithm to generate a checksum called a hash for each chunk of data – a sort of digital fingerprint. If a new hash matches a stored one, the new data is replaced by a pointer to the existing chunk.
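In outline, a block-level version of that process might look like the following Python sketch. The fixed-size 4 KB chunks and the function names are simplifying assumptions for illustration; real products typically use more sophisticated, often variable-size, chunking.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size blocks; a simplification for illustration

def dedupe(data: bytes):
    """Split data into chunks, keep one shared copy per unique chunk,
    and record a pointer (the hash) for every chunk position."""
    store = {}      # hash -> shared chunk copy
    pointers = []   # one hash per chunk position
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        h = hashlib.sha256(chunk).hexdigest()  # the chunk's "fingerprint"
        if h not in store:
            store[h] = chunk   # first time this pattern has been seen
        pointers.append(h)     # duplicates just reuse the existing pointer
    return store, pointers

def rehydrate(store, pointers) -> bytes:
    """Reassemble the original data from the pointers and shared chunks."""
    return b"".join(store[h] for h in pointers)
```

Three identical 4 KB blocks would be stored once, with three pointers to the single shared copy.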
The stronger your algorithm, the less likely it is that two different blocks will yield the same hash, in what's known as a hash collision. The strongest common choice is 256-bit SHA-256, which Jeff Bonwick, the Sun Fellow who co-developed the deduplication function for Sun Microsystems' ZFS file system, said makes a collision some 50 orders of magnitude less likely than an undetected error in the most reliable ECC memory.
The downside is that the more reliable you want your hashes to be, the higher the processor overhead. However, Bonwick said that if you know your data includes a lot of redundancy, you can get more speed by using a weaker hashing algorithm plus a verification step to ensure the two blocks really are identical.

Data dedupe can take place at three different levels: file, block and byte. File-level data deduplication, also known as single-instance storage, looks for multiple copies of the same file and stores only one. Lotus Notes and Novell GroupWise do this for email attachments, for example, and so did Microsoft Exchange -- until Exchange 2010, when the feature was removed. The software giant's argument was that the overhead it placed on the file system was no longer worth the physical space it saved.
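Bonwick's weaker-hash-plus-verification option can be sketched as follows. The class name and the use of Adler-32 as the cheap checksum are illustrative choices for this sketch, not anything taken from ZFS: the point is that a weak hash only narrows the search, and a match is trusted only after a full byte-for-byte comparison.

```python
import zlib

class VerifyingStore:
    """Index chunks by a cheap 32-bit checksum, then verify bytes on a hit.

    A weak hash saves CPU on every write, but two different chunks can
    share a checksum, so equality is confirmed by comparing the bytes."""
    def __init__(self):
        self.buckets = {}  # checksum -> list of stored chunks

    def put(self, chunk: bytes):
        key = zlib.adler32(chunk)            # fast, weak hash
        bucket = self.buckets.setdefault(key, [])
        for idx, stored in enumerate(bucket):
            if stored == chunk:              # the verification step
                return (key, idx)            # duplicate: reuse the pointer
        bucket.append(chunk)                 # genuinely new chunk
        return (key, len(bucket) - 1)

    def get(self, pointer) -> bytes:
        key, idx = pointer
        return self.buckets[key][idx]
```

Storing the same chunk twice returns the same pointer both times, so only one copy ever lands on disk.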
Going to block-level deduplication enables you to deduplicate files that are similar but not quite the same, such as virtual machine images or PowerPoint files with just the opening slide changed. That makes it more powerful in most applications as it can achieve higher overall data reduction rates.
Byte-level deduplication is, in theory, the most powerful, but it presents extra challenges because it must also identify where duplicated elements begin and end. As a result, while the byte-level approach can work well within applications that already know their own data structure, most general purpose data deduplication engines operate at the block level.
The next question is whether to deduplicate in real time or to do it later via a post-process or scheduled operation, perhaps running overnight. Deduplicating later saves time and processor power on the write path but uses more disk space, as the storage subsystem must hold the duplicates until the process runs.
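A post-process pass of the kind described above might look like this minimal sketch. The function name and the list-based "volume" are illustrative: duplicates sit on disk at full size until the scheduled job collapses them into pointers, which is exactly the space-for-time trade-off.

```python
import hashlib

def post_process_dedupe(volume: list) -> list:
    """Scheduled pass over a 'volume' (a list of raw chunks already
    written to disk without any inline deduplication).

    Each duplicate chunk is replaced by a ('ptr', index) marker pointing
    at the first copy, freeing the space it occupied."""
    seen = {}       # hash -> index of the first copy that is kept
    deduped = []    # kept chunk, or ('ptr', index) for a collapsed duplicate
    for chunk in volume:
        h = hashlib.sha256(chunk).digest()
        if h in seen:
            deduped.append(("ptr", seen[h]))   # collapse the duplicate
        else:
            seen[h] = len(deduped)
            deduped.append(chunk)
    return deduped
```

An inline engine would run the same hash lookup on every write instead, trading write latency for immediate space savings.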
Who sells primary storage data dedupe?
EMC's Celerra devices and NetApp's filers each offer a post-process form of data deduplication. EMC's operates at the file level, while the version built into NetApp filers' Data Ontap operating system operates at the block level.
Conversely, both Permabit's Scalable Data Reduction (SDR) technology and Sun's ZFS (now owned by Oracle) deduplicate in real time. Both companies argue that the processing power of a network-attached storage (NAS) box shouldn't be the limiting factor, Permabit because its grid architecture allows it to add access nodes as needed to handle the hashing load, and Sun because it has highly multithreaded software running on multicore and multiprocessor hardware, which gives it CPU cycles to spare.
Data deduplication isn't the only way to achieve primary storage capacity reduction, however. The main alternative is random access data compression. Some vendors make use of both methods by further compressing the files and blocks that make it through the deduplication process.
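Combining the two methods can be sketched as: deduplicate first, then compress each unique chunk that survives. The function name and the use of zlib are illustrative assumptions, not any vendor's implementation.

```python
import hashlib
import zlib

def dedupe_then_compress(chunks):
    """Keep one copy per unique chunk, stored compressed;
    duplicate chunks become pointers to the compressed copy."""
    store, pointers = {}, []
    for chunk in chunks:
        h = hashlib.sha256(chunk).digest()
        if h not in store:
            # compression squeezes redundancy *within* the chunk
            # that deduplication (which works *across* chunks) can't remove
            store[h] = zlib.compress(chunk)
        pointers.append(h)
    return store, pointers
```

Reading a chunk back means following the pointer and decompressing the stored copy, so the two techniques stack rather than compete.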
Ocarina Networks, for example, makes extensive use of both approaches, having developed post-process technology that can pull files apart and deduplicate their constituent elements -- even already-compressed ones. Its ECOsystem appliance has a wide set of compression algorithms that it can apply to remaining data, each one optimised for a different file or data type. It claims this allows it to squish file types that other deduplication products make no impression on.
One company that does see compression and data dedupe as rivals -- on primary storage at least -- is Storwize. Its real-time data reduction appliances apply compression only, albeit a patented high-powered variety. Steve Kenniston, Storwize's vice president of technology strategy, argues that as well as offering better performance than deduplication, compression has the key benefit that it's lossless -- its bits may be rearranged and squashed down, but each file is still there on disk in its entirety.
When to use data deduplication for primary storage
Deduplicating primary storage is probably not for everyone. Applied to the right datasets -- which typically means relatively static ones at the nearline end of the primary storage spectrum -- it can yield big savings; on the wrong data, it will add latency that you don't want.
"You need to test it with your data before committing to a purchase," Imagination Technologies' Knowles said. "For us, testing before we bought highlighted some significant differences with the way we and Ocarina interpreted compression rules. We resolved these very quickly, but the data type is unique to each company."
"It's fair to say that it's more suited to static types of data," Trinity Mirror's Raettig said. "It was very helpful for our technical people to understand how it worked. Obviously, we don't use it for swap space."
Users and analysts alike see primary storage data dedupe becoming a standard feature of devices such as NAS filers, to be turned on or off for specific datasets as needed.
"It's absolutely something to expect in a storage system," Raettig said. "When we chose our storage, deduplication was definitely a consideration."
"I think it's almost like data loss prevention -- it will get baked into everything," said Tony Lock, programme director at industry analyst firm Freeform Dynamics. "It's going to happen in primary storage in lots of areas, but when is up for debate."
According to a recent report by Gartner, "by 2013, the majority of primary NAS storage will be deduplicated, reducing capacity costs by 50%."
Could this herald a NAS resurgence? Perhaps, as the report's authors note that data deduplication is easier to implement on file-based NAS than on block-based storage-area networks (SANs). They also suggest that businesses should consider "changing the ratio of NAS to SAN storage when refreshing their infrastructure" to take advantage of this.