
Dedupe technology: A natural partner to backup

See how three UK IT organisations are saving money and storage capacity with data deduplication for backup.

As many UK IT organisations are learning, data deduplication is a natural fit with backup infrastructure and can save time and money. Applying data deduplication at the backup source, for example, can shrink backup windows significantly. The users we spoke with reported average deduplication ratios of around 22:1.

Dedupe also reduces the storage capacity that backups consume.

Hamish Macarthur, CEO at Surrey-based analyst house Macarthur Stroud International, said, "[Data deduplication] technology is ready to deliver. It resonates with today's levels of data growth and allows people to get better use from their storage capacity."

But there are challenges too. The key to success is to ensure that data deduplication is carefully planned and managed, said Macarthur. "Data deduplication can make for longer restores so you have to be careful how you use it. And data needs to be properly positioned to take advantage of deduplication."

"You have to look at data deduplication in the context of specific applications and how and where they run. For example, because restore times can be affected, you might not apply it to some critical applications."

Dedupe technology implementation issues

Data deduplication products compare incoming data with what is already stored. If the incoming data already exists, it is replaced with a pointer rather than stored again. While all data deduplication technology uses the same basic technique, products vary in a few key ways.
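
The pointer-based technique described above can be sketched in a few lines of Python. This is a minimal illustration only, not how any particular product works: the `DedupeStore` class, its fixed chunk size and the use of SHA-256 as the chunk fingerprint are all assumptions made for the example, and real products differ in how they chunk and index data.

```python
import hashlib


class DedupeStore:
    """Toy content-addressed store (hypothetical): each unique chunk is kept once."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # fixed-size chunking is an assumption
        self.chunks = {}              # fingerprint -> chunk bytes, stored once
        self.logical_bytes = 0        # total bytes sent by all backups

    def write(self, data):
        """Store a backup and return its recipe: a list of chunk fingerprints."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # If the chunk already exists, only the pointer is recorded.
            self.chunks.setdefault(fingerprint, chunk)
            recipe.append(fingerprint)
        self.logical_bytes += len(data)
        return recipe

    def read(self, recipe):
        """Restore a backup by following its pointers back to the stored chunks."""
        return b"".join(self.chunks[fingerprint] for fingerprint in recipe)

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())

    def ratio(self):
        return self.logical_bytes / self.stored_bytes()


# Two nightly backups that differ only in their last chunk.
store = DedupeStore(chunk_size=4)
monday = store.write(b"AAAABBBBCCCC")
tuesday = store.write(b"AAAABBBBDDDD")
print(f"{store.ratio():.1f}:1")  # 24 bytes received, 16 stored -> 1.5:1
```

The `read` method also shows why restores deserve attention: a restore has to reassemble the data from chunks that may be scattered across the store.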

One key differentiator between products is whether they carry out deduplication inline or post-process. Inline data deduplication works on data before it is committed to disk. This cuts down on the disk space required, but the deduplication process can slow the ingestion of backup data. This contrasts with post-process deduplication, which occurs after backup data has been ingested.

You also need to decide where deduplication will occur: at the backup source or target. Source data deduplication carries out its work at the backup source and therefore reduces data volumes before transmission to the backup target. It is most effective when it works across multiple data sets, so the backup server should ideally handle several sources. It's also worth noting that source deduplication is a CPU-intensive process that puts an additional load on application servers (the backup clients) and backup servers, which can adversely affect those systems' primary production tasks.

Carrying out deduplication at the target means sending unreduced backup data sets across the network. But it can be more efficient than source dedupe because it works on all the data arriving during the backup window and so is likely to achieve the best data reduction ratios. Application and backup servers are also spared the CPU overhead of deduplication processing.

Whether you want to transmit un-deduplicated backups will depend on whether your network can shoulder the load. A 10 Gigabit Ethernet (GbE) data centre network may well be able to deal with it, but the WAN may grind to a halt if remote backups are involved or if data is mirrored between data centres.
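
A rough back-of-envelope calculation shows why the network matters. The sketch below is illustrative only: the `transfer_hours` function, the 10 TB backup size, the 100 Mbit/s WAN speed and the 70% link efficiency are all assumptions chosen for the example, and the 22:1 ratio is simply the average reported by the users in this article.

```python
def transfer_hours(data_tb, link_gbps, dedupe_ratio=1.0, efficiency=0.7):
    """Estimate hours needed to send a backup over a link.

    data_tb      -- backup size in decimal terabytes before deduplication
    link_gbps    -- nominal link speed in gigabits per second
    dedupe_ratio -- reduction applied before transmission (1.0 means none)
    efficiency   -- assumed fraction of nominal bandwidth actually achieved
    """
    bits_to_send = data_tb * 1e12 * 8 / dedupe_ratio
    seconds = bits_to_send / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


backup_tb = 10  # hypothetical nightly backup

print(f"10 GbE LAN, no dedupe:  {transfer_hours(backup_tb, 10):.1f} h")
print(f"100 Mbit/s WAN, no dedupe: {transfer_hours(backup_tb, 0.1):.1f} h")
print(f"100 Mbit/s WAN, 22:1:      {transfer_hours(backup_tb, 0.1, dedupe_ratio=22):.1f} h")
```

Under these assumptions the unreduced backup needs roughly 317 hours on the WAN link, against roughly 14 hours once it has been reduced 22:1 before transmission.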

Wherever you dedupe, according to Macarthur, the key to success is to understand the data sets that are being deduped and locate the technology where most appropriate.

Above all, it is vital never to lose sight of the point of backing up, which is to restore data in case of data loss. So, be sure that if you add data deduplication into the backup equation, you can recover data if you need to, said Macarthur. "For example, different vendors use different deduplication engines; so, if you have products from multiple vendors, will you be able to successfully restore?"

What is clear from speaking to users is that deduplication technology requires careful consideration. As with any other technology, the value is in the implementation as much as in the product.

Ordnance Survey uses dedupe to deal with huge data growth

Southampton-based surveying and mapping organisation Ordnance Survey (OS) manages more than 700 TB of data and backs up more than 100 TB every week.

Prior to its deduplication project, data went to three tape libraries with 24 LTO-1 drives in total. The organisation used disk staging to gather the data before writing to tape. The problem, said support technician Mark Hunt, was the constant battle to ensure sufficient tape media for the growing data set. "We went from 10 TB to 600 TB in five years and couldn't cope. We were buying 200 tapes a month just to keep up, yet drives were failing and we spent a lot of time at weekends just checking and loading tapes."

So the OS evaluated a Data Domain DD430 and tested it on an Oracle database. Hunt said the compression ratio was 10:1 on the first run and 50:1 after that. The organisation then bought a pair of DD560s, one for installation at headquarters, the other for off-site replication.

"We wanted to get our critical data such as Exchange, Oracle and SAP deduped, replicated off-site and kept online for a month," said Hunt. "A year later we bought two DD565s and a DD580. We moved the old models off-site and used them as replication targets, and deduping meant we had no problems with the links. We still use tape for longer retention and undedupable data such as imagery, and we keep such data on certain servers to separate it out."

"Most of our data sits in Oracle databases, which deduplicate very well. Then we started getting VMware ESX in and had 200 virtual servers backed up. We saw 100:1 data reduction on those," said Hunt.

How much money has data deduplication saved the OS? Hunt said, “My previous boss put a saving of £12,000 a year on tape library maintenance and licences and also £45,000 on reapportioning primary storage. There are also a lot of other costs saved such as power and cooling in the data centre and also my time, as I used to spend at least 12 hours a week managing the tapes.”

Hunt said the average dedupe ratio gained over seven appliances is around 22:1. “This figure is lower than normal as I have been using the appliances to move data between sites and to back up systems that I would not normally put on them such as already highly compressed air photography images. Our best appliance has 298 TB of data on it, deduped and compressed to 4.6 TB, which is a ratio of 64.8:1.”
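
The arithmetic behind Hunt's best-appliance figure is easy to check. The helper names below are hypothetical, and the inputs are the 298 TB and 4.6 TB quoted above.

```python
def dedupe_ratio(logical_tb, stored_tb):
    """Data reduction ratio: logical data protected per unit actually stored."""
    return logical_tb / stored_tb


def space_saving_percent(logical_tb, stored_tb):
    """The same result expressed as the percentage of capacity saved."""
    return (1 - stored_tb / logical_tb) * 100


ratio = dedupe_ratio(298, 4.6)
saving = space_saving_percent(298, 4.6)
print(f"{ratio:.1f}:1")   # 64.8:1
print(f"{saving:.1f}%")   # 98.5%
```

Expressed as a percentage, a 64.8:1 ratio means about 98.5% of the capacity the backups would otherwise occupy is saved.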

Now the OS has a new data centre in Gloucester housing production systems, which are replicated automatically to Southampton over a separate backup network. "There's no user intervention, no link problems due to data overload, no need to change tapes, and we've had no disk failures," said Hunt.

"We have 100 percent confidence that we can restore without crossing our fingers or going off-site. In four years I have not had a single issue restoring deduplicated data. I average more than 20 restores a week and have also done many restores off the replica appliances without any issues," said Hunt.

Service provider slashes data stores with client and back-end dedupe

OncoreIT is a service provider that uses the Asigra platform to provide backup to customers. It reduces the volume of backups stored by eliminating duplicate data across customer data sets: at the source using the Asigra client, and in the firm's data centre using dedupe functionality on its storage arrays.

Chief Operating Officer David Ebsworth said OncoreIT uses deduplication because it cuts down on data volumes and saves the client money. "If we didn't have it we would be storing the same data for multiple clients, which across our 200 TB of data would be astronomical."

The firm runs a pair of BlueArc Titan 3000s in data centres in London and Amsterdam. Ebsworth estimated that data deduplication saved about 100 TB of capacity and reduced spend on new storage by around 40%.

Ebsworth said the firm gained the highest deduplication ratios from eliminating duplicated operating system images and service packs. "When one client sends the same data (as identified by its digital signature) that's exactly the same as data from another client, we only retain one instance. So if you have servers with the same operating systems, for example, I only store them once, which reduces the client's storage bill for it to zero." Customers know their data is deduplicated in conjunction with that of other customers and are happy for that to happen; OncoreIT makes a selling point of the savings they gain from it.

"Dedupe works at two levels," said Ebsworth. "We dedupe on the client site before data is transmitted to our back-end data vaults. With client site deduplication, our aim is to ingest less data. Back at my vault, I dedupe and I may find that I have exactly the same data in my vault. The cost of processing deduplication is much less than the extra cost of spinning disk. If we can store less, we cost less. And it just worked out of the box."

Ebsworth reports no problems at all in restoring deduplicated data.

CMC Markets gets selective with deduplication

Financial services company CMC Markets uses data deduplication to implement live replication between two sites that provide the infrastructure for CMC's Internet-based financial spread betting services.

Since Technical Operations Manager Greg Gawthorpe oversaw the first implementation of data deduplication at CMC Markets in 2008, he has learned a few things, especially about which data is best left unreduced and which needs to be deduplicated. He is now in charge of four Data Domain 530s and two 40 TB Nexsan SATABeasts and benefits from an average reduction ratio of 23:1.

"What we did originally was throw data at our Data Domains, and when they filled up we bought more storage. We spent an inordinate amount of time managing the systems manually as we hadn't left any headroom. We bought more storage but only after we had completed our tiering exercise."

"We back up to the deduplication device, which replicates itself to the remote site, making copies for daily, weekly and monthly retention with some data retained for up to seven years. We take and keep multiple copies of data, some off-site, some on-site."

Gawthorpe said that scheduling and planning backups is important because deduplication systems that are running flat-out can become a bottleneck.

"It took us six months to organise and automate data tiering policies," he said. "Now we only send data to the deduplication devices that we need one or two copies of. For example, some backups need to remain on-site for four weeks. Others need only one cached copy so we don't dedupe that; instead it goes straight to a SATABeast. We also have offline databases that back up straight to tape, so we took those out of the deduplication stream too.

"In other words, if you don't send the dedupe devices the right data, you won't make best use of them. They're not cheap, but they definitely have a place in the infrastructure. We don't have to touch them anymore," said Gawthorpe.

He added, “When comparing the Data Domains to comparable nearline storage such as SATA we saved in the region of £100,000 on the initial outlay. With regard to power usage and associated cooling when compared to SATA, we worked out we would save approximately £25,000 per annum.”
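
The kind of tiering policy Gawthorpe describes could be expressed as a simple routing rule. The sketch below is hypothetical and is not CMC Markets' actual configuration: the `BackupJob` fields, the `Target` names and the order of the checks are assumptions drawn only from the three cases he mentions (offline databases to tape, single cached copies to SATA, everything else to the deduplication devices).

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    DEDUPE_APPLIANCE = "dedupe appliance"
    SATA_ARRAY = "SATA array"
    TAPE = "tape"


@dataclass
class BackupJob:
    name: str
    single_cached_copy: bool = False  # only one short-lived copy is needed
    offline_database: bool = False    # database backed up while offline


def route(job):
    """Pick a backup target for a job (illustrative policy only)."""
    if job.offline_database:
        return Target.TAPE              # kept out of the dedupe stream
    if job.single_cached_copy:
        return Target.SATA_ARRAY        # little repeated data to deduplicate
    return Target.DEDUPE_APPLIANCE      # repeated copies retained over time


jobs = [
    BackupJob("weekly-file-server"),
    BackupJob("scratch-cache", single_cached_copy=True),
    BackupJob("archive-db", offline_database=True),
]
for job in jobs:
    print(f"{job.name} -> {route(job).value}")
```

Automating a rule like this is what keeps unsuitable data away from the appliances without manual intervention.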

Like Ebsworth and Hunt, CMC’s Gawthorpe reported no issues with restoration from deduplicated data.
