A guide to data de-duplication

With analysts predicting data will exceed storage limits in three years, data de-duplication could be the answer for businesses. Here’s what you need to know.

A spectre is stalking the world’s datacentres, and it is one that is ever increasing in volume. That spectre is data storage, and by 2010, the amount of data we produce is predicted to exceed the capacity of the world’s storage systems.

That is the prediction of research group IDC, which issued a report in March saying the amount of data created globally will increase to 988 exabytes – that’s 988 billion gigabytes – by 2010 while the capacity of storage systems will reach 600 exabytes.

Such expansion is replicated at enterprise level too, and it comes from the massive growth in electronic communication and business processes. Every time, for example, someone copies colleagues in on an e-mail with a presentation attached, the data volume multiplies by the number of copies. And the processes used to back up data exacerbate the situation too, with full back-ups duplicating incremental ones, for example.

According to research company Macarthur Stroud International, businesses hold an average of between three and five copies of files, with 15% to 25% of organisations having more than 10.

Such duplication of data has real implications for enterprises, not just in storage capacity required, but in time and resources spent backing up and transporting data.

This is where data de-duplication comes in. The core technology is not new, and it is essentially data compression applied to storage. However, in the past year there has been a flurry of acquisition activity such as EMC’s £80m takeover of Avamar, and ADIC (now Quantum) buying Rocksoft for £30m.

But what benefits can data de-duplication bring, how do suppliers’ offerings differ, and what should you consider – in technology and market terms – if thinking of going down the de-duplication road?

Data de-duplication suppliers claim reduction ratios of anywhere between 10:1 and 50:1. Such ratios allow you to reduce the sheer physical amount of storage capacity you need to use. The two key benefits that flow from this are that you can move from tape to disc, and in some cases you can reduce data volumes so far that back-up via IP networks becomes possible.

With a de-duplication ratio of 20:1, for example, discs can hold 40 days’ worth of data where previously they would hold only two. If you have to keep six weeks of e-mail, you might today back up the first two days to disc and days three to 42 to tape. With data de-duplication you can put it all on disc.
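The arithmetic behind that claim can be sketched in a few lines. This is a minimal illustration only: the function name is hypothetical, and it assumes every day's back-up is the same size and that the ratio applies evenly to all of it.

```python
def retention_days(days_without_dedup: float, dedup_ratio: float) -> float:
    """Days of back-ups a disc pool can hold once data shrinks by dedup_ratio:1.

    Illustrative assumption: each day's back-up is the same size and the
    reduction ratio applies uniformly to all of it.
    """
    if dedup_ratio <= 0:
        raise ValueError("dedup_ratio must be positive")
    return days_without_dedup * dedup_ratio


# Two days of raw disc capacity at a 20:1 ratio.
print(retention_days(2, 20))
```

On the same assumptions, holding a full 42 days on that disc would take a ratio of 21:1, which is within the 10:1 to 50:1 range suppliers claim.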

Kevin Platz, managing director of EMEA sales for Data Domain, says, “Most people who are backing up data and living with the pain of tape can benefit from data de-duplication. With de-duplication ratios of a factor of 20 it is possible to replace tape with disc.”

“It is most suited to high-use cases, particularly where there are e-mail systems and databases putting demands on storage, and often if long-term storage is required, tape has been the only option. It is also applicable for remote back-up situations that would have been done by physically transporting tape – now the data can go over the network.”

At present, data de-duplication does not come cheap, however, so it is best suited to organisations that can justify the cost.

Frank Bunn, marketing manager at Symantec and on the board of directors at Storage Networking Industry Association Europe, says, “It makes sense for people trying to reduce complexity in the datacentre and reduce storage capacity. Data de-duplication is better suited at the moment to bigger enterprises, although we will see a move over the next two or three years towards the mid-market, and even consumers.”

So, how does data de-duplication work? Methods vary between suppliers according to the precise algorithm used, but all are essentially forms of data compression: blocks of data are identified and, if they recur, the repeats are removed and replaced with a much shorter reference to the stored copy.

On restoration, the software puts the original block back wherever such a reference appears.
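A minimal sketch of that idea follows. It is not any supplier's algorithm: the fixed block size, the use of SHA-256 digests as the short references, and the names `deduplicate`, `restore` and `BLOCK_SIZE` are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size; real products differ


def deduplicate(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks, keep each distinct block once in
    store, and return the ordered list of short references (digests)."""
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        store.setdefault(digest, block)  # only the first copy is kept
        recipe.append(digest)
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original data by looking up each reference in order."""
    return b"".join(store[digest] for digest in recipe)


# Twenty identical blocks (think of the same attachment sent to 20
# colleagues) followed by one block that differs.
store = {}
original = b"A" * BLOCK_SIZE * 20 + b"B" * BLOCK_SIZE
recipe = deduplicate(original, store)
stored = sum(len(block) for block in store.values())
print(len(original), "bytes in;", stored, "bytes of unique blocks kept")
print("restored correctly:", restore(recipe, store) == original)
```

A scheme like this treats two blocks with the same digest as identical, so its integrity rests on the choice of hash and on how the product verifies matches.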

But the similarities end there. Different suppliers’ products vary a great deal, and if you are thinking of opting for data de-duplication you will need to consider which supplier’s product fits your needs.

Some data de-duplication products are hardware appliances packaged as part of tape or disc libraries, such as Data Domain and Diligent. Others come as part of back-up programs, such as Symantec’s incorporation of de-duplication technology into Enterprise Vault and Netbackup. Some are software-based, such as Sepaton and FalconStor’s, although the latter also provides appliances.

Another key distinction is whether these products carry out de-duplication at the source – as with Symantec, Asigra and the Avamar products bought by EMC – or at the target, as with Diligent, FalconStor and Sepaton.

It is an important distinction if, for example, your aim is to send back-ups across an IP network. De-duplication at the target often means two passes in reality – an initial back-up of all data, then a slimming down as duplicate blocks are removed – so the full volume of data still has to reach the target first.
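The effect on network traffic can be illustrated with a simplified count of the bytes each approach would send. The sketch below is an assumption-laden model, not a description of any named product: the block size, the SHA-256 digests and the `known_digests` set standing in for the target's index are all hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size for illustration


def split_blocks(data: bytes) -> list:
    """Cut data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def bytes_sent_target_side(data: bytes) -> int:
    """Target-side: the whole back-up crosses the network, then is reduced."""
    return len(data)


def bytes_sent_source_side(data: bytes, known_digests: set) -> int:
    """Source-side: send only blocks whose digest the target has not seen.

    known_digests stands in for the target's index and is updated as blocks
    are 'sent'. The digests themselves also travel over the network, which
    this simplified count ignores.
    """
    sent = 0
    for block in split_blocks(data):
        digest = hashlib.sha256(block).digest()
        if digest not in known_digests:
            known_digests.add(digest)
            sent += len(block)
    return sent


backup = b"A" * BLOCK_SIZE * 9 + b"B" * BLOCK_SIZE
index = set()
print("target-side:", bytes_sent_target_side(backup))
print("source-side, first run:", bytes_sent_source_side(backup, index))
print("source-side, repeat run:", bytes_sent_source_side(backup, index))
```

In this model a repeat back-up of unchanged data sends nothing from the source, which is why source-side products are often pitched at remote sites with limited bandwidth.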

Speed of throughput is another area of contention. Disc and tape libraries run at speeds of up to 500mbps, but not all data de-duplication technologies match this. Data Domain’s appliances have a throughput of 100mbps and Diligent has published a rate of 220mbps on its Protectier virtual tape library. Such performance considerations will determine which products are suited to, say, the datacentre or the branch office.

Cost is another key differentiator. Gartner has surveyed the market and found prices for data de-duplication products range from about £4,500 per terabyte with software products such as Avamar, to £9,500-£55,000 per terabyte in hardware products such as Data Domain’s appliance and gateways.

With such a variance of attributes available on the market, users have to be clear about what they need data de-duplication for, says Bunn. “The user has to start from the position of working out what they need it for. There are a lot of products out there and they work in ways that suit different applications.”

Dave Russell, research vice-president, servers and storage at Gartner, says, “It depends what implementation approach is right for you – do you want to de-duplicate at the target or the source? Will the de-duplication technology work with your environment? And will that environment hold for the next couple of years?

“There are also a number of different algorithms used and this can sometimes create data integrity issues.”

Data de-duplication is in some senses an emerging market, although some products have been around for quite some time. It is also a market going through upheaval as big storage players buy out the smaller businesses.

Claus Egge, program director, European storage systems research at IDC, believes the market will change significantly in the coming years, but says that is no reason not to commit. “The savings in terms of data are very impressive, so if you can do it, why not?” he says.

“But it is very early to say if the market will really take off. Data de-duplication is achieving some prominence in the media right now, but that might be to do with the acquisition activity that is taking place, with suppliers lining up to get taken over by bigger companies.

“As that takes place we will see data de-duplication become a feature of existing systems, and in five to seven years it will be taken for granted.”

Russell believes the market has yet to settle down, but says benefits can be gained by implementing now.

“It is reasonable to expect that more acquisitions will take place. Data Domain, Exagrid and Sepaton are all candidates, but some of these companies are getting so big that the chance of them being swallowed up is decreasing,” he says.

“At this point the benefits appear to be significant enough and I do not think users need to wait and see what happens in the market to make decisions. The key thing is to ensure that the operating systems, applications and databases in your environment can be dealt with by the product.

“After that, ask for customer references from businesses of a similar size to your own using the technology in production, not testing.”

Case Study: Ordnance Survey

Ordnance Survey has trialled Data Domain data de-duplication technology and is set to implement two 560DDX 5Tbyte units with a 50:1 de-duplication ratio.

During the trial Ordnance Survey found it achieved 100% data reliability and could dispense with tape for all 30-day data.

Mastermap is Ordnance Survey’s digital representation of the real world which contains more than 450 million uniquely identified geographic features.

It is updated daily as a consistent framework for the referencing of geographic information in Great Britain and provides the basis for topographic, road/rail, address and imagery products sold to organisations.

Mastermap uses a 1.5Tbyte database that sees up to 5,000 changes a day. This means that with the various data retention periods in operation it backs up 70Tbytes a day.

Ordnance Survey decided to go down the de-duplication route after suffering poor data reliability with tape, but the switch has also simplified the back-up procedure.

“We were multi-streaming data to many hundreds of single tapes,” says Dave Lipsey, information systems infrastructure manager with Ordnance Survey.

“We were getting only about 95% reliability and this was the main reason we opted for data de-duplication. We have had 100% success in back-ups and restores. With small files we are getting them done 10 times more quickly, and with large ones double the speed.”

Lipsey estimates Ordnance Survey is saving about 600 tapes a year, up to one day a week of staff time and 30 van journeys. It can decommission its oldest tape library. But the key benefit is in the intangibles, he says.

“It will take about five years to get return on investment, but that does not really account for the intangible benefit of full reliability and reading, writing and being able to restore quickly,” he says.

“We have saved up to one day a week from back-up operations. It is completely non-disruptive, works with Netbackup and uses less space, power and cooling.”
