A spectre is stalking the world’s datacentres, and it is
one that is ever increasing in volume. That spectre is
data storage, and by 2010, the amount of
data we produce is predicted to exceed the capacity of the
world’s storage systems.
That is the prediction of research group IDC, which issued
a report in March saying the amount of data created globally
will increase to 988 exabytes – that’s 988 billion gigabytes – by
2010 while the capacity of storage systems will reach 600
exabytes.
Such expansion is replicated at enterprise level too, and it
comes from the massive growth in electronic communication and
business processes. Every time, for example, someone copies
colleagues in on an e-mail with a presentation attached, the data
volume multiplies by the number of copies. And the processes used
to back up data exacerbate the situation too, with full back-ups
duplicating incremental ones, for example.
According to
research company Macarthur Stroud International, businesses
hold an average of between three and five copies of files, with 15%
to 25% of organisations having more than 10.
Such duplication of data has real implications for enterprises,
not just in storage capacity required, but in time and resources
spent backing up and transporting data.
This is where data de-duplication comes in. The core technology
is not new, and it is essentially data compression applied to
storage. However, in the past year there has been a flurry of
acquisition activity such as
EMC’s £80m takeover of Avamar, and
ADIC (now Quantum) buying Rocksoft for £30m.
But what benefits can data de-duplication bring, how do
suppliers’ offerings differ, and what should you consider – in
technology and market terms – if thinking of going down the
de-duplication road?
Data de-duplication suppliers claim reduction ratios of anywhere
between 10:1 and 50:1. Such ratios allow you to reduce the sheer
physical amount of storage capacity you need to use. The two key
benefits that flow from this are that you can move from tape to
disc, and in some cases you can reduce data volumes so far that
back-up via IP networks becomes possible.
With, for example, a ratio of 20:1 compression, discs can hold
40 days’ worth of data where they previously would hold only two.
If you have to keep six weeks of e-mail, for example, you might
back up the first two days to disc and days three to 42 to tape.
With data de-duplication you can put it all on disc.
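The arithmetic behind that example can be checked in a few lines. This is a rough sketch only: it assumes the daily back-up volume is constant and that the reduction ratio applies evenly to every day's data, neither of which is stated in the figures above. The function and variable names are illustrative.

```python
def days_on_disc(days_without_dedup: float, ratio: float) -> float:
    """Days of back-up data the same disc capacity holds at a given reduction ratio.

    Assumes a constant daily volume and a uniform ratio (illustrative assumptions).
    """
    return days_without_dedup * ratio


baseline_days = 2     # days the disc held before de-duplication (from the text)
ratio = 20            # 20:1 reduction ratio (from the text)
retention_days = 42   # six weeks of e-mail (from the text)

disc_days = days_on_disc(baseline_days, ratio)
shortfall = max(0, retention_days - disc_days)   # days that still do not fit
min_ratio = retention_days / baseline_days       # ratio needed to hold all 42 days

print(f"Days held on disc at {ratio}:1: {disc_days}")
print(f"Days still not covered: {shortfall}")
print(f"Ratio needed for the full retention period: {min_ratio}:1")
```

On these round numbers a 20:1 ratio covers 40 of the 42 days, so holding the full six weeks on the same disc would need a ratio of about 21:1 or a little more capacity.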
Kevin Platz, managing director of EMEA sales for Data Domain,
says, “Most people who are backing up data and living with the pain
of tape can benefit from data de-duplication. With de-duplication
ratios of a factor of 20 it is possible to replace tape with
disc.”
“It is most suited to high-use cases, particularly where there
are e-mail systems and databases putting demands on storage, and
often if long-term storage is required, tape has been the only
option. It is also applicable for remote back-up situations that
would have been done by physically transporting tape – now the data
can go over the network.”
At present, data de-duplication does not come cheap, however, so
it is best suited to organisations that can justify the cost.
Frank Bunn, marketing manager at Symantec and on the board of
directors at Storage Networking Industry
Association Europe, says, “It makes sense for people trying to
reduce complexity in the datacentre and reduce storage capacity.
Data de-duplication is better suited at the moment to bigger
enterprises, although we will see a move over the next two or three
years towards the mid-market, and even consumers.”
So, how does data de-duplication work? Methods vary between
suppliers according to the precise algorithm used, but it is all
essentially data compression, where blocks of data are identified
and, if they recur, the repeats are removed and replaced with a much
shorter reference to the stored block.
On restoration, the software reinstates the original block
wherever a reference indicates.
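The principle can be sketched in a few lines of Python. This is a simplified illustration, not any supplier's algorithm: it assumes fixed-size 4,096-byte blocks and uses a SHA-256 digest as the short reference, and it ignores the space taken by the references themselves. It also assumes two different blocks never share a digest.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products differ


def deduplicate(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into blocks, storing each distinct block only once.

    Returns (store, recipe): store maps digest -> block, and recipe is the
    ordered list of digests needed to rebuild the original data.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        store.setdefault(digest, block)   # keep the first copy only
        recipe.append(digest)             # a short reference replaces the block
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data by looking up each reference in turn."""
    return b"".join(store[digest] for digest in recipe)


# Ten blocks of data, of which only two are distinct.
original = (b"A" * BLOCK_SIZE) * 9 + (b"B" * BLOCK_SIZE)
store, recipe = deduplicate(original)

stored_bytes = sum(len(block) for block in store.values())
ratio = len(original) / stored_bytes

print(f"Blocks in back-up: {len(recipe)}, distinct blocks stored: {len(store)}")
print(f"Reduction ratio (ignoring reference overhead): {ratio}:1")
print("Restored correctly:", restore(store, recipe) == original)
```

The ratio achieved depends entirely on how repetitive the data is, which is why suppliers' claimed figures vary so widely.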
But the similarities end there. Different suppliers’ products
vary a great deal, and if you are thinking of opting for data
de-duplication you will need to consider which supplier’s product
fits your needs.
Some data de-duplication products are hardware appliances
packaged as part of tape or disc libraries, such as Data Domain and
Diligent. Others come as part of back-up programs, such as
Symantec’s incorporation of de-duplication technology into
Enterprise Vault and Netbackup. Some are software-based, such as
Sepaton and FalconStor’s, although the latter also provides
appliances.
Another key distinction is whether these products carry out
de-duplication at the source – such as products from
Avamar (bought by EMC), Symantec and Asigra – or at the
target, such as
Diligent,
FalconStor and
Sepaton.
It is an important distinction if, for example, your aim is to
send back-ups across an IP network. De-duplication at the target often
means two stages in reality – an initial back-up of all data, then a
slimming down as duplicate blocks are removed.
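The difference in network traffic can be illustrated with a small sketch. The exchange shown here is a hypothetical one invented for illustration: the source sends a 32-byte SHA-256 digest for every block, and the block itself only if the target has not seen it. Real products use their own protocols, and the 4,096-byte block size is likewise an assumption.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut the data into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def bytes_sent_target_side(data: bytes) -> int:
    """Target-side: the whole back-up crosses the network before reduction."""
    return len(data)


def bytes_sent_source_side(data: bytes, known_digests: set) -> int:
    """Source-side: send a digest per block, plus only the blocks not yet held.

    known_digests stands in for the target's index and is updated in place.
    """
    sent = 0
    for block in split_blocks(data):
        digest = hashlib.sha256(block).digest()
        sent += len(digest)               # the reference is always sent
        if digest not in known_digests:
            known_digests.add(digest)
            sent += len(block)            # new data is sent once only
    return sent


# Ten blocks, two of them distinct, backed up twice in a row.
backup = (b"A" * BLOCK_SIZE) * 9 + (b"B" * BLOCK_SIZE)
target_index = set()

first = bytes_sent_source_side(backup, target_index)
second = bytes_sent_source_side(backup, target_index)

print("Target-side, every run:", bytes_sent_target_side(backup), "bytes")
print("Source-side, first run:", first, "bytes")
print("Source-side, repeat run:", second, "bytes")
```

In this toy case the repeat back-up sends only the references, which is the effect that makes back-up over an IP link practical.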
Speed of throughput is another area of contention. Disc and tape
libraries run at speeds of up to 500mbps, but not all data
de-duplication technologies match this. Data Domain’s appliances
have a throughput of 100mbps and Diligent has published a rate of
220mbps on its Protectier virtual tape library. Such performance
considerations will determine which products are suited to, say,
the datacentre or the branch office.
Cost is another key differentiator. Gartner has surveyed the
market and found prices for data de-duplication products range from
about £4,500 per terabyte with software products such as Avamar, to
£9,500-£55,000 per terabyte in hardware products such as Data
Domain’s appliance and gateways.
With such a variance of attributes available on the market,
users have to be clear about what they need data de-duplication
for, says Bunn. “The user has to start from the position of working
out what they need it for. There are a lot of products out there
and they work in ways that suit different applications.”
Dave Russell, research vice-president, servers and storage at
Gartner, says, “It depends what implementation approach is right
for you – do you want to de-duplicate at the target or the source?
Will the de-duplication technology work with your environment? And
will that environment hold for the next couple of years?
“There are also a number of different algorithms used and this
can sometimes create data integrity issues.”
Data de-duplication is in some senses an emerging market,
although some products have been around for quite some time. It is
also a market going through upheaval as big storage players buy out
the smaller businesses.
Claus Egge, program director, European storage systems research
at IDC, believes the market will change significantly in the coming
years, but says that is no reason not to commit. “The savings in
terms of data are very impressive, so if you can do it, why not?”
he says.
“But it is very early to say if the market will really take off.
Data de-duplication is achieving some prominence in the media right
now, but that might be to do with the acquisition activity that is
taking place, with suppliers lining up to get taken over by bigger
companies.
“As that takes place we will see data de-duplication become a
feature of existing systems, and in five to seven years it will be
taken for granted.”
Russell believes the market has yet to settle down, but says
benefits can be gained by implementing now.
“It is reasonable to expect that more acquisitions will take
place. Data Domain, Exagrid and Sepaton are all candidates, but
some of these companies are getting so big that the chance of them
being swallowed up is decreasing,” he says.
“At this point the benefits appear to be significant enough and
I do not think users need to wait and see what happens in the
market to make decisions. The key thing is to ensure that the
operating systems, applications and databases in your environment
can be dealt with by the product.
“After that, ask for customer references from businesses of a
similar size to your own using the technology in production, not
testing.”
Case Study: Ordnance Survey
Ordnance Survey has trialled Data Domain data de-duplication
technology and is set to implement two 560DDX 5Tbyte units with a
50:1 de-duplication ratio.
During the trial Ordnance Survey found it achieved 100% data
reliability and could dispense with tape for all
30-day data.
Mastermap is Ordnance Survey’s digital representation of the
real world which contains more than 450 million uniquely identified
geographic features.
It is updated daily as a consistent framework for the
referencing of geographic information in Great Britain and provides
the basis for topographic, road/rail, address and imagery products
sold to organisations.
Mastermap uses a 1.5Tbyte database that sees up to 5,000 changes
a day. This means that with the various data retention periods in
operation it backs up 70Tbytes a day.
Ordnance Survey decided to go down the de-duplication route
after suffering poor data reliability with tape, but the switch has
also simplified the back-up procedure.
“We were multi-streaming data to many hundreds of single tapes,”
says Dave Lipsey, information systems infrastructure manager with
Ordnance Survey.
“We were getting only about 95% reliability and this was the
main reason we opted for data de-duplication. We have had 100%
success in back-ups and restores. With small files we are getting
them done 10 times more quickly, and with large ones double the
speed.”
Lipsey estimates Ordnance Survey is saving about 600 tapes a
year, up to one day a week of staff time and 30 van journeys. It
can decommission its oldest tape library. But the key benefit is in
the intangibles, he says.
“It will take about five years to get return on investment, but
that does not really account for the intangible benefit of full
reliability and reading, writing and being able to restore
quickly,” he says.
“We have saved up to one day a week from back-up operations. It
is completely non-disruptive, works with Netbackup and uses less
space, power and cooling.”