LAS VEGAS -- Data deduplication is the hot technology of the
moment, and several users at the west coast Storage Decisions
conference said Wednesday that they were investigating
deduplication, but so far, they're not buying it.
"We've passed by the last two or three storage fads, including
SRM and virtualization," said Brian Peterson, a
storage architect for ConAgra Foods, Inc. in Omaha, Neb. "It's
always interesting, but we prefer to wait out the 'curve of
disillusionment'."
Among the disillusionments with data deduplication products in
general, according to experts, is performance. Data Domain's boxes
have a throughput of 100 MBps, and though they can be aggregated
under one management console, the boxes themselves don't
share data on the back end. Diligent advertises a throughput
rate of 220 MBps on its ProtecTier virtual tape library (VTL).
Compared with tape libraries and VTLs, which top out around 500
GBps, that's dismal performance, according to Curtis Preston, vice
president of data protection services for GlassHouse Technologies
Inc. "100 megabytes per second is not up to par for an enterprise
backup target, though it can be fine for a remote office," Preston
said. He added that on boxes like Diligent's and Data Domain's,
which perform deduplication "inline" -- before data is written to
disk -- a single stream can run as low as 40 MBps.
"Our goal for backup throughput is 1 terabyte per hour," said
Peterson. He said ConAgra does full daily backups of critical
databases to the tune of 20 to 30 TB per day, spread across
several VTLs (Peterson declined to name the vendor). "We do about
450 MBps in each stripe across the VTLs. Anything that wants to fit
into our backup environment has to perform at SAN speeds."
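
To put the quoted figures on a common scale, the following is a rough back-of-the-envelope sketch in Python. It assumes decimal units (1 TB = 1,000,000 MB) and a single sustained stream with no overhead, neither of which the article specifies; the function names are illustrative only. Under those assumptions, a goal of 1 TB per hour works out to roughly 278 MBps, while a 100 MBps box moves about 0.36 TB per hour.

```python
# Back-of-the-envelope throughput arithmetic using figures quoted in the article.
# Assumption: decimal units (1 TB = 1,000,000 MB) and sustained, overhead-free rates.

MB_PER_TB = 1_000_000
SECONDS_PER_HOUR = 3600


def tb_per_hour(mb_per_sec: float) -> float:
    """Convert a sustained rate in MBps to TB per hour."""
    return mb_per_sec * SECONDS_PER_HOUR / MB_PER_TB


def mb_per_sec_needed(target_tb_per_hour: float) -> float:
    """Sustained MBps required to hit a TB-per-hour target."""
    return target_tb_per_hour * MB_PER_TB / SECONDS_PER_HOUR


def hours_to_back_up(total_tb: float, mb_per_sec: float) -> float:
    """Hours needed to move total_tb at a sustained MBps rate."""
    return total_tb / tb_per_hour(mb_per_sec)


if __name__ == "__main__":
    print(f"1 TB/hour goal needs about {mb_per_sec_needed(1):.0f} MBps")
    for label, rate in [("100 MBps box", 100),
                        ("220 MBps VTL", 220),
                        ("450 MBps stripe", 450)]:
        print(f"{label}: {tb_per_hour(rate):.2f} TB/hour, "
              f"{hours_to_back_up(20, rate):.1f} hours for 20 TB")
```

By this arithmetic, a single 100 MBps device would need more than two days to absorb a 20 TB daily backup, which is why the comparison only makes sense when several devices or stripes run in parallel.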
Another user, David Silvestri, who asked that his company not be
named, said he was still trying to figure out where data
deduplication would fit into his environment from the opposite
angle -- in some cases, his remote offices are too small for it to
be practical, he said. "For 500 GB of data, there's no point to
having a Data Domain and backup software installation at my remote
site to back it up," he said. "There's a possibility that at least
15 of my remote sites are large enough to exist on their own, but
not to support a Data Domain box." For those sites, Silvestri said,
he was evaluating WAFS from Riverbed Technology Inc.
Another concern until recently, Silvestri said, was Data
Domain's product features. "We looked at them last year, when they
had just come out, and we weren't ready to jump on board right away
-- but now they have a lot of installations and built-in
replication, and that's a lot more appealing."
Concerns over data integrity
Users are also apprehensive about introducing another product
into their environment that touches the data and could cause
corruption. "Most of our growth is in databases, critical databases
where we'd be concerned about anything in the background
manipulating the data," said Jim Norris, technical specialist with
Worldspan LP, a provider of travel technology services for travel
suppliers, travel agencies, and e-commerce sites.
According to Norris, the company backs up 2 TB per day, 1.5 TB
of it generated by databases. "It's our impression that any data
deduplication product should be aware of the backup software and
vice versa," Norris said. "For example, what if Tivoli Storage
Manager (TSM) sees the stub and backs up a file twice?"
Norris added, "DBAs want to keep multiple copies of the data --
in fact, there are policies and regulations that dictate that there
have to be full copies of the database, not even snapshots, for
data protection purposes."
"I have two fears [about deduplication]," ConAgra's Peterson
said. "One is that there's software keeping you from your data --
and if that software breaks, you lose all your data with it."
The other, he said, is that condensing the number of copies of
data on the back end makes the remaining single instance of the
data that much more crucial -- and in his mind, that much more
vulnerable. "It's like an anvil with all its weight on one small
point," he said. "If the dedupe product goes down, you don't just
lose one backup -- you lose them all."
A lack of evidence
According to Preston, there is another approach to data
deduplication, taken by Sepaton and FalconStor, which deduplicate
data "post-process," or after it's written to disk. This can solve
the issue of performance during writes to disk, but it can also, at
least theoretically, pose a problem when deduplication is performed
in the background while other backups are being written.
"It all comes down to the implementation," Preston said -- and
publicly discussed implementations of these products are still
relatively rare. "It's not even that the jury's out -- basically,
we haven't even seen the evidence yet."
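
The difference between the inline and post-process approaches can be sketched in a few lines of Python. This is a toy model, not any vendor's implementation: the fixed 4 KB chunk size, the SHA-256 fingerprints and the class names (`ChunkStore`, `InlineStore`, `PostProcessStore`) are all assumptions made for illustration. The inline store hashes every chunk in the write path, while the post-process store lands the raw backup first and deduplicates in a later pass.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size, chosen only for illustration


def split_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Yield fixed-size chunks of data (the last one may be shorter)."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


class ChunkStore:
    """Shared bookkeeping: one stored copy per unique chunk, plus per-backup recipes."""

    def __init__(self):
        self.blocks = {}   # SHA-256 hex digest -> chunk bytes (single stored instance)
        self.recipes = {}  # backup name -> ordered list of digests

    def _deduplicate(self, name: str, data: bytes) -> None:
        recipe = []
        for chunk in split_chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            self.blocks.setdefault(digest, chunk)  # store only if not seen before
            recipe.append(digest)
        self.recipes[name] = recipe

    def _rebuild(self, name: str) -> bytes:
        return b"".join(self.blocks[d] for d in self.recipes[name])


class InlineStore(ChunkStore):
    """Inline: deduplication happens before anything is written."""

    def write(self, name: str, data: bytes) -> None:
        self._deduplicate(name, data)  # hashing cost is paid during the backup

    def restore(self, name: str) -> bytes:
        return self._rebuild(name)


class PostProcessStore(ChunkStore):
    """Post-process: raw data lands first, deduplication runs later."""

    def __init__(self):
        super().__init__()
        self.staging = {}  # backup name -> raw bytes awaiting deduplication

    def write(self, name: str, data: bytes) -> None:
        self.staging[name] = data  # fast path: no hashing during the backup

    def run_dedupe_pass(self) -> None:
        for name in list(self.staging):
            self._deduplicate(name, self.staging.pop(name))

    def restore(self, name: str) -> bytes:
        if name in self.staging:
            return self.staging[name]
        return self._rebuild(name)
```

In this sketch, two nightly backups that share a chunk end up referencing the same stored block, so the post-process store temporarily needs room for the full raw backups while the inline store never does. The single shared block is also the concentration of risk users described: lose it, and every backup whose recipe points to it can no longer be rebuilt.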
Another point being raised in the dedupe wars is the issue of
forward vs. reverse referencing. Most of the dedupe players,
Preston said, do reverse referencing, which looks at each block up
front, compares it against data already in the backup repository,
and keeps or discards it depending on whether it matches. But
forward referencing -- which currently is done only by Sepaton --
saves the most recent blocks and then compares them against older
data, a process Sepaton says improves restore times, since the most
recent backup version is kept freshest in its VTL.
But again, Preston said, "It remains to be seen if that is
true."
"I haven't really done enough research yet on Avamar," ConAgra's
Peterson said. "I find the idea of not sending duplicate data over
the wire appealing, but I haven't really dug into it yet."
"I'm not sure how EMC is going to use [Avamar]," said Mk Mokel,
an EMC user who asked that his company, a bank based on the East
Coast, not be named for legal reasons. "Whether or not we implement
it depends on if they leave it as a standalone product or as a
modular feature we can just add in."
Data Domain officials confirmed the 100 MBps performance figure
and said performance updates were on the way in the next year;
Diligent reps said adding clustered servers to the ProtecTier VTL
to boost performance is on the product's roadmap.