More and more storage professionals are looking to implement disk-based data deduplication, driven chiefly by the need to keep backups within windows. Increasingly, the inability to stream to tape quickly enough is leaving businesses exposed or is eroding network bandwidth as data protection creeps further into the working day.
However, data deduplication backup has yet to make real headway among the vast bulk of the user community. It has made a lot of headlines, but IT departments are bound by budgetary buying cycles and do not rush into new technology implementations without serious thought about the implications.
Consequently, most UK storage and backup professionals are at a look-and-evaluate stage where they are coming to grips with what data deduplication backup can do and exploring the differences between data dedupe products. That's the assessment of Tony Lock, programme director with analysts Freeform Dynamics.
According to Lock, businesses that have implemented data deduplication already are not confined to any particular sector. "It depends on need," Lock says, "although clearly some verticals, such as financial services, have the resources to begin early research into the technology. In general the UK market lags a little behind the US as that is a market more constrained by external factors such as compliance and, because most of the vendors are there, they start selling there sooner."
The convenience of writes and reads to disk and the power of data reduction offered by deduplication mean that UK businesses can now remove tape from the equation as nearline storage, says Clive Longbottom, service director, business process analysis, with analysts Quocirca.
"The main reason for the increasing rejection of tape is the low cost of disk and the use of virtualization, making logical tape based on disk a far better bet," says Longbottom. "Backing up to disk rather than tape is far faster, and so can manage far greater volumes in less time. Combine it with dedupe, and all of a sudden, backups that were taking more than a day now take an hour or so."
Backup is data deduplication's killer app. Because data deduplication can bring data reduction ratios of 10 times, 20 times, 30 times or more -- depending on the types of data being processed -- it can speed backup to disk hugely, with the added bonus of allowing far more information to be kept on disk and so shortening recovery time and stretching recovery point further back in time.
That was the experience of surfwear manufacturer O'Neill Europe, which cut backup times from 14 hours to two, slashed restore times, extended on-site retention to a year and made possible full backups of data where previously it had to select the most important. It made these gains by implementing a disk-to-disk-to-tape strategy with one Data Domain DD565 at the firm's headquarters plus four DD510s at other sites across Europe. O'Neill is backing up 57 TB of data to just 3 TB of disk space and getting a dedupe ratio of 18:1.
Peter Maljaars, global IT service and infrastructure manager at O'Neill Europe says, "We're actually now backing up more data than we did before -- 5.3 TB instead of 1.4 TB. Before, we just couldn't back everything up -- it would have taken 20 or 30 hours if we included all the image files, which are important but we can do without. We used to have to decide what was most important. Now we can back everything up and don't have to make that decision."
Besides the core benefits of reducing backup times and enhancing RPOs and RTOs, data deduplication backup also reduces dependence on tape. This not only means that less has to be spent on tape equipment and media but also removes the need for hands-on human management of tapes. That is a common source of data retention anguish -- we've all heard scary stories about the security guard, admin person or salesperson at the branch office forgetting to take the tape out or being off sick.
Because the amount of data that has to be moved is drastically reduced by data deduplication, the potential load on the WAN link is also lowered, meaning opportunities for off-site backup, replication and disaster recovery come into view.
But can you benefit from data deduplication? To get an answer to that question it pays to look under the bonnet and see how data deduplication backup works. Results vary a lot depending on your environment and the product used.
Essentially, data deduplication is a form of data reduction: duplicate blocks of data are removed and replaced with a pointer to the original instance, with duplicates identified by a fingerprint generated by a mathematical algorithm during inspection of files and their component parts. For that reason data deduplication is, at its most basic level, suited to data types that contain a lot of repeated patterns, such as database files and email data.
Conversely, data formats that contain little in the way of repeated information offer little scope for impressive deduplication ratios because there is not much redundancy to be eliminated. Achievable ratios vary from 10 times to 50 times or more, depending on the data being processed.
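The block-and-pointer mechanism can be sketched in a few lines of Python. This is a simplified model, assuming fixed-size 4 KB blocks and SHA-256 fingerprints; real products differ in how they chunk and fingerprint data:

```python
import hashlib

def deduplicate(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep only one copy of each
    unique block; the stream becomes an ordered list of pointers."""
    store = {}      # fingerprint -> unique block contents
    pointers = []   # ordered fingerprints that reconstruct the stream
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)  # only the first instance is kept
        pointers.append(fingerprint)
    return store, pointers

def restore(store, pointers) -> bytes:
    """Rebuild the original stream by following the pointers."""
    return b"".join(store[p] for p in pointers)

# Repetitive data dedupes well: 30 blocks collapse to 2 unique ones here.
data = b"A" * 4096 * 20 + b"B" * 4096 * 10
store, pointers = deduplicate(data)
ratio = len(data) / sum(len(b) for b in store.values())
assert restore(store, pointers) == data
print(f"{len(pointers)} blocks, {len(store)} unique, ratio {ratio:.0f}:1")
```

Feed the same function random, incompressible data and the store ends up nearly as large as the input, which is why ratios vary so widely by data type.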
Deduplication achieves higher ratios over time. The technology relies on being able to spot repeated patterns, so -- assuming some homogeneity of data type -- it will shrink backups far more after several weeks than it will after only a few days. Over a shorter period it simply doesn't have the opportunity to point to already existing identical patterns of zeros and ones. So, if your data types don't vary too much, you will achieve good ratios. If not, then it may not be for you. With data deduplication it's a case of your-mileage-may-vary, Freeform Dynamics' Lock says.
"It all comes down to the state of the data you have," Lock says. "Some organisations will have lots of duplicate data – 47 instances of one PowerPoint presentation emailed to 47 team members, for example – and deduplication can cut this to one with 46 stubs pointing to the original. The length of time over which data is deduplicated will also bring variations in results, especially as deduplicated data will be able to be kept for longer on disk and so will make the process of eliminating redundancy more efficient."
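The improvement in ratios over time can be illustrated with a toy simulation. The numbers here are hypothetical: a working set of 4 KB blocks, roughly 5% of which change each day, backed up in full daily into a shared block store:

```python
import hashlib
import random

BLOCK = 4096
random.seed(0)

store = {}    # block fingerprint -> size kept on disk (unique blocks only)
logical = 0   # cumulative size of all full backups taken

data = bytearray(random.randbytes(256 * BLOCK))  # ~1 MiB working set
ratios = {}
for day in range(1, 31):
    # roughly 5% of blocks change each day before the full backup runs
    for _ in range(13):
        i = random.randrange(256) * BLOCK
        data[i:i + BLOCK] = random.randbytes(BLOCK)
    # a full backup: every block is offered, but only new blocks cost space
    for i in range(0, len(data), BLOCK):
        block = bytes(data[i:i + BLOCK])
        store.setdefault(hashlib.sha256(block).digest(), len(block))
        logical += len(block)
    if day in (1, 7, 30):
        ratios[day] = logical / sum(store.values())

for day, r in ratios.items():
    print(f"day {day:2d}: cumulative dedupe ratio {r:.1f}:1")
```

The first backup dedupes hardly at all because every block is new; by day 30 most of each daily full is already in the store, so the cumulative ratio has climbed steadily.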
If your data profile means the technology is something you can profitably use, there are then further questions to ask. These are mainly about what type of data deduplication tools are best for you.
For now, nearly all data deduplication products are either backup applications, virtual tape libraries or NAS boxes.
There are also isolated and limited examples of deduplication in primary storage. Most experts think that primary storage for high-transactional data simply cannot stand the invasive processing load that deduplication entails.
Data deduplication products come as hardware appliances and as software. While software data deduplication products are less expensive than data dedupe hardware appliances, they put a heavier load on host CPUs and are more difficult to maintain, given their likely dependencies on other software, such as backup applications, and on changes in their immediate environment. Software products generally run at the backup server, which means they can cut bandwidth requirements between the backup server and the target storage.
Another key product differentiator is between inline and post-process (or out-of-band) data deduplication. The former deduplicates data as it passes through the dedupe product, before it reaches the target disk -- so needs less disk capacity -- while the latter lands the data in full and then deduplicates it, and so requires more disk.
"It's a simple choice of speed versus space," says Chris Reid, managing consultant with integrator Morse. "Inline deduplication requires the least space because excess data is stripped away as it arrives on the system, but this is a slower process, which impacts backup windows. Post process deduplication is faster but requires much more space. By taking a disk backup and then deduplicating the data it places less strain on the system during the task and also ensures that there is at least one full copy of the last backup on disk, which can help with restoration."
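Reid's space-versus-speed trade-off can be sketched as a toy model. This is illustrative only, assuming fixed-size blocks and simple fingerprint comparison, with "peak disk" standing in for the staging space each approach needs:

```python
import hashlib

BLOCK = 4096

def unique_size(blocks):
    """Total size of the distinct blocks in a backup stream."""
    return sum(len(b) for b in {hashlib.sha256(b).digest(): b
                                for b in blocks}.values())

def inline_backup(blocks):
    """Dedupe each block before it touches disk: peak usage == deduped size."""
    seen = set()
    disk = 0
    for b in blocks:
        h = hashlib.sha256(b).digest()
        if h not in seen:
            seen.add(h)
            disk += len(b)
    return disk, disk   # (peak disk, final disk)

def post_process_backup(blocks):
    """Land the full stream first, dedupe afterwards: peak == full size."""
    landed = sum(len(b) for b in blocks)  # staging area holds everything
    final = unique_size(blocks)           # redundancy stripped later
    return landed, final

# 50 identical blocks plus 10 distinct ones
blocks = [b"X" * BLOCK] * 50 + [bytes([i]) * BLOCK for i in range(10)]
print("inline       (peak, final):", inline_backup(blocks))
print("post-process (peak, final):", post_process_backup(blocks))
```

Both approaches end at the same final footprint; the difference is that post-process must briefly hold the entire undeduplicated backup, which is the extra space Reid describes, in exchange for faster ingest and a full copy of the last backup on disk.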
Testing data deduplication products is vital. Since deduplication products are affected by data type and environment, realistic testing of the types of backups and restores that your environment is likely to bring is the only way to assess products, Lock says.
"You need to assess whether the product will do what you want it to in your environment and with your data," Lock says. "So, you need to test with data sets that are analogous to those that it would encounter in a live situation and over a period of time that can allow the deduplication process to begin to gain traction. Also, when comparing products you need to ensure you're running horses over the same course, so subject all products to the same tests."