Tip

Inline deduplication or post-process: Which one is right for your environment?

By Antony Adshead, UK Bureau Chief

When selecting a data deduplication product, one of the main choices you'll need to make is whether to go for inline deduplication or post-process deduplication. The answer depends on a number of variables.

Data deduplication is the removal of duplicate data blocks and their replacement with a pointer to the first iteration of that block. Because every block has to be checked against those already stored, data deduplication's key challenge is its processing overhead, especially in the case of large data sets.
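To make the block-and-pointer idea concrete, here is a minimal Python sketch. It assumes fixed-size blocks and uses a SHA-256 digest as the pointer; the block size, the function names and the sample data are illustrative only, and real products may chunk and index data quite differently.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, for illustration only


def deduplicate(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and store each unique block once.

    Returns (store, pointers): store maps a block's SHA-256 digest to its
    bytes; pointers lists one digest per original block, in order.
    """
    store = {}
    pointers = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        # Keep only the first iteration of each block; later copies
        # become a pointer (the digest) to that stored block.
        store.setdefault(digest, block)
        pointers.append(digest)
    return store, pointers


def restore(store, pointers) -> bytes:
    """Rebuild the original data by following each pointer."""
    return b"".join(store[d] for d in pointers)


data = b"AAAABBBBAAAACCCCAAAA"
store, pointers = deduplicate(data)
print(len(pointers), "blocks written,", len(store), "unique blocks stored")
```

The hashing and lookup performed for every block is the processing overhead referred to above.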

When to use inline deduplication and post-process deduplication
Inline deduplication is better when:

  • There is a restriction on disk capacity at the target.
  • Your backups regularly contain large amounts of redundant data.
  • You face restrictions on bandwidth between locations, such as to disaster recovery sites.

Post-process deduplication is better when:

  • There is no obstacle to investing in the disk capacity required.
  • You want to copy data from disk to tape soon after backup.
  • You want to minimise the impact on your existing environment.

Inline deduplication and post-process deduplication are defined by when they process data to remove duplicated elements. Let's look at how each one works and which scenarios each is best suited to.

Inline deduplication requires less disk space

Inline deduplication looks for duplicate blocks of data as the data is ingested to the target device.

This method of data deduplication requires less disk space than post-process deduplication because duplicate data is removed as it enters the system. The drawback is that deduplication processing at this point creates a bottleneck that can lengthen the backup window.

But that same attribute can bring advantages. If, for example, your business regularly backs up large quantities of data that contain many duplicate blocks that don't change, then inline deduplication could be best.

Inline deduplication products recognise redundant data as it comes in from several different backup data streams and won't forward the full block to the target media if they identify it as a duplicate. Performance will usually improve over time as the data deduplication product becomes familiar with the data set it has to work with, and data reduction ratios will improve toward an optimal maximum.
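As a rough illustration of that behaviour, the Python sketch below models a hypothetical inline target that checks each block against its index before writing it. The class, the stream names and the nightly workload are all invented for the example; it is a simplified model, not how any particular product works.

```python
import hashlib


class InlineDedupTarget:
    """Hypothetical backup target that deduplicates blocks as they arrive."""

    def __init__(self):
        self.index = {}          # digest -> stored block
        self.bytes_ingested = 0  # logical bytes sent by backup clients
        self.bytes_stored = 0    # physical bytes written to disk

    def ingest(self, block: bytes) -> bool:
        """Process one block; return True if it had to be written."""
        self.bytes_ingested += len(block)
        digest = hashlib.sha256(block).digest()
        if digest in self.index:
            return False  # duplicate: only a pointer would be recorded
        self.index[digest] = block
        self.bytes_stored += len(block)
        return True

    def ratio(self) -> float:
        """Data reduction ratio: logical bytes per physical byte."""
        return self.bytes_ingested / self.bytes_stored


target = InlineDedupTarget()
base = [f"block-{i:04d}".encode() for i in range(100)]
ratios = []
for night in range(1, 4):
    # Each night, two streams send the same 100 blocks plus one new block.
    for stream in ("server-a", "server-b"):
        for block in base + [f"new-{night}".encode()]:
            target.ingest(block)
    ratios.append(target.ratio())
    print(f"night {night}: reduction ratio {ratios[-1]:.2f}:1")
```

In this toy workload the ratio climbs each night because most incoming blocks are already in the index, which is the effect described above.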

Post-process deduplication speeds backup

Post-process deduplication also looks for duplicated data blocks and replaces them with a pointer to the first iteration of that block. But unlike inline deduplication, post-process deduplication doesn't begin processing backup data until after it has all arrived at the backup target. So, if you regularly back up large amounts of redundant data, you'll be using resources to pump it all into the backup target unreduced.

That means that a primary requirement of post-process data deduplication is that there's enough disk capacity to store the largest potential backup your business is likely to carry out.
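The capacity requirement can be seen in a small Python sketch of the two stages. It is a simplified model with an invented function name and sample data, and it ignores the extra space the deduplicated store occupies while the second pass is still running.

```python
import hashlib


def post_process_backup(blocks):
    """Land every block unreduced, then deduplicate in a separate pass.

    Returns (peak_bytes, final_bytes).
    """
    # Stage 1: the backup window. Data lands as-is, with no dedup work.
    landing = list(blocks)
    peak_bytes = sum(len(b) for b in landing)  # disk must hold the full backup

    # Stage 2: runs only after the whole backup has arrived.
    store = {}
    for block in landing:
        store.setdefault(hashlib.sha256(block).digest(), block)
    final_bytes = sum(len(b) for b in store.values())
    return peak_bytes, final_bytes


# Hypothetical backup: 90 identical 1 KB blocks plus 10 distinct ones.
blocks = [b"x" * 1024] * 90 + [bytes([i]) * 1024 for i in range(10)]
peak_bytes, final_bytes = post_process_backup(blocks)
print(f"peak on disk: {peak_bytes} bytes, after dedup: {final_bytes} bytes")
```

However well the data reduces afterwards, the target has to be sized for the peak figure, not the final one.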

If you want to minimise backup times, post-process deduplication could be your best bet. Post-process deduplication has no impact on the length of the backup window. Backup operations are completely unaffected, with data deduplication carried out once the data set is on the target disk.

However, if you need to replicate deduped data to an offsite location for disaster recovery (DR), it may work out better to go with inline deduplication instead of sending un-deduplicated data across the wide-area network (WAN) and then dealing with it post-process.
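A back-of-the-envelope calculation shows why. The Python sketch below uses purely hypothetical figures (a 2,000 GB backup, a 10:1 reduction ratio and a 100 Mbit/s link) and assumes the link runs at its full rated speed with no protocol overhead.

```python
def transfer_hours(size_gb: float, link_mbps: float) -> float:
    """Hours needed to move size_gb (decimal gigabytes) over a link_mbps link.

    Assumes the link runs at its full rated speed with no protocol overhead.
    """
    megabits = size_gb * 8 * 1000  # 1 GB = 8,000 megabits
    return megabits / link_mbps / 3600


backup_gb = 2000   # hypothetical nightly backup
dedup_ratio = 10   # hypothetical 10:1 reduction
link_mbps = 100    # hypothetical WAN link to the DR site

raw_hours = transfer_hours(backup_gb, link_mbps)
deduped_hours = transfer_hours(backup_gb / dedup_ratio, link_mbps)
print(f"un-deduplicated: {raw_hours:.1f} h, deduplicated: {deduped_hours:.1f} h")
```

Substitute your own backup size, measured reduction ratio and link speed to see whether un-deduplicated replication would fit your window.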

Impact of data deduplication on tape backups

Post-process deduplication can be the better choice when you plan to copy data from disk to tape soon after backup. The copy can be made more easily if you have the un-deduplicated data set to hand, as you would in a post-process deduplication scenario.

It's usually recommended that data not be copied to tape in its deduplicated state because of potential problems that can occur when attempting to restore data sets in which the original iterations of data blocks are scattered across a number of tapes.
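The restore problem can be pictured with a toy model in Python. The block names, tape labels and layout below are entirely hypothetical; the sketch simply assumes each unique block lives on the tape where it was first written.

```python
def tapes_needed(block_locations, backup_blocks):
    """Return the sorted tape labels that must be mounted to restore a backup."""
    return sorted({block_locations[b] for b in backup_blocks})


# Hypothetical layout: unique blocks were written to whichever tape
# was in use when each block first appeared.
block_locations = {
    "b1": "TAPE-01",
    "b2": "TAPE-01",
    "b3": "TAPE-02",
    "b4": "TAPE-03",
    "b5": "TAPE-04",
}

# The latest backup mostly points at blocks first seen in earlier backups.
latest_backup = ["b1", "b3", "b4", "b5"]
print(tapes_needed(block_locations, latest_backup))
```

Here a four-block restore needs four different tapes, whereas an un-deduplicated copy of the same backup would normally be written as one contiguous data set.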

For the same reason, if your environment contains a mix of disk and tape, post-process deduplication may be better suited to your purposes. A data deduplication method that carries out its work as a discrete stage causes less disruption to backup and archiving processes than one that incorporates it into backups, as inline deduplication does.

 

This was first published in March 2010

 
