Whether your backup environment is small or large, two metrics to examine are the presence of recurring failures and tape drive performance.
Maintaining a small backup environment can be simple. You tend to become familiar with each individual backup job: when it runs, roughly when it completes and how much data it will secure. You also tend to have good visibility into backup failures, and therefore know which backups are succeeding. Essentially, you have a good grasp of the recovery point for each server you're responsible for.
But larger environments are difficult, if not impossible, to maintain with such intimacy; there's no way you can remember the details of 20,000 backup jobs. You're unlikely to know offhand whether a backup that has failed has done so for the first time or the tenth time. This is why accurate reporting mechanisms are essential.
I was recently informed of an issue whereby some data had become corrupted and required restoration from tape. Sadly, there was no tape copy to be found. The backups were configured and working, but they were also overrunning the backup window and being terminated automatically. It wasn't uncommon for a percentage of backups to be terminated in this fashion each day, so no special attention was paid to this particular client.
Additionally, the reporting mechanisms were based upon a percentage success rate. The environment frequently achieved success rates in excess of 99.7%. However, statistics tell only part of the story. The 0.3% of failures in the scenario above could represent terabytes of data and perhaps as much as 10% of business-critical information.
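To illustrate how a job-count success rate can hide a large exposure, here is a minimal sketch. The `Job` record, the client names and the job sizes are all invented for illustration; they are not figures from the environment described above.

```python
from dataclasses import dataclass


@dataclass
class Job:
    client: str        # hypothetical client name
    size_gb: float     # data the job was meant to secure
    succeeded: bool


def success_by_count(jobs):
    """Percentage of jobs that completed successfully."""
    return 100.0 * sum(j.succeeded for j in jobs) / len(jobs)


def success_by_volume(jobs):
    """Percentage of the data (in GB) that was actually secured."""
    total = sum(j.size_gb for j in jobs)
    secured = sum(j.size_gb for j in jobs if j.succeeded)
    return 100.0 * secured / total


# Assumed workload: 997 small jobs succeed, 3 very large jobs fail.
jobs = [Job(f"small{i}", 10.0, True) for i in range(997)]
jobs += [Job(f"big{i}", 2000.0, False) for i in range(3)]

print(f"{success_by_count(jobs):.1f}% of jobs succeeded")
print(f"{success_by_volume(jobs):.1f}% of data secured")
```

With these assumed numbers, the job-count figure is 99.7% while only around 62% of the data by volume is secured, because the three failing jobs are the largest ones.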
There are two metrics I examine when analysing a backup environment:
- The presence of recurring failures. Most organisations can cope with the odd failure; if there are tight recovery point objectives (RPOs), data can often be recovered from other sources (e.g., database log shipping and subsequent roll forwards). But recurring failures often represent the largest risk. In the situation above, the host with the corrupt data had routinely fallen into the 0.3% of failed backup jobs and, as a result, reached a point whereby restoration was impossible.
- Tape drive performance. If you know from the above that you're securing your data, the next question to ask is "How efficiently are you driving your hardware?" From this, you can deduce how you can get the most from your environment and at which point it needs a capital investment.
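Both metrics can be approximated from data most backup applications already record. The sketch below is one possible approach, not a feature of any particular product: the `(client, date, succeeded)` history format, the threshold of three consecutive failures and the drive figures are assumptions chosen for illustration.

```python
from collections import defaultdict


def failure_streaks(history):
    """Map each client to its number of consecutive failures,
    counted back from its most recent run.

    history: iterable of (client, iso_date, succeeded) in any order.
    """
    runs = defaultdict(list)
    for client, day, ok in history:
        runs[client].append((day, ok))

    streaks = {}
    for client, entries in runs.items():
        streak = 0
        # ISO dates sort chronologically; walk from newest to oldest.
        for _, ok in sorted(entries, key=lambda e: e[0], reverse=True):
            if ok:
                break
            streak += 1
        streaks[client] = streak
    return streaks


def recurring_failures(history, threshold=3):
    """Clients whose current failure streak meets the (assumed) threshold."""
    return {c: n for c, n in failure_streaks(history).items() if n >= threshold}


def drive_utilisation(gb_written, hours_busy, rated_mb_per_s):
    """Achieved throughput as a fraction of the drive's rated speed."""
    achieved_mb_per_s = gb_written * 1000 / (hours_busy * 3600)
    return achieved_mb_per_s / rated_mb_per_s


# Hypothetical history: db01 fails every night after the first,
# web01 fails occasionally but recovers.
history = [
    ("db01", "2024-01-01", True),
    ("db01", "2024-01-02", False),
    ("db01", "2024-01-03", False),
    ("db01", "2024-01-04", False),
    ("web01", "2024-01-01", False),
    ("web01", "2024-01-02", True),
    ("web01", "2024-01-03", False),
    ("web01", "2024-01-04", True),
]

print(recurring_failures(history))
# Assumed drive: 1,800 GB written in 10 busy hours, rated at 120 MB/s.
print(f"{drive_utilisation(1800, 10, 120):.0%} of rated speed")
```

In this example both clients have a 50% or better success rate over the period, yet only db01 is flagged, because its failures are consecutive and its recovery point is getting older every night. A drive running well below its rated speed suggests there is headroom to recover before new hardware is needed.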
There is an ever-growing number of products available that can provide detailed reports on your backup environment, regardless of the backup application. Purchasing the correct tool and configuring it to deliver all of the information your organisation requires is essential. It's important that you don't rely on statistics as the sole indicator of an environment's success because that approach doesn't highlight recurring problems.
About the author: David Boyd is a senior consultant at GlassHouse Technologies (UK), a global provider of IT infrastructure services. Boyd has more than seven years of experience in backup and storage, with a major focus on designing and implementing backup solutions for blue chip companies.