
Backup failure: Four key areas where backups go wrong

We look at the key ways that backups can fail – via software issues, hardware problems, trouble in the infrastructure and good old human error – and suggest ways to mitigate them

Despite the prevalence of many forms of data protection, ranging from local RAID through snapshots and replication to copies kept in the cloud, backup is still fundamental to IT.

That’s because whatever other methods you use, such as snapshots or replication, any corruption to the data is copied along with it. So you need a good library of backups, going back as far as possible, to roll back to.

But backups fail – and, according to a recent survey, the failure rate is a staggeringly high 37%.

Backups fail for a variety of reasons, and in this article we survey the key causes. Some, such as hard drive failure, are unexpected and not particularly avoidable, but can be mitigated.

Others, such as issues that arise after patching or changes to other configurations, can be anticipated and mitigated.

Then there is the human element, such as making sure you set up backups correctly and knowing how your backup software works.

Media failures

Hardware is an awkward fact of life in IT. Awkward because it can fail. For hard drives, failure can come unexpectedly, although for spinning disk it is known that roughly one in 100 will fail each year.

So, backups can fail because drives fail, but that can and should be mitigated by redundancy, such as RAID. Solid-state drives (SSDs) can fail too. They do so at a lower rate than hard disk drives (HDDs), but have a more limited lifespan. Here, again, the key is to build in hardware redundancy and refreshes.
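To put the one-in-100 figure in context, here is a rough sketch of how the chance of at least one drive failing in a year grows with the number of drives. It assumes failures are independent and that every drive has the same annual failure rate, which is a simplification. The function name is ours, for illustration only.

```python
def prob_at_least_one_failure(drives: int, annual_failure_rate: float = 0.01) -> float:
    """Chance that one or more of `drives` fails within a year.

    Assumes independent failures and an identical rate per drive
    (a simplification; real-world failures can be correlated).
    """
    if drives < 0:
        raise ValueError("drives must be non-negative")
    if not 0.0 <= annual_failure_rate <= 1.0:
        raise ValueError("annual_failure_rate must be between 0 and 1")
    # P(at least one failure) = 1 - P(no drive fails)
    return 1.0 - (1.0 - annual_failure_rate) ** drives


for n in (1, 12, 60):
    print(f"{n:>3} drives: {prob_at_least_one_failure(n):.1%}")
```

Under these assumptions, a 12-drive array has roughly an 11% chance of losing at least one drive in a year, and a 60-drive estate roughly 45%, which is why redundancy is worth planning for rather than hoping to avoid.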

Tape has its own failure mechanisms, centred on the effects of time and use on magnetic media itself and its relationship to reading heads.

Manufacturers’ instructions for tape retention and maintenance should be followed, and peculiarities of the media should be noted – reading heads and media, for example, can wear together in ways that may not show until you want to recover data to different equipment.

The big takeaway when it comes to avoiding hardware failures that affect backups is to build in redundancy, including through a scheme such as the 3-2-1 backup rule.
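The 3-2-1 rule is usually stated as: keep three copies of the data (counting the primary), on two different types of media, with one copy off-site. As a minimal sketch, an inventory of copies could be checked against that rule as follows. The `BackupCopy` type, its fields and the example names are hypothetical, not part of any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str       # hypothetical label for the copy
    media: str      # e.g. "disk", "tape", "cloud"
    offsite: bool   # True if held away from the primary site


def check_3_2_1(copies: list[BackupCopy]) -> list[str]:
    """Return a list of rule violations; an empty list means 3-2-1 is met."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than three copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than two media types")
    if not any(c.offsite for c in copies):
        problems.append("no off-site copy")
    return problems


inventory = [
    BackupCopy("primary", "disk", offsite=False),
    BackupCopy("local-backup", "disk", offsite=False),
    BackupCopy("tape-vault", "tape", offsite=True),
]
print(check_3_2_1(inventory) or "3-2-1 satisfied")
```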

Software issues

Software issues cover a wide gamut of potential problems that can affect backup. One of the most common sources of backup failure is changes brought about by upgrades or patching, which cause problems the next time a backup runs.

That can be because upgrades or patches, which often comprise very large numbers of changes to software, can create incompatibilities with the backup configuration. These can include changes to applications that mean something is now unsupported somewhere in the stack, and security updates that change or reset settings and so make it impossible for backups to connect.

The key method of mitigation is to be aware that updates are set to take place and to be ready for the type of disruptions to backups – and elsewhere – that can occur. Some suppliers’ predictive analytics platforms may help by being able to foresee potential issues with particular configurations of update and software installed.
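One way to be ready for settings that change or reset during an update is to record the settings your backups depend on beforehand and compare them afterwards. The sketch below shows the comparison only. How the settings are collected is left out, and the setting names are invented for illustration.

```python
MISSING = "<missing>"  # marker for a setting absent on one side


def diff_settings(before: dict, after: dict) -> dict:
    """Map each changed, added or removed setting to an (old, new) pair."""
    changes = {}
    for key in sorted(before.keys() | after.keys()):
        old = before.get(key, MISSING)
        new = after.get(key, MISSING)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical snapshots taken before and after a patch window.
pre_patch = {"backup_port_open": True, "service_account": "svc-backup", "agent_version": "1.0"}
post_patch = {"backup_port_open": False, "service_account": "svc-backup", "agent_version": "1.1"}

for setting, (old, new) in diff_settings(pre_patch, post_patch).items():
    print(f"{setting}: {old!r} -> {new!r}")
```

Anything that shows up in the diff is a candidate explanation if the next backup job fails to connect.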

Sometimes backup software itself can fail. Issues can include services associated with the application failing to run, agents failing to install correctly, connection problems, read/write errors and even things such as daylight saving time changes affecting backup window settings. Here you need to check the supplier’s support resources for solutions.
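The daylight saving point is easy to underestimate. As an illustration, the sketch below takes a backup window defined in local time as 00:30 to 04:30 and measures it on the night the clocks went forward in the UK (28 March 2021). The window times and the choice of zone are our assumptions, and the code needs Python 3.9 or later with time zone data available on the system.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def window_elapsed(start_local: datetime, end_local: datetime) -> timedelta:
    """Real elapsed time between two timezone-aware local datetimes."""
    if start_local.tzinfo is None or end_local.tzinfo is None:
        raise ValueError("both datetimes must be timezone-aware")
    # Convert to UTC first so any clock change inside the window is counted.
    return end_local.astimezone(timezone.utc) - start_local.astimezone(timezone.utc)


london = ZoneInfo("Europe/London")
start = datetime(2021, 3, 28, 0, 30, tzinfo=london)
end = datetime(2021, 3, 28, 4, 30, tzinfo=london)

# Subtracting datetimes that share a tzinfo compares wall-clock readings only.
wall_clock = end - start
elapsed = window_elapsed(start, end)

print("wall-clock length:", wall_clock)
print("real elapsed time:", elapsed)
```

The wall-clock length is four hours, but only three hours actually elapse, so a job sized for a four-hour window would overrun on that night.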

Virtualised environments can bring their own particular problems. The creation, migration and decommissioning of virtual machines (VMs) and their data necessarily involves many changes and backup software needs to keep track of a potentially very complex landscape.

Failures surrounding backup can be caused by corrupt catalogues, insufficient permissions and things such as Volume Shadow Copy Service (VSS) failures and virtual hard disk (VHD) corruption.

The emergence of widespread use of containers is likely to bring its own further complications due to their rapidly moving lifecycles.

Human error

It’s a basic fact that humans are responsible for overseeing the deployment and operation of backup processes, no matter how automated, so there’s always scope for human error in the process. The key is to reduce the likelihood of it affecting your backups.

Configuration of backups, and knowledge of the backup product(s) in use and the tools they include that can help automate tasks, are the starting point. Getting configuration right and knowing how to use built-in tools to discover data sets, applications, services and other dependencies is key to successful backups – and, perhaps more importantly, to successful restores.

After all, backup is nothing without the ability to recover data, whether that’s a single file or an entire system. It’s more likely in the latter case that you’ll need to be aware of critical dependencies and to have ensured they are protected and restorable.

Here, the supplier may have some useful discovery tools, but be careful that you know what they may not have discovered in terms of dependency. A core application may have dependencies such as access control that will be vital to getting it running again, for example.
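To make the dependency point concrete, the sketch below walks a hand-maintained dependency map and reports anything an application needs that is not in the set of protected components. The map, the component names and the function names are all hypothetical, and a real environment would need its own way of building the map.

```python
def required_components(app: str, depends_on: dict) -> set:
    """Everything reachable from `app` in the dependency map, including `app`."""
    seen = set()
    stack = [app]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited; also guards against circular dependencies
        seen.add(node)
        stack.extend(depends_on.get(node, ()))
    return seen


def unprotected(app: str, depends_on: dict, backed_up: set) -> list:
    """Components `app` needs, directly or indirectly, that are not backed up."""
    return sorted(required_components(app, depends_on) - set(backed_up))


# Hypothetical example: the core application relies on access control,
# which in turn relies on a directory service.
depends_on = {
    "erp": ["database", "access-control"],
    "access-control": ["directory"],
}
backed_up = {"erp", "database"}

print(unprotected("erp", depends_on, backed_up))
```

Here the application and its database are protected, but the access control layer and the directory behind it are not, so a full restore would stall.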

A key way to prepare for the human element is to carry out regular testing, and to build policies and procedures that cover the gaps machines cannot deal with.
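A basic restore test can be partly automated. The sketch below compares a restored directory against its source by checksum and lists anything missing or different. It assumes the source has not changed since the backup was taken. Where that is not the case, checksums recorded at backup time would be the better baseline. The function names are ours and are not taken from any backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir) -> list:
    """List source files that are missing from, or differ in, the restore."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = restored_dir / rel
        if not dst.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif sha256_of(src) != sha256_of(dst):
            problems.append(f"differs: {rel.as_posix()}")
    return problems
```

An empty list from `verify_restore` means every source file was found in the restore with identical content. It says nothing about whether the restored application will actually start, which still needs a procedural test.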

Infrastructure failures

Backups have to traverse all sorts of infrastructure, so a failure anywhere can affect backup and recovery, with recovery potentially even more vulnerable.

Infrastructure can encompass tape drives and libraries, disk arrays, backup servers, networks and, increasingly, your link to the cloud.

Key to mitigation of infrastructure issues is, once again, redundancy. So, for the parts of the infrastructure that you manage, make sure to have redundancy built in, whether at the level of media, servers or connectivity.

For those you don’t have direct control over – such as wide-area network (WAN) connections, cloud resources, and so on – clear service-level agreements (SLAs) need to be in place. And make sure that infrastructure is in place to effect a return to working should disaster strike.

Since the pandemic, the huge increase in the need to support remote working will have thrown infrastructure issues into sharp relief. It brought the need to look at the ability of existing software products to handle edge device backup, or even the need to procure a specialised product for this task.
