Disaster recovery planning is all about good preparation and not being fooled by suppliers whose main aim is to sell products, rather than ensure you have the most cost-effective and rounded approach to recovering from unplanned outages.
Toigo, who is CEO and managing principal of Toigo Partners International and chairman of the Data Management Institute, emphasised a routine and structured approach to disaster recovery planning in which key people in the organisation know exactly what they need to do and when to do it if disaster strikes.
He said: “You need to train a cadre of people that can think rationally in the face of great irrationality. There’s no place for micro-managing in a disaster.”
He contrasted this conscious and planned approach to disaster recovery with the pressure from suppliers to purchase products marketed as a panacea for disaster recovery.
“You can buy something from a vendor and they’ll say, for example, ‘high availability trumps disaster recovery’. Well, if you believe that, there’s a bridge in Brooklyn I’d like to sell you.”
The Florida-based disaster recovery expert set out three key areas to be tackled in recovering from a disaster: restoring data, re-hosting applications and re-connecting networks.
Toigo argued that there is no one-size-fits-all approach to disaster recovery. Instead, he set out a sliding scale of disaster recovery provision in terms of technological approaches, a so-called “recovery spectrum”.
At the least costly end of the continuum, and the slowest in terms of response time, is traditional backup with tape as the storage medium. This option entails restoring data, re-hosting apps and re-connecting networks.
Next there is WAN mirroring and/or continuous data protection (CDP), which is more costly but, with its more frequent protection points, removes the need to restore data. Restoration of apps is still required.
Finally, the most costly option is near real-time WAN failover between sites. This removes the need to restore data and apps, but will likely still require the network to be reconnected.
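The recovery spectrum described above can be sketched as a small lookup of which recovery tasks remain at each tier. This is an illustrative model only: the class and field names are hypothetical, the cost rank is an ordinal position rather than a price, and the assumption that the middle tier still needs network reconnection is inferred rather than stated in the talk.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTier:
    """One point on the recovery spectrum (hypothetical model)."""
    name: str
    cost_rank: int           # 1 = least costly; ordinal only, not a price
    restore_data: bool       # must data be restored after a disaster?
    rehost_apps: bool        # must applications be re-hosted?
    reconnect_network: bool  # must networks be re-connected?

    def steps(self) -> list[str]:
        """Return the recovery tasks still required at this tier."""
        tasks = []
        if self.restore_data:
            tasks.append("restore data")
        if self.rehost_apps:
            tasks.append("re-host applications")
        if self.reconnect_network:
            tasks.append("re-connect networks")
        return tasks


RECOVERY_SPECTRUM = [
    RecoveryTier("traditional backup to tape", 1, True, True, True),
    # Assumption: network reconnection is still needed at this tier.
    RecoveryTier("WAN mirroring / CDP", 2, False, True, True),
    RecoveryTier("near real-time WAN failover", 3, False, False, True),
]

if __name__ == "__main__":
    for tier in RECOVERY_SPECTRUM:
        print(f"{tier.cost_rank}. {tier.name}: {', '.join(tier.steps())}")
```

Reading the list top to bottom shows the trade-off: each step up in cost removes one recovery task from the list.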
Defence in depth
Toigo’s key concept in preparing for disaster was “defence in depth”, where the cornerstone is retention of copies of data far enough away from the primary site with provision for people, power, water, networks, and so on, taken into account.
Toigo also pointed out that most disasters are not sudden events, but ones whose approach can be seen and measures taken to deal with them before they become a catastrophe, such as taking last-minute backups, shutting down systems in an organised way and notifying key people.
How data is retained is entirely dependent on the importance of the data and the time required to make it useable again. The key here is to match data to the appropriate storage technology, ranging from tape backups to real-time failover with dual hot sites.
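As a rough illustration of matching data to storage technology, the sketch below picks a protection method from the downtime a dataset can tolerate. The thresholds of one hour and 24 hours are invented for the example and do not come from Toigo; any real plan would set them from its own business impact analysis.

```python
# Hypothetical thresholds: maximum tolerable downtime in hours, paired with
# the protection method assumed to meet it. Ordered from strictest to loosest.
TIERS = [
    (1.0, "real-time failover with dual hot sites"),
    (24.0, "WAN mirroring or continuous data protection"),
]
FALLBACK = "tape backup"


def choose_protection(tolerable_downtime_hours: float) -> str:
    """Return the cheapest method that still meets the downtime limit."""
    if tolerable_downtime_hours < 0:
        raise ValueError("tolerable downtime cannot be negative")
    for limit_hours, method in TIERS:
        if tolerable_downtime_hours <= limit_hours:
            return method
    # Data that can wait longer than every threshold goes to tape.
    return FALLBACK


if __name__ == "__main__":
    for name, hours in [("order database", 0.5), ("file shares", 8), ("archive", 72)]:
        print(f"{name}: {choose_protection(hours)}")
```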
In explaining his concept of defence in depth in disaster recovery, Toigo highlighted some technologies that are relatively neglected among the vendor community but which he sees as useful. These included storage virtualisation and the linear tape file system (LTFS).
Storage virtualisation comprises hardware and software products that place a virtualisation layer above heterogeneous storage media. These can be disk arrays, direct-attached storage, white box JBODs etc; the storage virtualisation product creates a pool of storage from them and allows volumes to be created.
Using the higher-level storage services that often come with these products, such as replication and thin provisioning, Toigo sees storage virtualisation as a way to create multiple datacentre sites that are functionally identical, without having to buy equipment from one vendor.
Products include DataCore and FalconStor software, IBM SAN Volume Controller and NetApp’s V-series hardware.
Meanwhile, Toigo advocated the use of LTFS as a production tier or nearline archive that is relatively quickly accessible and enables fairly rapid data restore. LTFS places a NAS-like head in front of tape libraries that provides a file system, and therefore quicker access, in the order of seconds, to data held on tape.
Toigo argued that such technologies provide reasonably priced and effective methods of ensuring data is protected in case of disaster, and are in many cases suited to datasets outside the most critical and time-sensitive.
In setting out his philosophy on disaster recovery Toigo was cautious, and in some cases scathing, about a number of contemporary technologies.
Among these were:
HDD vs Tape
Pointing at the IT supplier community’s predilection for declaring “tape is dead”, Toigo highlighted the inherent problems with HDDs.
“There are only two types of hard drive; those that have failed and those that will fail,” he said, pointing to the statistical likelihood that one-in-90 SATA drives will fail and that storage arrays are often built with disk drives that have come off the production line sequentially, and therefore are more likely to fail for similar reasons. “'Tape is dead' is the worst thing that has happened to disaster recovery,” he said.
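To see why a one-in-90 failure rate matters at array scale, the following back-of-envelope sketch computes the chance that at least one drive in an array fails. It assumes drives fail independently, which is a simplification: Toigo's point about sequentially manufactured drives suggests failures are correlated, so the real risk is likely higher. The time period the one-in-90 figure covers is not given in the talk, and the 24-drive array size is a made-up example.

```python
def prob_any_failure(n_drives: int, p_fail: float = 1 / 90) -> float:
    """Probability that at least one of n_drives fails.

    Assumes each drive fails independently with probability p_fail
    over whatever period the failure rate refers to.
    """
    if n_drives < 0:
        raise ValueError("n_drives must be non-negative")
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    # P(at least one failure) = 1 - P(no drive fails)
    return 1.0 - (1.0 - p_fail) ** n_drives


if __name__ == "__main__":
    for n in (1, 12, 24, 48):
        print(f"{n:>2} drives: {prob_any_failure(n):.1%} chance of at least one failure")
```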
Cloud and disaster recovery
Toigo also warned against seeing the cloud as a panacea for disaster recovery needs, or even relying on it as a repository for duplicate data. The chief reason, he argued, is its inherent unreliability, especially when it comes to restoring data over dubious last-mile connections.
“No-one ever checks to see if the cloud service is really there... or check if the vendor can return data by tape,” he said.
His argument was that the data you back up to a cloud provider can soon accumulate to volumes too large to bring back over the wire in a reasonable time, with 10TB taking more than a year to travel across a T1 connection, or 4 hours by OC-192. “But you can’t afford that,” said Toigo.
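The scale of the restore problem can be checked with simple arithmetic. The sketch below assumes decimal terabytes (10^12 bytes) and the nominal line rates of a T1 (1.544Mbps) and an OC-192 (9,953.28Mbps), with no protocol overhead or contention. That makes its results a lower bound, so real transfers would take longer and the output will not exactly match the figures quoted in the talk.

```python
T1_BPS = 1.544e6       # nominal T1 line rate, bits per second
OC192_BPS = 9.95328e9  # nominal OC-192 line rate, bits per second
TB = 10**12            # assumption: decimal terabyte, in bytes


def transfer_seconds(size_bytes: float, link_bps: float) -> float:
    """Idealised time to move size_bytes over a link running at link_bps."""
    if link_bps <= 0:
        raise ValueError("link speed must be positive")
    return size_bytes * 8 / link_bps


if __name__ == "__main__":
    size = 10 * TB
    t1_days = transfer_seconds(size, T1_BPS) / 86_400
    oc192_hours = transfer_seconds(size, OC192_BPS) / 3_600
    print(f"10TB over T1:     {t1_days:,.0f} days")
    print(f"10TB over OC-192: {oc192_hours:.1f} hours")
```

Even under these ideal conditions the T1 restore runs to well over a year, which supports the case for checking whether a provider can return data on physical media.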
Server virtualisation and disaster recovery
“You don’t fix stupid by virtualising it,” said Toigo, who argued that server virtualisation can cover a multitude of sins, especially where data protection is concerned.
“There’s an over-reliance on mirroring between VMs. This can be difficult to test and validate and doesn’t account for data in flight, in RAM etc. Most interruptions are messy interruptions. The only clean failover is one that’s done deliberately,” he said.