Preparing for 'little disasters' often neglected

Tom Dugan, CTO of Recovery Networks, says users commonly prepare to lose an entire building but aren't ready for small-scale losses, like losing a server.

Tom Dugan, chief technology officer of backup services provider Recovery Networks, says he sees the same problem among the users he consults with on a daily basis: they are prepared for the big disasters, like hurricanes or terrorist attacks, but can be totally at a loss when something as simple as a server crash occurs.

In this Q & A, Dugan shares customer anecdotes and words of advice about how to better prepare for everyday catastrophes.

What's the biggest problem remaining in disaster recovery?

Tom Dugan: When we first started Recovery Networks, we surveyed a bunch of companies to find out if they had DR [disaster recovery] plans. Those that did have plans covered the "disaster" events -- a building blowing up or catching fire, a big flood, etc. -- but they didn't cover the more likely "disasters" like a server dying or a database corruption, and so on.

As our director of sales and marketing said to a client: "Your building has a 99.99999% uptime since it was built in 1984. Your servers have a 99.9% uptime since they were installed three years ago. Why are you spending so much money on a solution to cover a 0.00001% chance of something going awry yet ignoring the 0.1% probability?"
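To put the uptime figures quoted above in concrete terms, here is a minimal Python sketch that converts an uptime percentage into expected downtime per year. The function name and the 365-day year are assumptions made for illustration, not anything Recovery Networks uses.

```python
# Rough conversion from an uptime percentage to expected downtime.
# Assumes a 365-day year; the function name is illustrative only.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def expected_downtime_hours(uptime_percent: float, years: float = 1.0) -> float:
    """Return the expected hours of downtime over the given number of years."""
    unavailable_fraction = (100.0 - uptime_percent) / 100.0
    return unavailable_fraction * HOURS_PER_YEAR * years


server_hours = expected_downtime_hours(99.9)
building_hours = expected_downtime_hours(99.99999)

print(f"Server at 99.9% uptime: about {server_hours:.2f} hours down per year")
print(f"Building at 99.99999% uptime: about {building_hours * 3600:.1f} seconds down per year")
```

On these assumptions, a server at 99.9% uptime is down for roughly a working day each year, while the building figure works out to a few seconds.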

Do you continue to see this happening? Where?

Dugan: Yes, absolutely. One company had 600 servers -- their disaster plan was that if the building blew up, they would fail over to Minneapolis. So I asked them, "What do you do when one server dies?" And they said, "We tell the IT guys they have to rebuild the server, and then we restore the data from tape." That's the answer most people have -- that's the only methodology they use.

Another example is a company that used SunGard for mainframes and AS400 applications but had no disaster capability for Windows servers. That makes sense -- it is hard to rebuild mainframe and AS400 systems. But then it turned out that they had 300 Windows servers, compared to a total of 13 AS400 and RS6000 systems. This company put all their eggs into the mainframe basket -- but if you have to rebuild, from scratch, 100 Windows servers, that's still a pretty daunting task.

How much impact can a 'little disaster' really have?

Dugan: Here's an example -- one customer of ours did payroll for all the education systems in an entire county, on one server. That system went down on a Friday, and it took all weekend until Monday morning to get it back up. They had to run payroll all afternoon, and bus drivers and teachers didn't get paid on payday. You couldn't declare a disaster for that -- it wasn't a physical disaster. But it was bordering on a political disaster there, and would've been a PR disaster if they hadn't got that system back up and running.

Having a "big disaster plan" is good and important, but it's like having life insurance but no car or home insurance. If one server fries, sure, you can get someone to fix it, then rebuild the operating system, security patches and updates, applications and then restore data -- but that 24-36 hours could be a major problem.

So for those short-term kinds of disasters, so to speak, what do you recommend?

Dugan: Traditionally, the technical answer to a server crashing is to cluster servers. General applications like SQL, Exchange and Oracle are cluster-aware. But vertical-market applications, like accounting and legal programs, tend not to be. And while clustering is one solution to consider for single-server failure, to cluster 15 or 20 servers you're going to need 30 or 40 machines, and it gets expensive.

If servers can't be clustered, sometimes people just say a little prayer on their way to work. That's it.

What can customers do in the case of servers that can't be clustered?

Dugan: VMware comes into play a lot of the time, in my opinion. It's a reliable solution for most situations where it's OK to take some time to bring the server back, and it's a lot quicker than rebuilding from scratch and restoring from tape. VMware can pre-create that SQL server, let's say, and provide a snapshot of it at a point in time. You don't have to invest in 20 physical servers -- invest in two and just have virtual servers on those systems, just disk files sitting there waiting to be used. The manual labour of regularly updating the copy of the production server is still there, but this approach minimises overall cost and rebuild times in the case of a corruption or failure. With enough disk space on the physical server, you can also back up and restore the data using the virtual machine in the event of a crash.
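The hardware arithmetic behind the clustering and virtual-standby options Dugan describes can be sketched as follows. This is a simplified illustration that assumes two-node clusters and assumes two standby hosts have enough capacity to hold every virtual copy; the function names are hypothetical, and real sizing depends on CPU, memory and disk.

```python
# Simplified machine-count comparison for protecting N production servers.
# Assumptions (not from the interview): each cluster has two nodes, and a
# fixed pair of standby hosts can hold all virtual copies as disk files.


def total_machines_clustered(production_servers: int, nodes_per_cluster: int = 2) -> int:
    """Physical machines needed if every production server is clustered."""
    return production_servers * nodes_per_cluster


def total_machines_virtual_standby(production_servers: int, standby_hosts: int = 2) -> int:
    """Physical machines needed if standby copies live as VMs on shared hosts."""
    return production_servers + standby_hosts


for servers in (15, 20):
    clustered = total_machines_clustered(servers)
    virtual = total_machines_virtual_standby(servers)
    print(f"{servers} servers: {clustered} machines clustered, {virtual} with virtual standby")
```

The sketch counts machines only. It does not capture the trade-off Dugan notes, namely that a virtual standby copy still has to be refreshed by hand and takes longer to bring up than a cluster failover.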

It's also the problem Recovery Networks is trying to solve for our clients -- that single or two-server disaster. Customers can use a service like ours to back up data at our facility, and if they lose a server, they activate the one on our end, link the two together over a virtual private network and they're up and running. We offer different levels of service, too, to make it more flexible and affordable.
