Whenever SearchStorage ANZ talks to industry about disaster recovery and business continuity plans, we are assured that with prudent planning and sufficient attention to detail, large outages simply should not happen.
As you probably know by now, Virgin Blue’s reservations system has crashed. Staff were left to use manual processes to get people onto planes. These took longer than boarding usually does, and massive delays ensued after the cutover from production systems to backups reportedly blew out from an expected three hours to an agonising 21 hours.
The airline has since issued a statement that cryptically says “... solid state disk server infrastructure used to host Virgin Blue failed ...”
We’re not entirely sure what a “solid state disk server” is. Google produces no direct results on a search for the term, other than Virgin’s statement, so we are pretty sure the airline's statement is fibbing a bit!
So what is the source of the fault?
We don’t think it’s servers
Despite Virgin Blue’s statement using the phrase “solid state disk server” we don’t think servers are the problem.
Navitaire, the provider of the reservation system, is something of a software-as-a-service play. We find it unimaginable that it doesn’t have servers galore, and server images galore, just waiting to be pressed into operation. In extremis, it could even nip out and get a new x86 server: we're pretty sure the top tier vendors can scramble one out of their warehouses in a screaming hurry for their best clients on high-level support contracts. Like global providers of airline reservation systems.
There’s a fair chance Navitaire is a virtualisation user to boot – who isn’t these days? – so rebuilding a server should not take the reported 21 hours. We think we can rule servers out.
Solid state disk failure
A task that can and often does take 21 hours is restoring data.
So what does Virgin mean by “solid state disk server”? One possible translation is that it could be solid state disk in a server, a plausible scenario given that an application like a reservations system can probably use as much I/O as it can get its proverbial hands on. A solid state disk would move things along nicely.
This theory holds a little water: read further into the airline’s statement and it says Navitaire was able to “isolate the point of failure to the device in question relatively quickly,” but that “an initial decision to seek to repair the device proved less than fruitful and also contributed to the delay in initiating a cutover to a contingency hardware platform.”
If an SSD in a server failed, it would be easy enough to figure that out and swap in a new one, or try to restore the integrity of the data it contained.
But it’s a little odd that a mission critical application like this did not have at least one redundant drive in place.
But what about those lovely big RAMdisks that connect to the PCI bus? They're often described as "disks" and they are certainly solid state. One use case for a RAMdisk is to hold a database in memory so that it runs faster. The nature of the devices means they're less likely to be deployed with a redundant partner.
For the sake of argument, let’s assume there was just one SSD in the application’s main server and that the physical failure of the drive led to data corruption. Let’s also assume the data in the system had not been written to magnetic disk or copied into a RAID set for a while.
At that point the Navitaire team would have faced a tricky chore: restoring the data – including recent transactions – and getting it running on the backup rig. That could account for the 21 hour delay.
What about the SAN?
The “extreme rebuild” scenario outlined above is, we believe, plausible.
But it’s more plausible if you factor in a little extra complexity. A single disk, after all, is relatively easy to restore.
A storage array is another matter entirely. Imagine a whole set of disks, solid state or otherwise, in an array striking a problem. Or a critical part of the array hitting a snag, like a disk drawer failing for some reason. That could easily account for the 21 hour cutover time.
As it happens, a search for “Navitaire and SAN” brings up a Navitaire case study that mentions NetApp and StorageTek as storage infrastructure providers.
The case study is an oldie: StorageTek was acquired by Sun a few years back and its brand has mostly disappeared. Some of the NetApp products it mentions are at end of life.
But one of the products – NearStore – seems very much alive.
We’ve called NetApp’s Australian PR representatives and they are not, at this time, able to confirm that Navitaire remains a user of its products.
But if we imagine that there is a SAN – from NetApp or another vendor – and that solid state disks in it failed, that could explain the long time to restore service.
That’s our best guess at what’s going on – at least until someone explains what a “solid state disk server” is.
UPDATE: 10:00PM September 28th
SearchStorage ANZ has found this story from 2009 that suggests Navitaire remains a NetApp customer.
We're yet to hear from NetApp.
UPDATE: Noon September 29th
NetApp has issued a statement in which it says its products have nothing to do with the outage.