DB2 failure prompts bank to set up extra disaster recovery

After software bugs took its systems down last month, Danske Bank is reluctant to rely solely on IBM recovery tools.

Danske Bank has taken steps over and above industry best practice to avoid a failure in its IBM Geographically Dispersed Parallel Sysplex (GDPS) mainframe system.

Last month Danske Bank was shaken by a series of software errors in its DB2 installation following a hardware failure at a datacentre in Ejby, Denmark. Key trading systems ground to a halt on 10 March and the bank was not fully operational again until 17 March.

Reflecting on the traumatic series of events that affected the IT operations at the bank, Peter Schleidt, executive vice-president at Danske Bank, told Computer Weekly, "We need to be more independent from having to recover with IBM's tools - even if that means we have to do more than what is recognised as best practice currently in the market."

One of the best practices for disaster recovery in mainframe systems is to use synchronised mirroring to replicate data. Schleidt said, "Even with a synchronous mirror, we will not be safe from errors in DB2 in the future."

For disaster recovery, the current set-up is based on IBM's GDPS emergency security system. This provides mirrored discs based on IBM Ramac Virtual Array technology to supplement Danske's existing two-centre IT operations. But Schleidt said the current GDPS installation does not provide adequate protection.

"We have learned from IBM that the current version of GDPS would not have solved the problem [with the RVA disc subsystem]," he said, adding that version 2.8 of GDPS, which will be available next month, should prevent the failure of the RVA disc subsystem in the future.

The IT problems at Danske Bank were triggered by a failure in the RVA disc subsystem following an electrical outage during routine disc maintenance. This resulted in data inconsistencies in its DB2 system.

Schleidt said the bank would be taking additional steps to maintain data integrity in the future, rather than merely following industry best practice guidelines for implementing disaster recovery measures in a GDPS installation.

"Based on our current experience with DB2, we have decided to develop asynchronous-based mirroring to our availability centre." He said this would allow the bank to run critical functionality, such as online and batch processes, at the datacentre even when the primary GDPS is down.

Schleidt said the real problem the bank faced was that DB2 became inconsistent after a hardware breakdown. This was the first of four errors Danske Bank faced with its DB2 set-up.

In a statement, the bank said a second software error delayed the recovery process, as several DB2 tables could not be started, and a third error on the system prevented recovery jobs from being run simultaneously.

Another software bug stopped the corrected data from being reloaded into the databases. The bank said, "This last error, which appeared on Thursday 13 March, resulted in new episodes of inconsistent data that had to be recreated by other methods."

To avoid further delays in restarting the system, the bank decided not to wait for IBM to issue patches and used back-up data from its operations centre in Brabrand to restart the system. It was up and running four days after the initial failure. However, it took the bank until 17 March to clear all the transactions that had accumulated while the system was not operational.

Commenting on the failure of the DB2 system, a spokesman for IBM said, "IBM issued immediate updates to its DB2 product to ensure that other customers avoid the software difficulties that were experienced at the bank."

IBM said the circumstances that led to the problems at Danske Bank were highly unusual.

Whether the DB2 problems experienced by Danske Bank are more widespread remains to be seen. What is clear is that in any computer outage IT staff are called on to work at a frenetic pace to put systems back online.

Julie-Ann Williams, chairwoman of the large systems working group at IBM user group GuideShare Europe, said users need to pay particular attention when recovering a DB2 system after a failure. "Unless you have a regimental approach to recovering DB2, you can easily slip up," she said.

With IT staff under increasing business pressure to restore IT systems, Williams warned that it was easy to miss an essential step in the recovery process.

What is synchronised mirroring?

Usually deployed across geographically dispersed sites, synchronised mirroring involves writing data to two disc subsystems simultaneously. If one disc fails, the mainframe can be hot-swapped onto the second disc. With asynchronous mirroring, the second copy lags behind the first, so it is necessary to wait for the disc to flag that the data has been successfully copied over before it can be used. In an IBM installation, Geographically Dispersed Parallel Sysplex (GDPS) complements a multisite Parallel Sysplex mainframe clustering implementation by providing a single, automated system to manage storage subsystem mirroring (disc and tape), processors, and network resources. It is designed to allow a business to attain "continuous availability" and "near transparent business continuity (disaster recovery)" without data loss.
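
The difference between the two mirroring modes can be illustrated with a toy model. The Python sketch below is an illustration only: the class names (Disc, SyncMirror, AsyncMirror) and their methods are hypothetical, the "discs" are in-memory dictionaries, and nothing here represents the interfaces of GDPS, the RVA subsystem or DB2.

```python
from collections import deque


class Disc:
    """Hypothetical stand-in for a disc subsystem: an in-memory block store."""

    def __init__(self):
        self.blocks = {}

    def write(self, key, value):
        self.blocks[key] = value


class SyncMirror:
    """Synchronous mirroring: a write returns only after both discs hold the data."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, key, value):
        self.primary.write(key, value)
        self.secondary.write(key, value)  # the caller waits for this copy too


class AsyncMirror:
    """Asynchronous mirroring: a write returns once the primary holds the data;
    the copy to the secondary is queued and applied later."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.pending = deque()  # writes not yet copied to the secondary

    def write(self, key, value):
        self.primary.write(key, value)
        self.pending.append((key, value))

    def drain(self):
        """Apply queued writes to the secondary, oldest first."""
        while self.pending:
            key, value = self.pending.popleft()
            self.secondary.write(key, value)

    def secondary_is_current(self):
        """True once every write has been copied, so the secondary is safe to use."""
        return not self.pending
```

In this model a SyncMirror keeps both copies identical after every write, while an AsyncMirror leaves the secondary behind until drain() has run and secondary_is_current() returns True. In either mode the mirror copies whatever it is given, so an inconsistent write from the database layer reaches both copies, which is the limitation Schleidt describes.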
