How a flash array could have limited RBS damage

The RBS banking system failure shows that its customer account database should be in a flash array to ensure there’s no major bottleneck in the way of rapid recovery.

RBS recently had a very public failure of its banking system with customers unable to access their money or account information. It started on 19 June and was still ongoing on 25 June, with heartbreaking stories, such as house purchases falling through because money couldn't be transferred. Many millions of its nearly 17 million customers were affected by a software problem, thought to be related to mainframe batch job control software.

What has been surprising is just how long it took to resolve the problem. I would suggest that one aspect of the system's architecture that has contributed to this is its use of disk storage where a flash array should have been in place. (I don’t have confirmation that RBS doesn’t have a flash array, but there are multiple indications that it doesn’t.)

My thinking starts from the idea that there is a database with 17 million customer account records and a file of updates to be run against that. These are regularly scheduled direct debits, timed money transfers and that sort of thing, separate from online banking transactions through ATMs and customers' PCs.

Let's assume that there are a million of these file-based transactions to be applied to the database. If a day goes by in which they are not done, then the next day there are 2 million, the next day 3 million and so on.

Each transaction takes a finite amount of time as a record is located in the database, read in from disk, updated and written back. And unless corrupted batch processing is corrected, pretty soon the backlog will build up so much that there will not be enough time in an overnight batch run slot for it to complete.
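To put rough numbers on that argument, here is a minimal back-of-the-envelope sketch. The million transactions a day is the assumption made above; the eight-hour batch window and the per-transaction times (20 ms on disk, 1 ms on flash) are purely illustrative guesses, not RBS figures, and the function name is made up for this example.

```python
import math


def nights_to_clear(outage_days, daily_txns, usecs_per_txn, window_hours):
    """Count the nightly batch runs needed to clear a backlog.

    All inputs are illustrative assumptions. The per-transaction time is
    given in whole microseconds so the capacity arithmetic stays in integers.
    Returns math.inf when a night's capacity cannot even keep up with one
    day's new transactions, because the backlog then never shrinks.
    """
    window_usecs = window_hours * 3600 * 1_000_000
    capacity = window_usecs // usecs_per_txn  # transactions per nightly run
    if capacity <= daily_txns:
        return math.inf

    backlog = outage_days * daily_txns  # missed work at the first recovery run
    nights = 0
    while backlog > 0:
        backlog = max(0, backlog - capacity)
        nights += 1
        if backlog > 0:
            backlog += daily_txns  # another day's work lands before the next run
    return nights


if __name__ == "__main__":
    # Hypothetical service times: 20 ms per transaction on disk, 1 ms on flash.
    for label, usecs in (("disk", 20_000), ("flash", 1_000)):
        print(label, nights_to_clear(6, 1_000_000, usecs, 8))
```

Under these assumed figures a six-day backlog takes 12 nightly runs to clear on disk and a single run on flash. The exact numbers matter less than the shape of the result: once nightly capacity is only a little above the daily load, every missed day costs several days of catching up.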

RBS boss Stephen Hester has said that what happened is unacceptable. Mistakes and errors will occur, but with computerised banking systems looking after millions of customer records, the reliance on spinning hard drives to store data begins to look like a bottleneck waiting to happen. Bringing customer records into and out of spinning disk, using solid-state storage as cache (which, I’m surmising, RBS is doing), is the wrong approach. There is a faster way to deal with them; they should reside permanently in a flash array.

The customer account database should be treated as an in-memory database, held in its entirety on a solid-state storage array from a supplier such as Texas Memory Systems (TMS) or Kaminario. Updates will then take a fraction of the time needed for disk-based database records: there is no disk seek or track read latency, and network latency falls too if the array is located close to the mainframe processors.
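The size of that fraction can be estimated with some simple arithmetic. The sketch below assumes a 15,000rpm drive with a 3.5 ms average seek, a flash array answering in 0.2 ms, and two random I/Os per update (one read, one write). All of these are illustrative assumptions rather than measurements of any RBS or vendor system, and the function names are hypothetical.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for the sector to come round: half a revolution."""
    ms_per_revolution = 60_000 / rpm
    return 0.5 * ms_per_revolution


def update_time_ms(io_latency_ms, ios_per_update=2):
    """Time for one record update, assumed to cost one read plus one write."""
    return io_latency_ms * ios_per_update


def updates_per_second(io_latency_ms, ios_per_update=2):
    """Updates a single serial stream could complete in one second."""
    return 1000 / update_time_ms(io_latency_ms, ios_per_update)


# Assumed figures: 3.5 ms average seek on a 15,000rpm drive, 0.2 ms for flash.
DISK_IO_MS = 3.5 + avg_rotational_latency_ms(15_000)
FLASH_IO_MS = 0.2

if __name__ == "__main__":
    print("disk  updates/s:", round(updates_per_second(DISK_IO_MS)))
    print("flash updates/s:", round(updates_per_second(FLASH_IO_MS)))
```

Real systems spread the work across many drives and streams, so absolute throughput will be far higher than a single serial stream suggests. The ratio between the two per-update times is the point of the exercise.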

There are precedents for this. Penn State University reduced nightly backup times of a General Parallel File System (GPFS)-based system with millions of files from six hours to one hour with a RamSan flash array from TMS. Banco Azteca in Mexico uses RamSan systems with up to a terabyte of capacity. The move to a flash storage array took place because, as the bank's data traffic increased, it began experiencing performance problems with transactions because of degradation in response times from its enterprise storage array. These led to problems with application stability as, in periods of heavy load, the application was prone to lock up and deny features to bank customers.

After the move to solid-state, the bank's director of systems, Juan Arevalo Carranza, said, "We have had excellent results, really spectacular, because now we can process a huge number of transactions in one day with lower response times."

RBS would need a terrific increase in transaction speed to recover quickly from a big foul-up: to recover in a day or less rather than a week or more. Relying on spinning disk is like forcing an Olympic sprinter to run in a three-legged race. The likelihood is that the size of banking customer account databases will grow and the number of transactions will grow as well. Both of these things will exacerbate the difficulties IT staff face when dealing with the next banking system failure and cause even more customer disruption.

The RBS banking brands (RBS itself, NatWest, Ulster Bank and so on) will have suffered major damage to their reputations. Affected customers who have suffered reputational damage of their own through not paying their staff or customers will blame RBS. The costs of fixing the problem and compensating customers for their added expenses will run into millions of pounds. RBS has made itself look incompetent and slow.

Surely, there is a strong case for the basic computer architecture of its banking IT system to be looked at and the single biggest bottleneck in the recovery from a banking system failure to be identified. In this case it is fairly obvious that reliance on spinning disk is that bottleneck.

These are mission-critical systems, and that means they should not fail, but if they do fail, recovery has to be as rapid as possible.

For RBS, the recovery time objective (RTO) for a disaster such as this needs re-examining, as recovery of overwhelmed disk-based databases is far too slow. Recovery needs to be completed in hours, not days.

Confidence in RBS is perceptibly weakening, and that’s the last thing the bank needs. It needs to strengthen its operations with solid-state, not more spin.

Chris Mellor is storage editor of The Register.
