RBS recently had a very public failure of its banking system with customers unable to access
their money or account information. It started on 19 June and was still ongoing on 25 June, with
heartbreaking stories, such as house purchases falling through because money couldn't be
transferred. Many millions of its nearly 17 million customers were affected by a software problem,
thought to be related to mainframe batch job control software.
What has been surprising is just how long it took to resolve the problem. I would suggest that one
aspect of the system's architecture that has contributed to this is its use of disk storage where a
flash array
should have been in place. (I don’t have confirmation that RBS doesn’t have a flash array, but
there are multiple indications that it doesn’t.)
My thinking starts from the idea that there is a database with 17 million customer account records and a file of updates to be run against that. These are regularly scheduled direct debits, timed money transfers and that sort of thing, separate from online banking transactions through ATMs and customers' PCs.
Let's assume that there are a million of these file-based transactions to be applied to the
database. If a day goes by in which they are not done, then the next day there are 2 million, the
next day 3 million and so on.
Each transaction takes a finite amount of time as a record is located in the database, read in from
disk, updated and written back. And unless corrupted batch processing is corrected,
pretty soon the backlog will build up so much that there will not be enough time in an overnight
batch run slot for it to complete.
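As a rough illustration of that arithmetic, the sketch below works out how quickly a missed batch run outgrows the overnight slot. Only the figure of a million transactions a day comes from the example above. The 10 millisecond cost per transaction, the eight-hour window and the strictly serial processing are assumptions made for illustration, not figures from RBS, and the function names are hypothetical.

```python
# Sketch of the backlog arithmetic. All timings are illustrative assumptions.

DAILY_TRANSACTIONS = 1_000_000       # from the example in the text
ASSUMED_SECONDS_PER_TXN = 0.010      # assumption: 10 ms to locate, read, update, write
ASSUMED_WINDOW_HOURS = 8.0           # assumption: length of the overnight batch slot


def backlog_after(days_missed, daily=DAILY_TRANSACTIONS):
    """Transactions waiting after `days_missed` failed runs, including today's file."""
    return daily * (days_missed + 1)


def run_hours(transactions, seconds_per_txn=ASSUMED_SECONDS_PER_TXN):
    """Hours needed to apply the transactions one after another (serial assumption)."""
    return transactions * seconds_per_txn / 3600


def first_overflow_day(window_hours=ASSUMED_WINDOW_HOURS,
                       seconds_per_txn=ASSUMED_SECONDS_PER_TXN,
                       daily=DAILY_TRANSACTIONS,
                       max_days=365):
    """Smallest number of missed days after which the backlog no longer fits the window."""
    for days_missed in range(max_days + 1):
        if run_hours(backlog_after(days_missed, daily), seconds_per_txn) > window_hours:
            return days_missed
    return None  # the backlog always fits within max_days


if __name__ == "__main__":
    for d in range(4):
        print(d, backlog_after(d), round(run_hours(backlog_after(d)), 2))
    print("window overrun after missed days:", first_overflow_day())
```

Under these assumed figures a normal night's run takes under three hours, but after two missed days the three million waiting transactions no longer fit into the eight-hour slot. Real batch systems run work in parallel, so the absolute numbers will differ; the point is only that the backlog grows linearly while the window does not.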
RBS boss Stephen Hester has said that what happened is unacceptable. Mistakes and errors will
occur, but with computerised banking systems looking after millions of customer records, the
reliance on spinning hard drives to store data begins to look like a bottleneck waiting to happen.
Bringing customer records into and out of spinning disk, using solid-state
storage as cache (which, I’m surmising, RBS is doing), is the wrong approach. There is a faster
way to deal with them; they should reside permanently in a flash
array.
The customer account database should be an in-memory database. Solid-state storage arrays from
suppliers such as Texas Memory Systems (TMS) or Kaminario should be used to hold it. Updates to it
will take place in a fraction of the time needed for disk-based database records; there will be no
disk seek and track read time latency and less network latency too if the in-memory database system
is located close to the mainframe processors.
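To put a number on "a fraction of the time", here is a minimal sketch comparing how long a backlog would take to clear at two per-update latencies. Both latencies and the backlog size are assumptions chosen for illustration: 10 milliseconds for a disk-based update dominated by seek and rotation, and 0.5 milliseconds for the same update against a flash array. Neither is a measured figure from RBS, TMS or Kaminario, and updates are assumed to run one after another.

```python
# Sketch: time to clear a backlog at two assumed per-update latencies.
# Every constant here is an illustrative assumption, not a measurement.

ASSUMED_DISK_SECONDS = 0.010     # assumption: disk seek + read + write per update
ASSUMED_FLASH_SECONDS = 0.0005   # assumption: same update against a flash array
ASSUMED_BACKLOG = 6_000_000      # assumption: six days of a million updates each


def hours_to_clear(backlog, seconds_per_update):
    """Hours to apply `backlog` updates serially at the given per-update cost."""
    return backlog * seconds_per_update / 3600


disk_hours = hours_to_clear(ASSUMED_BACKLOG, ASSUMED_DISK_SECONDS)
flash_hours = hours_to_clear(ASSUMED_BACKLOG, ASSUMED_FLASH_SECONDS)
speedup = disk_hours / flash_hours

if __name__ == "__main__":
    print(f"disk:  {disk_hours:.1f} hours")
    print(f"flash: {flash_hours:.1f} hours")
    print(f"ratio: {speedup:.0f}x")
```

With these assumed latencies the same backlog that needs more than 16 hours on disk clears in under an hour on flash. The ratio depends entirely on the latencies plugged in, so the comparison should be rerun with measured values before drawing any conclusion about a particular system.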
There are precedents for this. Penn State University reduced nightly backup times of a General
Parallel File System (GPFS)-based system with millions
of files from six hours to one hour with a RamSan flash array from TMS. Banco Azteca in Mexico uses
RamSan systems with up to a terabyte of capacity. The move to a flash storage array took place
because, as the bank's data traffic increased, it began experiencing performance problems with
transactions because of degradation in response times from its enterprise storage array. These led
to problems with application stability as, in periods of heavy load, the application was prone to
lock up and deny features to bank customers.
After the move to solid-state, the bank's director of systems, Juan Arevalo Carranza, said, "We
have had excellent results, really spectacular, because now we can process a huge number of
transactions in one day with lower response times."
RBS would need a terrific increase in transaction speed to recover quickly from a big foul-up: to
recover in a day or less rather than a week or more. Relying on spinning disk is like forcing an
Olympic sprinter to run in a three-legged race. The likelihood is that the size of banking customer
account databases will grow and the number of transactions will grow as well. Both of these things
will exacerbate the difficulties IT staff face when dealing with the next banking system failure
and cause even more customer disruption.
The RBS banking brands -- RBS itself, NatWest, Ulster Bank and so on -- will have had major damage
done to their brand perception. Affected customers who have suffered reputational damage through
being unable to pay their staff or customers will blame RBS. The costs of fixing the problem and
compensating customers for their added expenses will run into millions of pounds. RBS has made
itself look incompetent and slow.
Surely, there is a strong case for the basic computer architecture of its banking IT system to be
looked at and the single biggest bottleneck in the recovery from a banking system failure to
be identified. In this case it is fairly obvious that reliance on spinning disk is that
bottleneck.
These are mission-critical systems, and that means they should not fail, but if they do fail
recovery has to be as rapid as possible.
For RBS the recovery time objective (RTO) from a disaster such as this needs re-examining as recovery for overwhelmed disk-based databases is far too slow. It needs to be completed in hours, not days.
Confidence in RBS is perceptibly weakening, and that’s the last thing the bank needs. It needs to strengthen its operations with solid-state, not more spin.
Chris Mellor is storage editor of The Register.
This was first published in June 2012
