How HBOS and Tesco survive the storage test

Buying the hardware and the software to manage your storage needs from one supplier is a logical choice for most businesses, as there is no question who is responsible when it goes wrong.

But it is not always the best option to use what comes out of the box. Sometimes, buying a third-party product can make a big difference. This is what retail bank HBOS found when it needed to monitor its storage area network (San) in order to tweak performance.

Another challenge IT departments face is that storage is constantly growing. It is a costly exercise to upgrade storage hardware and database and network infrastructure to keep pace with this growth.

As supermarket Tesco found, upgrading is not the only option. Sometimes, a low-cost option can make a big difference. Rather than turbo-charging the infrastructure, Tesco simply compressed the data.

Case study: HBOS

Keeping 67,000 users of real-time applications happy is not easy, but the storage team at HBOS has found high-level monitoring tools help. The UK's largest mortgage and savings provider has a relationship with two out of every five UK households, so the storage team has its work cut out optimising response times of systems in order to keep the relationship with customers and internal users smooth.

"We have lots of different systems that need to be accessed in real time," says Simon Close, service manager of storage management services at HBOS. These include data warehouses, large online database systems and real-time transaction analysis applications holding terabytes of customer data. "Such applications demand swift and consistent response times - sub-second delays will not do," says Close.

In order to provide continuous data access, a preliminary San was implemented in 2000, and another four Sans have since been rolled out. The Sans run specific data access requirements for functions including back-up, production and pre-production. Each is maintained separately, as the data stored on it is accessed in a different way. "We like to keep the back-up San separate so it does not impact on live transactions," says Close.

Today, more than 5,000 San ports and 100 switches constitute five replicated, cross-site "fabrics" - essentially ring-fenced configurations of dedicated storage, connected by fibre links. The fabric is continually replenished and HBOS is currently upgrading legacy 3900 switches to 4Gbit/s Director models from Brocade. This provides an upgrade from 32-port switches to 256-port versions, and lets HBOS stay ahead of the game in terms of bandwidth and throughput.

Having such robust Sans in place does secure data across HBOS' two datacentres located in West Yorkshire, but it does not guarantee the optimal performance of the applications that it supports. "Sans are not the end of all performance problems - in many ways they are the start," says Close.

For example, the IT department must now think about the way that each application is designed and deployed. "An Oracle application may entail specific different read and write ratios and require storage to be provisioned in different ways compared to other objects," says Richard Briggs, senior technical infrastructure developer in HBOS' storage management services group.

"Similarly, application changes, operating system patches, or simply adding servers or storage modules may all have an impact on San performance," Briggs adds.

Because of the scale of the datacentre and storage operation, HBOS employs a dedicated team to look after this element of the IT operation. A team of 30 looks after storage, and within this, a six-strong team tends to the Sans. The latter found itself in a political hotspot when, despite the improved resilience they brought to the operation, the Sans fast became the victim of the "blame it on the network" mantra.

"One of the biggest challenges we face is the perceived performance issues related to our Sans - for example when a user tells us his or her application is slow," says Close. Indeed, the majority of problems on the Sans are not failures, but intermittent "niggles", such as slow response times or a server reacting badly to delayed I/O. The possibilities are endless as to where the source of the problem might be, but nonetheless, "everyone was blaming the Sans when things went wrong", says Briggs.

The team had been relying on hardware-specific tools, which provided a lot of statistics on a product's performance rather than any lower-level diagnostic information. "None of the tools that we had been using to monitor the Sans' performance could get us to the root of the problem, and some problems in the early days went unsolved," says Briggs.

Because of these shortcomings, HBOS worked by process of elimination, even though it recognised this was not an effective way to handle such issues. Whenever there was a problem it could not solve, the team would call its third-party maintenance supplier, HDS, which would bring in an analyser. However, this entailed shutting down several servers, both to plug in the analyser and to take it out again.

The HBOS storage team realised it needed its own dedicated San monitoring tool to allow it to troubleshoot with greater accuracy, and selected Finisar's Netwisdom. The impact was instantaneous, if a little alarming. "When we first plugged it in, the dashboard lit up red and we were shocked at the amount of errors being reported," says Close. Since then the San team has learned to prioritise and retune the alerting thresholds for different applications.
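The idea of retuning alert thresholds per application can be illustrated with a minimal sketch. This is not how Netwisdom works internally; the application names, the latency figures and the `classify` function are all hypothetical, chosen only to show why one global threshold tends to light the dashboard up red while per-application limits do not.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    warn_ms: float      # latency at which a warning is raised
    critical_ms: float  # latency at which the alert becomes critical


# Hypothetical per-application limits: a transactional system tolerates
# far less I/O latency than a batch-oriented data warehouse.
THRESHOLDS = {
    "online_banking": Threshold(warn_ms=5.0, critical_ms=10.0),
    "data_warehouse": Threshold(warn_ms=20.0, critical_ms=50.0),
}

# Fallback for applications that have not been tuned yet (assumed values).
DEFAULT = Threshold(warn_ms=10.0, critical_ms=25.0)


def classify(app: str, latency_ms: float) -> str:
    """Return 'ok', 'warning' or 'critical' for one latency sample."""
    limit = THRESHOLDS.get(app, DEFAULT)
    if latency_ms >= limit.critical_ms:
        return "critical"
    if latency_ms >= limit.warn_ms:
        return "warning"
    return "ok"
```

With these assumed limits, the same 12ms sample is critical for the banking application, acceptable for the data warehouse and a warning for an untuned application, which is the kind of prioritisation the team describes.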

HBOS can now solve problems in a proactive way, such as when the team pinpointed an underperforming piece of middleware servicing an online banking application. "The message broker system was struggling to cope with the demands of web-based traffic and customers were getting impatient," says Close.

Netwisdom identified the "hot" area on the disc and prompted the storage team to work with Unix administrators, database administrators and the application team to improve response times for the application.

Third-party suppliers like it too, says Briggs. "It means they can get to the root of the problem quicker, rather than coming on site and poring through logs. It helps them understand at a detailed technical level how their products are operating in a complex and demanding environment. This allows them to identify and isolate not only defects, but also enhance their products."

However, Briggs does note that, "On a day-to-day basis we are still not fully there yet - it is an ongoing process of tuning and optimisation." This is inevitable as performance benchmarks degrade over time and customers raise the bar in their expectations.

Importantly, however, finger pointing at the Sans has reduced. "We have been able to demonstrate that San response times are well within the thresholds that have been dictated. The problem is either in the way the application has been configured, or how the storage was originally provisioned," Close adds.

"Armed with that information, groups can evaluate their application design and redevelop or tune the application accordingly."

Case study: Tesco

Increasing storage requirements at Tesco meant backing up data was absorbing the IT operations team's valuable time, and the retailer faced the prospect of spiralling data volumes undermining the success of its online shopping business.

The rate of growth of Tesco's online shopping operation is not dissimilar from the main store's business, at about 30% year on year, says Chris Howell, IT manager of operations and infrastructure. The online shopping site generates operational data that is held in a series of SQL Server databases. This includes information about available products, billing and delivery information for customers, as well as any "favourites" that they have saved.

With the quantity of data increasing rapidly and the time taken to perform the back-up approaching five hours, there was not much room for error in the retailer's slender back-up window. Howell was concerned that the main back-up procedure should not breach the period between midnight and 6am. "If a disaster occurred after 8am and we had not succeeded in backing-up, there would be problems opening stores' sites," he says.

A team of three database administrators was already performing many back-ups in each 24-hour period using traditional log shipping and replication methods. However, Howell decided to make a pre-emptive strike before the besieged storage operation ran into trouble. "We want customers to have the best possible experience, and anything that slows their online journey is not good."

Howell reviewed three options: a database overhaul, storage compression techniques and a network upgrade to increase the data flow. Data compression far outweighed a database architecture overhaul or an upgrade of network capacity in terms of value for money, Howell discovered. Investing in compression software cost under £20,000 and Tesco expects the software to secure another five years' life expectancy for its current back-up arrangement.

Beefing up the databases would have been a viable alternative, but the investment would have been more expensive and taken a lot longer. "Investment in a major piece of database hardware and the corresponding time in the configuration and migration would not have stacked up," says Howell.

By deploying compression tool Litespeed from software supplier Quest Software, Tesco has secured a 67% improvement in performance. "Our 120Gbyte SQL Server previously took 59 minutes to back up, but with Litespeed it now takes 18 minutes. This improvement means that we have ample time to deal with any back-up problems to ensure that [the site] is always open," says Howell.
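The arithmetic behind Howell's figures can be checked with a short calculation. The sketch below uses only the numbers quoted (120Gbyte, 59 minutes, 18 minutes); the function name and the returned keys are illustrative. On these figures the elapsed time falls by a little over two-thirds, broadly in line with the improvement cited.

```python
def backup_stats(size_gbyte: float, before_min: float, after_min: float) -> dict:
    """Summarise the effect of a shorter back-up on time and throughput."""
    reduction = (before_min - after_min) / before_min  # fraction of time saved
    speedup = before_min / after_min                   # how many times faster
    return {
        "reduction_pct": round(100 * reduction, 1),
        "speedup": round(speedup, 2),
        "gbyte_per_min_before": round(size_gbyte / before_min, 2),
        "gbyte_per_min_after": round(size_gbyte / after_min, 2),
    }


stats = backup_stats(size_gbyte=120, before_min=59, after_min=18)
print(stats)
```

The effective throughput here measures source data protected per minute, not bytes written: with compression, less data reaches the back-up target for the same database.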

Another important factor in Litespeed's favour was the minimal amount of time required to execute the upgrade. "It was easy to install and we only needed to call the Quest helpdesk once," says Howell. This was largely because of the simplicity of the concept. The product works in tandem with SQL Server which has its own native back-up, and Litespeed simply plugs into and supplements this.

Data compression has future-proofed the operation for the next five years, and gives Howell a breather to concentrate on other aspects of the storage operation. Freed from the headache of back-up problems, Howell can concentrate on retuning the rest of the storage infrastructure that keeps the business growing. "It is finetuning performance of the Sans and network attached storage infrastructure that occupies me now."
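The five-year horizon can be sanity-checked with a rough projection. The sketch below rests on several assumptions that the article does not state: that back-up time scales linearly with data volume, that data grows at the same 30% a year as the online business, that the window is the six hours between midnight and 6am, and that the 59-to-18-minute ratio seen on one server applies to the whole five-hour back-up. The function name is illustrative.

```python
import math


def years_until_breach(backup_hours: float, window_hours: float,
                       annual_growth: float) -> float:
    """Years until a back-up growing at annual_growth fills the window.

    Assumes back-up duration grows in proportion to data volume.
    """
    if backup_hours >= window_hours:
        return 0.0
    return math.log(window_hours / backup_hours) / math.log(1 + annual_growth)


WINDOW_HOURS = 6.0       # midnight to 6am
CURRENT_BACKUP = 5.0     # "approaching five hours"
GROWTH = 0.30            # about 30% year on year

# Without compression the window is breached in well under a year.
uncompressed = years_until_breach(CURRENT_BACKUP, WINDOW_HOURS, GROWTH)

# With compression, scale the back-up by the observed 18/59 ratio (assumed
# to hold across all databases).
compressed = years_until_breach(CURRENT_BACKUP * 18 / 59, WINDOW_HOURS, GROWTH)

print(f"uncompressed: {uncompressed:.1f} years, compressed: {compressed:.1f} years")
```

Under these assumptions the uncompressed back-up would outgrow its window in about eight months, while the compressed one lasts a little over five years, which is consistent with the life expectancy Tesco quotes.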
