High-performance computing (HPC) and high-availability systems have traditionally been the preserve of academics, engineers and scientists. But as the use of big data and cloud computing grows, many other enterprises also want access to this kind of IT infrastructure.
The old view of HPC was of highly specialised computers. These ranged from “specialist generalist” computers (ie systems produced in significant numbers for focused workloads), such as the RISC-based IBM RS/6000 or HP PA-RISC systems used in engineering and design, through to bespoke supercomputers based on highly specialised silicon for massive number crunching.
While the most common users of HPC systems are scientists, researchers, engineers, academic institutions and a few public sector organisations, the demand for processing power and speed is pushing businesses of all sizes towards HPC systems – particularly for transaction processing and data warehouses.
Meanwhile, high availability (HA) refers to a system or component that is continuously operational for a long time.
With HA, complex clusters of servers, with dedicated software monitoring the “heartbeat” of the systems and managing failover to redundant systems, were the norm. Many organisations were proud to have general systems running at above 99% availability; the aim was seen as “five nines five” (99.9995%) availability, or about two and a half minutes of unplanned downtime per year.
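The relationship between an availability percentage and yearly downtime is simple arithmetic. The sketch below assumes a 365-day year; the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime, in minutes per year, implied by an availability figure."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


# Compare a few common availability targets.
for pct in (99.0, 99.9, 99.999, 99.9995):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} minutes/year")
```

At 99% the result is more than three and a half days a year, while 99.9995% comes out at a little over two and a half minutes.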
Pulling the two – HPC and HA – together was expensive and only a few organisations could afford highly available, high-performance systems.
But HPC and HA should now be part of the inherent design of systems. New technical approaches such as big data and cloud computing require technology platforms that can not only meet the resource levels required by the workloads, but also adapt resources in real time to meet changes in the workloads’ needs. HPC and HA should do all of this on a continuous basis with little to no planned or unplanned downtime.
How to create a highly available HPC datacentre
Creating an HA HPC datacentre is now more achievable than ever. Virtualisation and cloud computing provide the basic means to gain good levels of availability. By intelligently virtualising the three main resource pools of compute, storage and network, the failure of any single item should have almost no impact on the running of the rest of the platform.
“Almost no impact” is the key here – if the virtualised pool consists of, say, 100 items, a single failure is only 1% of the total and the performance impact should be minimal. However, if the virtualised pool is only 10 items, then the hit will be 10%.
Datacentre professionals must also note that although virtualisation provides a better level of inherent availability, it is not a panacea. Virtual images of applications, virtual storage pools and virtual network paths are still dependent on the physical resources assigned to them, and the datacentre design must take this into account.
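The pool-size arithmetic can be written down directly. This is a minimal sketch that assumes every item in the pool contributes equal capacity, which real pools rarely do; the function name is illustrative.

```python
def capacity_lost(pool_size: int, failed_items: int = 1) -> float:
    """Fraction of pool capacity lost when equal-sized items fail."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    if not 0 <= failed_items <= pool_size:
        raise ValueError("failed_items must be between 0 and pool_size")
    return failed_items / pool_size


# One failure in pools of different sizes.
for size in (100, 10):
    print(f"pool of {size}: {capacity_lost(size):.0%} of capacity lost")
```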
If the server running the virtual image fails, it will still be necessary to spin up a new image on another physical server and reassign connections. With the right software in place – such as VMware’s vSphere HA, Veeam Backup & Replication or Vision Solutions’ Double-Take – recovery from such a failure can be automated and the impact on the organisation minimised, with end users often unaware of any break in their service.
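The general idea of heartbeat-driven failover can be sketched in a few lines. This is a toy model under stated assumptions, not the behaviour or API of any of the products named above: the `Host` class, the `fail_over` function, the five-second timeout and the least-loaded placement rule are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    last_heartbeat: float  # time the last heartbeat was received, in seconds
    vms: list[str] = field(default_factory=list)


def fail_over(hosts: list[Host], now: float, timeout: float = 5.0) -> dict[str, str]:
    """Move VMs off hosts whose heartbeat is older than `timeout` seconds.

    Returns a mapping of VM name to the name of its new host.
    """
    healthy = [h for h in hosts if now - h.last_heartbeat <= timeout]
    failed = [h for h in hosts if now - h.last_heartbeat > timeout]
    if failed and not healthy:
        raise RuntimeError("no healthy host available for failover")

    moves: dict[str, str] = {}
    for dead in failed:
        for vm in dead.vms:
            # Assumed placement rule: pick the healthy host running the fewest VMs.
            target = min(healthy, key=lambda h: len(h.vms))
            target.vms.append(vm)
            moves[vm] = target.name
        dead.vms.clear()
    return moves


hosts = [
    Host("host-a", last_heartbeat=100.0, vms=["vm1", "vm2"]),
    Host("host-b", last_heartbeat=109.0, vms=["vm3"]),
    Host("host-c", last_heartbeat=110.0),
]
print(fail_over(hosts, now=110.0))
```

Here host-a has been silent for ten seconds, so its two images are restarted on the remaining hosts. A production system would also have to handle shared storage, network reassignment and split-brain scenarios, none of which this sketch attempts.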
At the storage level, mirroring of live data will be required. For true HA, this needs to be a live-live, real-time synchronised approach. Near-real-time recovery can be achieved through snapshots: copies of the live data are made at regular intervals and, on the failure of a storage array, a “new” array is rapidly spun up from the most recent copy. Most storage vendors offer different forms of storage HA, with EMC, NetApp, Dell, IBM, HDS and HP all having enhanced versions of their general storage offerings with extra HA capabilities.
With networking, the move towards “fabric” networks is providing greater HA. Hierarchical network topologies had basic HA capabilities because they were best-effort constructs rather than defined point-to-point connections. However, this also meant they were slow and, on failure, could take a long time to reconfigure to the point where any performance was regained.
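The trade-off between snapshots and synchronous mirroring comes down to how much recent data can be lost. The sketch below assumes snapshots are taken at fixed intervals starting at time zero and that recovery uses the latest one taken before the failure; the function name and the figures are illustrative.

```python
def data_loss_window(failure_time: float, snapshot_interval: float) -> float:
    """Seconds of writes lost when recovering from the latest snapshot.

    Assumes snapshots at t = 0, interval, 2 * interval and so on.
    """
    if snapshot_interval <= 0:
        raise ValueError("snapshot_interval must be positive")
    if failure_time < 0:
        raise ValueError("failure_time must not be negative")
    return failure_time % snapshot_interval


# A failure 3,700 seconds in, with snapshots every 15 minutes (900 seconds).
print(data_loss_window(3700, 900))
```

The worst case approaches the full snapshot interval, so shortening the interval narrows the window, and a synchronous live-live mirror is the limiting case in which it closes entirely.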
Fabric networks reduce the network to fewer levels, and provide a more dynamic means of reconfiguring the network should any item fail.
In all the above cases, the key is still to have more resources than are required, in an “n+1” (one more item of equipment than is anticipated as being needed) or “n+m” (several more items than needed) architecture.
For true single-facility HA, datacentre managers’ resource strategy must also extend to the design of the datacentre itself – power management and distribution, cooling and power backup all need to be reviewed.
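As a rough illustration of n+1 and n+m sizing, the sketch below works out n from a peak load and a per-unit capacity, then adds the spares. The function name, the units and the figures are assumptions for the example.

```python
import math


def units_to_provision(peak_load: float, unit_capacity: float, spares: int = 1) -> int:
    """Return n + m, where n units carry the peak load and m are spares."""
    if unit_capacity <= 0:
        raise ValueError("unit_capacity must be positive")
    if peak_load < 0 or spares < 0:
        raise ValueError("peak_load and spares must not be negative")
    n = math.ceil(peak_load / unit_capacity)  # units needed with no headroom
    return n + spares


# A 450kW peak load served by 100kW modules.
print(units_to_provision(450, 100))            # n+1
print(units_to_provision(450, 100, spares=3))  # n+m with m = 3
```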
Wherever possible, a modular datacentre approach should replace the monolithic. For example, UPS systems should not be bought as single units, where any failure could bring the whole system down. Modular architectures from the likes of Eaton or Emerson allow for component failure while maintaining capability through load balancing and the capacity to replace modules while the system is still live.
Importance of a mirrored facility
However, the ultimate in HA can only be obtained through the complete mirroring of the whole architecture across facilities. This requires heavy investment, and full monitoring of transaction and replication streams. This means that when a component fails, requiring a changeover to the mirrored facility, details of transactions that cannot be fully recovered are maintained, so systems do not become corrupted through partial records being logged.
For HPC, many see a “scale out” architecture as the solution. Here, to gain greater performance, more resource is thrown into the pool. But it is not the answer to all workloads. For example, the mainframe is still a better platform for many online transaction processing (OLTP) workloads, and IBM’s Power platform can deal with certain types of number crunching more effectively than Intel or AMD CPUs.
For an HA HPC platform, design will still be key. Start from what the organisation really needs in the way of HA and HPC and design accordingly – ensure the costs of each approach are made apparent to the business so it can make the final decision. You may be surprised that what started out as a solid need for a 100% available platform quickly becomes a need for 99.999% or less – and that small decrease in availability can mean a difference of millions of pounds in cost if it transpires that a live-live mirror across multiple datacentres is not required.
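A back-of-envelope calculation shows why mirroring across facilities raises availability so sharply. The sketch assumes the facilities fail independently and that changeover is instant and loss-free, both of which are idealisations; the function name is illustrative.

```python
def combined_availability(site_availability: float, sites: int = 2) -> float:
    """Availability of mirrored facilities, given as a fraction between 0 and 1.

    Assumes independent failures and a perfect changeover, so the service is
    down only when every facility is down at the same time.
    """
    if not 0.0 <= site_availability <= 1.0:
        raise ValueError("site_availability must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    return 1.0 - (1.0 - site_availability) ** sites


# Two mirrored facilities, each 99% available on its own.
print(f"{combined_availability(0.99, 2):.4%}")
```

In practice, shared dependencies such as power grids, network carriers or software faults mean real-world figures will fall short of this idealised number.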