I believe that AS/400 cluster support is available in OS/400 V4R4. What are the benefits of this option in terms of reducing, or cutting out altogether, systems downtime? Also how does this work in practice?
IBM did indeed introduce clustering in Version 4 Release 4 of OS/400. Clustering is aimed at providing extremely high levels of resilience and availability - or to put it another way, very low levels of unavailability, writes Nigel Adams. In this article I will try to describe how clustering support has been implemented on the AS/400, and what it means to the user.
System downtime can be caused by many different factors. A disaster, such as fire or flood, can render a system completely unworkable. Less dramatic, but still potentially serious, are other causes of unscheduled downtime, such as power loss, hardware problems, network failures or software bugs. Scheduled downtime can be caused by factors such as hardware or software upgrades, or the application of software fixes.
As companies have moved to the world of e-business and the internet, the amount of downtime is becoming increasingly important. When computer systems are doing payroll runs or printing out invoices, a delay of a few hours may be undesirable, but when a system is being used for e-business, a crash can mean that effectively the company is closed for business while the system is down.
It seems that one cannot read a computer magazine without finding an article about the crash of one e-business site or another, and the impact that this has had on the company's business. A report by Eagle Rock Alliance of West Orange, New Jersey, which looked at the financial impact of outages, gave figures for different types of business, such as $89,500 per hour for airline reservations, $199,500 for catalogue sales, $2.6m for credit card sales authorisations, and $6.45m for retail brokerages. The figures appear on a flyer IBM has produced called 'Just the facts about High Availability'. It can be seen that an investment in additional hardware and software to reduce this downtime can give a very significant payback.
The AS/400 has always had a reputation for being an extremely reliable system. The Gartner Group produced a report, based upon observation of real customer installations, in which the AS/400 was the best performing single system in terms of availability, with an average of 5.2 hours of unscheduled downtime per system per year. The report was entitled 'Platform Availability Data: Can you Spare a Minute?' and was published on 29 October 1998. However, given the costs of downtime quoted in the paragraph above, even this level of downtime may be unacceptable. AS/400 clustering was introduced as a means of providing even higher levels of availability.
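To put the two sets of figures together, the short calculation below converts 5.2 hours of downtime a year into an availability percentage and multiplies it by the hourly outage costs quoted above. It is only a rough sketch: it assumes that every Eagle Rock Alliance figure is a cost per hour, that the whole 5.2 hours falls within trading time, and that a year has 8,760 hours. The variable names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)

# Average unscheduled downtime per system per year, from the Gartner Group report.
downtime_hours = 5.2

# Outage costs in US dollars, assumed to be per hour in every case.
cost_per_hour = {
    "airline reservations": 89_500,
    "catalogue sales": 199_500,
    "credit card sales authorisations": 2_600_000,
    "retail brokerages": 6_450_000,
}

# Availability is the fraction of the year the system is up.
availability = 1 - downtime_hours / HOURS_PER_YEAR

# Yearly exposure if all of the downtime hits a revenue-earning period.
annual_cost = {name: rate * downtime_hours for name, rate in cost_per_hour.items()}

print(f"Availability: {availability:.4%}")
for name, cost in annual_cost.items():
    print(f"{name}: ${cost:,.0f} per year")
```

On these assumptions, 5.2 hours a year is roughly 99.94 per cent availability, yet the yearly exposure still runs from about $465,000 for airline reservations to about $33.5m for a retail brokerage.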
The AS/400 provided a number of availability options prior to the introduction of clustering. These include journaling, access path protection, auxiliary storage pools, RAID-5 and disk mirroring. These functions can greatly reduce the time taken to recover after a failure. However, clustering takes AS/400 availability to an even higher level.
With the introduction of clustering the AS/400 offers a continuous availability solution. This provides fail over and switch over capabilities for systems that are used as database servers or application servers. If a system failure occurs, the functions that are provided on a clustered server can be switched to one or more designated backup systems that contain a replica of the critical resource. The fail over can be automatic in the event of a system failure, or it can be controlled by manually initiating a switch over.
In the event of a failure, Cluster Resource Services, which runs on all systems, provides the switch over. This is done with minimal impact on end users and on the applications that are running. Data requests are automatically re-routed to the new primary system.
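The difference between an automatic fail over and a manually initiated switch over can be sketched in a few lines of Python. This is a toy model for illustration only, not the Cluster Resource Services interface: the class, its methods and the system names SYSA, SYSB and SYSC are all hypothetical, and the idea that backups are tried in a fixed order is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ClusteredService:
    """A critical resource with one primary system and an ordered list of backups."""

    primary: str
    backups: list[str] = field(default_factory=list)

    def _promote_next_backup(self) -> str:
        # The first designated backup takes over as the new primary.
        if not self.backups:
            raise RuntimeError("no backup system available")
        self.primary = self.backups.pop(0)
        return self.primary

    def fail_over(self, failed_node: str) -> str:
        """Automatic case: a node has failed and drops out of the list."""
        if failed_node == self.primary:
            return self._promote_next_backup()
        # A failed backup is simply removed; the primary keeps serving.
        if failed_node in self.backups:
            self.backups.remove(failed_node)
        return self.primary

    def switch_over(self) -> str:
        """Manual case: the old primary is healthy, so it becomes the last backup."""
        old_primary = self.primary
        new_primary = self._promote_next_backup()
        self.backups.append(old_primary)
        return new_primary

    def route_request(self, request: str) -> str:
        # Requests always go to whichever system is currently primary.
        return f"{request} -> {self.primary}"


service = ClusteredService("SYSA", ["SYSB", "SYSC"])
service.switch_over()      # planned maintenance on SYSA: SYSB takes over
service.fail_over("SYSB")  # SYSB then crashes: SYSC takes over
print(service.route_request("read order 42"))
```

In this sketch a planned switch over keeps the old primary in the list as a backup, whereas a fail over removes the failed system altogether, so requests end up routed to SYSC with SYSA still standing by.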
The clustering capability that became available with Version 4 Release 4 allows you to set up a group of AS/400s in order to provide extremely high levels of availability. Each system in the cluster is called a cluster node, and a cluster can contain between 2 and 128 nodes. The nodes are connected via an IP network. Resources that are available across multiple nodes within the cluster are known as cluster resources. These could be, for example, AS/400 objects, IP addresses, applications and physical resources.
A cluster resource that persists across any single point of failure within the cluster is known as a resilient resource. A recovery domain is a group of nodes within the cluster which are grouped together to provide availability for one or more cluster resources. Resources that are grouped together for recovery across a recovery domain are known as a cluster resource group. AS/400 clusters use the separate server or shared-nothing model. This means that critical resources are not shared between nodes, but are replicated. Although resources may appear to be shared since they are accessible from other nodes, at any moment each resource is actually hosted by a single system.
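The terms defined above fit together as a small data model, sketched here in Python. It is an illustration under stated assumptions rather than a description of the OS/400 implementation: the class and attribute names are hypothetical, the recovery domain is assumed to be an ordered list whose first entry is the node currently hosting the resource, and a resource is treated as resilient simply because at least one other node holds a replica.

```python
from dataclasses import dataclass

MIN_NODES, MAX_NODES = 2, 128  # cluster size limits quoted for V4R4


@dataclass
class Cluster:
    nodes: list[str]

    def __post_init__(self) -> None:
        if not MIN_NODES <= len(self.nodes) <= MAX_NODES:
            raise ValueError("a cluster contains between 2 and 128 nodes")
        if len(set(self.nodes)) != len(self.nodes):
            raise ValueError("node names must be unique")


@dataclass
class ClusterResourceGroup:
    name: str
    cluster: Cluster
    recovery_domain: list[str]  # assumed order: current host first, then replicas

    def __post_init__(self) -> None:
        if not self.recovery_domain:
            raise ValueError("a recovery domain needs at least one node")
        unknown = set(self.recovery_domain) - set(self.cluster.nodes)
        if unknown:
            raise ValueError(f"nodes not in the cluster: {sorted(unknown)}")

    @property
    def host(self) -> str:
        # Shared-nothing: at any moment exactly one node hosts the resource.
        return self.recovery_domain[0]

    @property
    def replicas(self) -> list[str]:
        # The other nodes hold replicated copies rather than sharing the original.
        return self.recovery_domain[1:]

    def is_resilient(self) -> bool:
        # Simplification: one replica is enough to survive a single node failure.
        return len(self.recovery_domain) >= 2


cluster = Cluster(["SYSA", "SYSB", "SYSC"])
orders = ClusterResourceGroup("ORDERS", cluster, ["SYSA", "SYSB"])
print(orders.host, orders.replicas, orders.is_resilient())
```

Here the ORDERS group is hosted on SYSA with a replica on SYSB, while SYSC belongs to the cluster but sits outside this particular recovery domain.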
High availability middleware is a group of applications that provide replication and management between AS/400s. These applications have now been extended to provide AS/400 cluster management as well. This software is therefore needed to supply the replication functions and cluster management capabilities.
There are three High Availability Business Partners who are active in providing clustering management utilities for the AS/400 - DataMirror, Lakeview Technology, and Vision Solutions. These three suppliers offer software which can build upon the clustering capability offered in V4R4, allowing customers to set up and manage AS/400 clusters for high availability.
The AS/400 has always had a well justified reputation for solid reliability. With the clustering capability that was introduced with Version 4 Release 4 the AS/400 offers a solution which should satisfy the most demanding requirements in terms of continuous availability.