I believe that AS/400 cluster support is available in OS/400 V4R4.
What are the benefits of this option in terms of reducing, or
cutting out altogether, systems downtime? Also, how does this work
in practice?
IBM did indeed introduce clustering in Version 4 Release 4 of
OS/400. Clustering is aimed at providing extremely high levels of
resilience and availability - or to put it another way, very low
levels of unavailability, writes Nigel Adams. In this article I
will try to describe how clustering support has been implemented on
the AS/400, and what it means to the user.
System downtime can be caused by many different factors. A
disaster, such as fire or flood, can render a system completely
unworkable. Less dramatic, but still potentially serious, are
other causes of unscheduled downtime, such as power loss, hardware
problems, network failures, or software bugs. Scheduled downtime
can be caused by factors such as hardware or software upgrades, or
the application of software fixes.
As companies have moved to the world of e-business and the
internet, the amount of downtime is becoming increasingly
important. When computer systems are doing payroll runs or printing
out invoices, a delay of a few hours may be undesirable, but when a
system is being used for e-business, a crash can mean that
effectively the company is closed for business while the system is
down.
It seems that one cannot read a computer magazine without
finding an article in it talking about the crash of one e-business
site or another, and the impact that this has had on the company's
business. A report produced by Eagle Rock Alliance of West Orange,
New Jersey, which looked at the financial impact of outages,
produced figures for different types of businesses, such as $89,500
per hour for airline reservations, $199,500 for catalogue sales,
$2.6m for credit card sales authorisations, and $6.45m for retail
brokerages. The figures appear on a flyer IBM has produced called
'Just the facts about High Availability'. It can be seen that an
investment in additional hardware and software to reduce this
downtime can give very significant payback.
The AS/400 has always had a reputation for being an extremely
reliable system. The Gartner Group produced a report in October
1998 in which, based upon observation of real customer
installations, the AS/400 was the best performing single system in
terms of availability, with an average of 5.2 hours of unscheduled
downtime per system per year. The report was entitled 'Platform
Availability Data: Can you Spare a Minute?' and was published on
29th October 1998. However, given the costs of downtime quoted in
the paragraph above, even this level of downtime may be unacceptable.
AS/400 Clustering was introduced as a means of providing even
higher levels of availability.
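To put those two sets of figures together, the short Python sketch
below converts 5.2 hours of downtime a year into an availability
percentage and an annual cost. It is only an illustration: it assumes
that every figure quoted from the Eagle Rock Alliance report is a cost
per hour, that a year has 8,760 hours, and the function names are my
own rather than anything from IBM or Gartner.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

# Outage costs quoted in the article, in US dollars.
# Assumption: every figure is a cost per hour of downtime.
HOURLY_COST = {
    "airline reservations": 89_500,
    "catalogue sales": 199_500,
    "credit card sales authorisations": 2_600_000,
    "retail brokerages": 6_450_000,
}


def availability(downtime_hours_per_year):
    """Fraction of the year for which the system is up."""
    return 1 - downtime_hours_per_year / HOURS_PER_YEAR


def annual_outage_cost(downtime_hours_per_year, cost_per_hour):
    """Cost of a year's downtime at a flat hourly rate."""
    return downtime_hours_per_year * cost_per_hour


if __name__ == "__main__":
    downtime = 5.2  # hours of unscheduled downtime per system per year
    print(f"Availability: {availability(downtime):.2%}")
    for business, cost in HOURLY_COST.items():
        yearly = annual_outage_cost(downtime, cost)
        print(f"{business}: ${yearly:,.0f} per year")
```

On these assumptions, 5.2 hours a year works out at roughly 99.94 per
cent availability, yet the same 5.2 hours would cost an airline
reservations business about $465,400 a year.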
The AS/400 provided a number of availability options prior to the
introduction of clustering. These include journaling, access path
protection, auxiliary storage pools, RAID-5, and disk mirroring.
These functions can be used to greatly reduce the time taken to
recover after a failure. However, clustering takes the AS/400
availability to an even higher level.
With the introduction of clustering the AS/400 offers a
continuous availability solution. This provides fail over and
switch over capabilities for systems that are used as database
servers or application servers. If a system failure occurs, the
functions that are provided on a clustered server can be switched
to one or more designated backup systems that contain a replica of
the critical resource. The fail over can be automatic in the event
of a system failure, or it can be controlled by manually initiating
a switch over.
In the event of a failure, Cluster Resource Services, which is
running on all systems, performs the switch over. This is done with
minimal impact on end users or applications that are running. Data
requests are automatically re-routed to the new primary system.
The clustering capability that became available with Version 4
Release 4 allows you to set up a group of AS/400s in order to
provide extremely high levels of availability. Each system in the
cluster is called a cluster node, and a cluster can contain between
2 and 128 nodes. The nodes are connected via an IP network.
Resources that are available across multiple nodes within the
cluster are known as cluster resources. These could be, for
example, AS/400 objects, IP addresses, applications and physical
resources.
A cluster resource that persists across any single point of
failure within the cluster is known as a resilient resource. A
recovery domain is a set of nodes within the cluster which are
grouped together to provide availability for one or more cluster
resources. Resources that are grouped together for recovery across
a recovery domain are known as a cluster resource group. AS/400
clusters use the separate server or shared-nothing model. This
means that critical resources are not shared between nodes, but are
replicated. Although resources may appear to be shared since they
are accessible from other nodes, at any moment each resource is
actually hosted by a single system.
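The relationship between these terms can be sketched in a few lines of
Python. This is a toy model for illustration only, not the Cluster
Resource Services interface: the class, its methods and the node names
are all hypothetical. It assumes that the recovery domain is an
ordered list with the current primary first, and that a demoted
primary moves to the back of the list.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterResourceGroup:
    """Resources recovered together across an ordered recovery domain."""
    name: str
    recovery_domain: list  # node names; index 0 is the current primary
    failed: set = field(default_factory=set)

    @property
    def primary(self):
        # Shared-nothing: exactly one node hosts the resource at a time.
        return self.recovery_domain[0]

    def switch_over(self):
        """Promote the first healthy backup; demote the old primary."""
        for node in self.recovery_domain[1:]:
            if node not in self.failed:
                old_primary = self.primary
                self.recovery_domain.remove(node)
                self.recovery_domain.remove(old_primary)
                self.recovery_domain.insert(0, node)
                self.recovery_domain.append(old_primary)
                return node
        raise RuntimeError(f"no healthy backup available for {self.name}")

    def node_failed(self, node):
        """Record a failure; fail over only if the primary was lost."""
        self.failed.add(node)
        if node == self.primary:
            return self.switch_over()
        return self.primary


if __name__ == "__main__":
    crg = ClusterResourceGroup("ORDERS", ["SYSA", "SYSB", "SYSC"])
    crg.node_failed("SYSB")         # a backup fails, SYSA stays primary
    print(crg.node_failed("SYSA"))  # the primary fails, SYSC takes over
```

Calling switch_over() directly corresponds to a manually initiated
switch over, while node_failed() models the automatic fail over. In
both cases the resource is hosted by one node at any moment.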
High availability middleware is a group of applications that
provide replication and management between AS/400s. These
applications have now been extended to provide AS/400 cluster
management middleware. This software is therefore needed to
provide the replication functions and cluster management
capabilities.
There are three High Availability Business Partners who are
active in providing clustering management utilities for the AS/400
- DataMirror, Lakeview Technology, and Vision Solutions. These
three suppliers offer software which can build upon the clustering
capability offered in V4R4, allowing customers to set up and manage
AS/400 clusters for high availability.
The AS/400 has always had a well-justified reputation for solid
reliability. With the clustering capability that was introduced
with Version 4 Release 4, the AS/400 offers a solution which should
satisfy the most demanding requirements in terms of continuous
availability.