Designing Disaster Tolerant High Availability Clusters: Chapter 4, Designing a Continental Cluster

Understanding Continental Cluster Concepts

The Continentalclusters product provides the ability to monitor a high availability cluster and fail over mission critical applications to another cluster if the monitored cluster should become unavailable. In the following example, the Los Angeles cluster runs the mission critical application and replicates data to the New York cluster, which has another copy of the mission critical application ready to run in case of failover. In addition, Continentalclusters supports mutual recovery, which allows for mission critical applications to be run on each cluster, with each cluster configured to recover the mission critical applications of the other.

Because clusters may be separated over wide geographical distances, and because they have independent function, the operation of clusters in a Continentalclusters configuration is somewhat different from that of typical Serviceguard clusters. A typical Continentalclusters environment is shown in Figure 4-1 “Sample Continentalclusters Configuration”.

Figure 4-1 Sample Continentalclusters Configuration

Two packages are running on the cluster in Los Angeles, and their data is replicated to the cluster in New York. Physical data replication is carried out using ESCON (Enterprise Systems Connection) links between the disk array hardware in New York and Los Angeles via an ESCON/WAN converter at each end. The New York cluster is running a monitor that checks the status of the Los Angeles cluster. In this example, the Los Angeles cluster runs just like any Serviceguard cluster, with applications configured in packages that may fail over from node to node as necessary. The New York cluster is configured with a recovery version of the packages that are running on the Los Angeles cluster. These packages do not run under normal circumstances, but are set to start up when they are needed. In addition, either cluster may run other packages that are not involved in Continentalclusters operation.

Mutual Recovery Configuration

Bi-directional failover is supported in what is called a mutual recovery configuration. This lets you define recovery groups for primary packages running in both component clusters in the Continentalclusters configuration. Figure 4-2 “Sample Mutual Recovery Configuration” shows a mutual recovery configuration.

Figure 4-2 Sample Mutual Recovery Configuration

In the above figure, the salespkg is running on the New York cluster and can be recovered by the Los Angeles cluster. Similarly, the custpkg running on the Los Angeles cluster can be recovered by the New York cluster. As stated previously, physical data replication is carried out using ESCON (Enterprise Systems Connection) links between the disk array hardware in New York and Los Angeles via an ESCON/WAN converter at each end. Each cluster is running a monitor that checks the status of the alternate cluster.

As shown in the above example, each cluster runs just like any Serviceguard cluster, with applications configured in packages that may fail over from node to node as necessary. Each cluster is configured with a recovery version of the packages that are running on the alternate cluster. These packages do not run under normal circumstances, but are set to start up when they are needed. In addition, either cluster may run other packages that are not involved in Continentalclusters operation.
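The mutual recovery layout described above can be pictured as a small data model. The following is only an illustrative Python sketch, not Continentalclusters configuration syntax: the `RecoveryGroup` type, the helper functions, and the recovery package names `salespkg_bak` and `custpkg_bak` are hypothetical, while `salespkg`, `custpkg`, and the two cluster names come from the example.

```python
from typing import List, NamedTuple, Set, Tuple


class RecoveryGroup(NamedTuple):
    """One unit of recovery: a primary package and its recovery counterpart."""
    primary_cluster: str
    primary_package: str
    recovery_cluster: str
    recovery_package: str  # hypothetical names, chosen for illustration


# Mutual recovery: each cluster is primary for one package
# and the recovery site for the other.
GROUPS = [
    RecoveryGroup("New York", "salespkg", "Los Angeles", "salespkg_bak"),
    RecoveryGroup("Los Angeles", "custpkg", "New York", "custpkg_bak"),
]


def recovery_plan(failed_cluster: str,
                  groups: List[RecoveryGroup]) -> List[Tuple[str, str]]:
    """Return (recovery_cluster, recovery_package) pairs for a failed cluster."""
    return [(g.recovery_cluster, g.recovery_package)
            for g in groups if g.primary_cluster == failed_cluster]


def monitoring_pairs(groups: List[RecoveryGroup]) -> Set[Tuple[str, str]]:
    """Return (monitoring_cluster, monitored_cluster) pairs implied by the groups."""
    return {(g.recovery_cluster, g.primary_cluster) for g in groups}
```

In this model, a failure of the Los Angeles cluster yields a plan that starts only the recovery package for custpkg on New York, and the two groups together imply that each cluster monitors the other.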

Application Recovery in a Continental Cluster

If a given cluster of a continental cluster should become unavailable, Continentalclusters allows an administrator to issue a single command (cmrecovercl, described later) to transfer mission critical applications from that cluster to another cluster, making sure that the packages do not run on both clusters at the same time. Transfer is not automatic, although it is automated through a recovery command, which a root user must issue. The result after issuing the recovery command is shown in Figure 4-3 “Continental Cluster After Recovery”.

Figure 4-3 Continental Cluster After Recovery

The movement of an application from one cluster to another cluster does not replace local failover activity; packages are normally configured to fail over from node to node as they would on any high availability cluster. Cluster recovery—failover of packages to a different cluster—occurs only after the following:

  • Continentalclusters detects the problem.

  • Continentalclusters sends you a notification of the problem.

  • You verify that the monitored cluster has failed.

  • You issue the cluster recovery command.

Monitoring over a Wide Area Network

A monitor package running on one cluster tracks the health of another cluster and sends notification to system administrators if the state of the monitored cluster changes. (If a cluster contains any packages to be recovered it must be monitored.) The monitor software polls the monitored cluster at a specific MONITOR_INTERVAL defined in an ASCII configuration file, which also indicates when and where to send messages if there is a state change.

The physical separation between clusters will require communication by way of a Wide Area Network (WAN). Since the polling takes place across the WAN, interruptions of WAN service cannot always be differentiated from cluster failure states. This means that if the WAN is unreliable, the monitoring facility will often detect and report an unreachable state for the monitored cluster that is actually an interruption of WAN service.

Because the monitoring is indeterminate in some instances, information from independent sources must be gathered to determine the need for proceeding with the recovery process. For these reasons, cluster recovery is not automatic, but must be initiated by a root user. Once initiated, however, the cluster recovery is automated to reduce the chance of human error that might occur if manual steps were needed. In Continentalclusters, a system of cluster events and notifications is provided so that events can be easily tracked, and so that users will know when to seek additional information before initiating recovery.
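The polling behavior described in this section can be sketched roughly as follows. This is an illustrative Python model rather than the monitor's actual logic: `detect_events` and the sample list are hypothetical, each sample stands for one poll taken at the configured MONITOR_INTERVAL, and the state names are the four that the monitor reports.

```python
from typing import Iterable, Iterator, Tuple

# The four cluster states reported by the monitor.
STATES = {"Up", "Down", "Unreachable", "Error"}


def detect_events(samples: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (poll_index, old_state, new_state) each time the polled state changes."""
    previous = None
    for index, state in enumerate(samples):
        if state not in STATES:
            raise ValueError(f"unknown cluster state: {state!r}")
        if previous is not None and state != previous:
            yield index, previous, state
        previous = state


# One entry per poll; a WAN outage and a real cluster failure
# both appear here simply as "Unreachable".
polls = ["Up", "Up", "Unreachable", "Unreachable", "Up"]
events = list(detect_events(polls))
```

The model makes the limitation visible: the event stream records that the state became Unreachable, but nothing in it distinguishes a failed cluster from an interrupted WAN.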

Cluster Events

A cluster event is a change of state in a monitored cluster. The four cluster states reported by the monitor are Unreachable, Down, Up, and Error. Table 4-1 “Monitored States and Possible Causes” summarizes possible causes for the cluster events with regard to both the monitored cluster and the WAN. It is clear that in many cases, the causes of cluster events are indeterminate without additional information that is not available to the software.

Table 4-1 Monitored States and Possible Causes

| Cluster Event (Old state -> New state) | Cluster-related causes | WAN-related causes |
| --- | --- | --- |
| Up -> Unreachable | Cluster went down; no nodes are responding to network inquiries | WAN failure |
| Down -> Unreachable | Cluster was down and nodes are no longer responding | WAN failure |
| Error -> Unreachable | Error resolved but cluster down and nodes not responding; or WAN-related cause | WAN failure |
| Up -> Down | Cluster has been halted, but at least one node is still responding to network inquiries | No WAN problems |
| Error -> Down | Error resolved, cluster is down | WAN problem was fixed, cluster is down |
| Unreachable -> Down | Cluster nodes were rebooted but the cluster was not started | WAN came up but the cluster was not running |
| Up -> Error | Serviceguard version or security file mismatch, software error | WAN is misconfigured, or DNS server crashed or set up incorrectly |
| Down -> Error | Serviceguard version or security file mismatch, software error | WAN is misconfigured, or DNS server crashed or set up incorrectly |
| Unreachable -> Error | Serviceguard version or security file mismatch, software error | WAN problem was fixed, but the error condition still exists |
| Down -> Up | Cluster started | No WAN problems |
| Unreachable -> Up | Cluster nodes were rebooted and the cluster started | WAN came up and the cluster was already running |
| Error -> Up | Error resolved, cluster is up | WAN problem was fixed, cluster is up |

 

NOTE: There is only one condition under which cmclsentryd will determine that the cluster has Error status: all nodes are unreachable except those which have Serviceguard Error status. (If any nodes are Down or Up, then the cluster status will take one of those values, rather than Error.)
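The rule in this note can be expressed as a short function. This is a sketch only, not the cmclsentryd implementation: the function name is hypothetical, and the choice that Up takes precedence over Down when both are present is an assumption, since the note says only that the cluster takes "one of those values".

```python
from typing import Iterable


def cluster_status(node_states: Iterable[str]) -> str:
    """Derive a cluster status from per-node statuses, following the note above."""
    states = set(node_states)
    if not states:
        raise ValueError("at least one node status is required")
    # Any Up or Down node rules out Error status for the cluster.
    # Assumption: Up wins over Down when both are present.
    if "Up" in states:
        return "Up"
    if "Down" in states:
        return "Down"
    # Error is reported only when every node is Unreachable
    # except those that have Error status.
    if "Error" in states:
        return "Error"
    return "Unreachable"
```

For example, a cluster with one node in Error status and the rest unreachable is reported as Error, while a single Down node among Error nodes makes the cluster Down.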

Interpreting the Significance of Cluster Events

Because some cluster events (e.g., Up -> Unreachable) can be caused by changes in either a cluster state or a WAN state, additional independent information is required to achieve the primary objective of determining whether you need to recover a cluster’s applications. Sources of independent information include:

  • Contact with the WAN provider

  • Contact with the administrator of the monitored cluster

  • Contact with the local cluster administrator

  • Contact with company executives

When worrisome cluster events persist, obtain as much information as possible, including authorization to recover if your business practices require it, and then issue the recovery command.

How Notifications Work

A central part of the operation of Continentalclusters is the transmission of notifications following the detection of a cluster event. Notifications occur at specifically coded times, and at two different levels:

  • Alert—when a cluster event should be considered noteworthy.

  • Alarm—when an event shows evidence of a cluster failure.

Notifications are typically sent as:

  • Email messages

  • SNMP traps

  • Text log files

  • OPC messages to OpenView IT/Operations

In addition, notifications are sent to an event log on the system where monitoring is taking place.

NOTE: An email message can be sent to an address supplied by a pager service that will forward the message to a specified pager system. Contact your pager service provider for more information.

Alerts

Alerts are informational. Some typical uses of alerts include:

  • Notification that a cluster has been halted for a significant amount of time.

  • Notification that a cluster has come up after being down or unreachable.

  • Notification that a cluster came down for any reason.

  • Notification that a cluster has been in an unreachable state for a short period of time. An alert is sent in this case as a warning that an alarm might be issued later if the cluster’s state remains unreachable for a longer time.

The expected process in dealing with alerts is to continue watching for additional notifications and to contact individuals at the site of the monitored cluster to see whether problems exist.

Alarms

Alarms are intended to indicate that a cluster failure might have taken place. The most common example of an alarm is the following:

  • Notification that a cluster has been in an unreachable state for a significant amount of time that you specify.

The expected process in dealing with cluster events that persist at the alarm level is to obtain as much information as possible, including authorization to recover, if your business practices require this, and then to issue the recovery command.

Creating Notifications for Failure Events

For events that might indicate cluster failure, you can show the escalation of your concern over cluster health by defining alerts followed by one or more alarms. A typical sequence is to issue a cluster alert at 5 minutes and 10 minutes followed by a cluster alarm at 15 minutes. This could be accomplished by entering two CLUSTER_ALERT lines in the configuration file, and one CLUSTER_ALARM line. A detailed example is provided in the comments in the ASCII configuration file template, shown in “Editing Section 3—Monitoring Definitions”

Creating Notifications for Events that Indicate a Return of Service

For those events that indicate that the cluster is back online or that communication with the monitor has been restored, use cluster alerts to show the de-escalation of concern. In this case, use a CLUSTER_ALERT line in the configuration file with a time of zero (0), so that notifications are sent as soon as the return to service is detected.
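The escalation and de-escalation schedules described in the two sections above can be modeled as follows. This is an illustrative Python sketch and does not reproduce the syntax of the ASCII configuration file: the rule tuples and function names are hypothetical stand-ins for CLUSTER_ALERT and CLUSTER_ALARM lines, and times are in minutes spent in the new state.

```python
from typing import List, Tuple

Rule = Tuple[str, int]  # (notification level, minutes in the new state)

# Escalation for a cluster that becomes unreachable:
# alerts at 5 and 10 minutes, then an alarm at 15 minutes.
UNREACHABLE_RULES: List[Rule] = [("ALERT", 5), ("ALERT", 10), ("ALARM", 15)]

# De-escalation for a return of service: an alert at time zero.
UP_RULES: List[Rule] = [("ALERT", 0)]


def due_notifications(rules: List[Rule], minutes_in_state: int) -> List[Rule]:
    """Return every rule whose time threshold has been reached."""
    return [(level, at) for level, at in rules if at <= minutes_in_state]


def alarm_issued(rules: List[Rule], minutes_in_state: int) -> bool:
    """True once any alarm-level rule has fired."""
    return any(level == "ALARM"
               for level, _ in due_notifications(rules, minutes_in_state))
```

With these rules, a cluster that has been unreachable for 12 minutes has produced two alerts but no alarm, and a return to service produces its alert immediately.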

Performing Cluster Recovery

When a CLUSTER_ALARM is issued, there may be a need for recovery, and the recovery command, cmrecovercl, is enabled for use by the root user. Cluster recovery is carried out at the site of the recovery cluster by using the cmrecovercl command, as follows:

# cmrecovercl

This command will fail if a cluster alarm has not been issued. The command has the effect of halting any data replication activity from the failed cluster to the local cluster, and starting up on the local cluster all the recovery packages that are pre-configured in recovery groups, which are the units of recovery in a continental cluster.

If the -g RecoveryGroup option is specified with the command, the recovery process (halting data replication activity and starting the recovery package) is performed only for the specified recovery group.

After the cmrecovercl command is issued, there is a delay of at least 90 seconds per recovery group as the command makes sure that the package is not active on another cluster.
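The preconditions and timing described above can be summarized in a small planning function. This is a sketch under stated assumptions and does not invoke or reimplement cmrecovercl: `plan_recovery` and its parameters are hypothetical, and the total delay assumes that the 90-second minimum applies to each recovery group in turn.

```python
from typing import List, Optional, Sequence, Tuple

MIN_DELAY_PER_GROUP_S = 90  # "at least 90 seconds per recovery group"


def plan_recovery(alarm_issued: bool,
                  configured_groups: Sequence[str],
                  only_group: Optional[str] = None) -> Tuple[List[str], int]:
    """Return (groups to recover, minimum expected delay in seconds)."""
    # The recovery command fails unless a cluster alarm has been issued.
    if not alarm_issued:
        raise RuntimeError("no cluster alarm has been issued; recovery is not enabled")
    if only_group is None:
        groups = list(configured_groups)  # recover every configured group
    elif only_group in configured_groups:
        groups = [only_group]             # models the -g RecoveryGroup option
    else:
        raise ValueError(f"unknown recovery group: {only_group!r}")
    # Assumption: groups are checked one after another.
    return groups, MIN_DELAY_PER_GROUP_S * len(groups)
```

Under these assumptions, recovering three groups takes at least 270 seconds, while restricting recovery to one group reduces the minimum wait to 90 seconds.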

Cluster recovery is done as a last resort, after all other approaches to restore the unavailable cluster have been exhausted. It is important to remember that cluster recovery sets in motion a process that cannot be easily reversed. Unlike the failover of a package from one node to another, failing over a package from one cluster to another normally involves a significant quantity of data that is being accessed from a new set of disks. Returning control to the original cluster will involve resynchronizing this data and resetting the roles of the clusters in a process that is easier for some data replication techniques than others.

NOTE: After a recovery, you cannot reverse directions and return a package to its original cluster without first reconfiguring the data replication hardware and/or software and synchronizing data. Therefore, you should be very cautious when deciding to use the cmrecovercl command.

Notes on Packages in a Continental Cluster

Packages behave somewhat differently in a continental cluster than in a normal Serviceguard environment. There are specific differences in:

  • Startup and Switching Characteristics

  • Network Attributes

Startup and Switching Characteristics

Normally, an application (package) can run on only one node at a time in a cluster. However, in a continental cluster, there are two clusters in which an application—the primary package or the recovery package—could operate on the same data. The primary package and the recovery package must not both be allowed to run at the same time. To prevent this, it is very important to ensure that packages are not allowed to start automatically and are not started up at inappropriate times.

To keep packages from starting up automatically when a cluster starts, you must set the AUTO_RUN parameter (PKG_SWITCHING_ENABLED prior to Serviceguard 11.12) to NO for all primary and recovery packages. Then use the cmmodpkg command with the -e <packagename> option to start up only the primary packages and enable switching. The cmrecovercl command, when run, will start up the recovery packages and enable switching during the cluster recovery operation.

WARNING! After initial testing is complete, the cmrunpkg and cmmodpkg commands or the equivalent options in SAM should never be used to start a recovery package unless cluster recovery has already taken place.

To prevent packages from being started at the wrong time and in the wrong place, you can use the following strategies:

  • Set the AUTO_RUN parameter (PKG_SWITCHING_ENABLED prior to Serviceguard 11.12) to NO for all primary and recovery packages.

  • Ensure that recovery package names are well known, and that personnel understand they should never be started with a cmrunpkg or cmmodpkg command unless the cmrecovercl command has been invoked first.

  • If a cluster has no packages to run before recovery, then do not allow packages to be run on that cluster with SAM.

Network Attributes

Another important difference between the packages in a continental cluster and the packages configured in a standard Serviceguard cluster is that different subnets are used in recovery packages than the subnets in the primary packages. The client application must be designed to reconnect to the appropriate IP address following a recovery operation.
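One way a client might handle the change of subnet is to try a list of known service addresses in order. The following is a minimal Python sketch, not part of Continentalclusters: the function name and the idea of an ordered endpoint list (primary address first, recovery address second) are assumptions made for illustration.

```python
import socket
from typing import Optional, Sequence, Tuple

Endpoint = Tuple[str, int]  # (host or IP address, TCP port)


def connect_first_available(endpoints: Sequence[Endpoint],
                            timeout: float = 3.0) -> Tuple[socket.socket, Endpoint]:
    """Try each endpoint in order and return the first successful connection."""
    last_error: Optional[OSError] = None
    for endpoint in endpoints:
        try:
            return socket.create_connection(endpoint, timeout=timeout), endpoint
        except OSError as exc:
            # Remember the failure and move on to the next address.
            last_error = exc
    raise ConnectionError(f"no endpoint reachable: {list(endpoints)}") from last_error
```

A client would list the primary package's address first and the recovery package's address second, so that after a recovery operation the next connection attempt falls through to the recovery subnet.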

How Serviceguard commands work in a Continentalcluster

Continentalclusters packages are manipulated manually by the user via Serviceguard commands and by cmcld automatically in the same way as any other packages.

In a continental cluster, the recovery package is not allowed to run at the same time as the primary, data sender, or data receiver packages. To enforce this, several Serviceguard commands behave in a slightly different manner when used in a continental cluster.

Table 4-2 “Serviceguard and Continentalclusters Commands” describes the Serviceguard commands whose behavior is different in a continental cluster environment. Specifically, when one of the following commands attempts to start or enable switching of a package, it first checks the status of the other packages in the recovery group. Based on this status, the operation is either allowed or disallowed.

Table 4-2 Serviceguard and Continentalclusters Commands

| Command | How the command works in Serviceguard | How the command works in Continentalclusters |
| --- | --- | --- |
| cmrunpkg | Runs a package | Will not start a recovery package if any of the primary, data receiver, or data sender packages in the same recovery group is running or enabled. Will not start a primary, data receiver, or data sender package if the recovery package in the same recovery group is running or enabled. |
| cmmodpkg -e | Enables the switching attribute for a highly available package | Will not enable switching on a recovery package if any of the primary, data receiver, or data sender packages in the same recovery group is running or enabled. Will not enable a primary, data receiver, or data sender package if the recovery package in the same recovery group is running or enabled. |
| cmhaltnode -f | Halts a node in a highly available cluster | Will not re-enable switching on a recovery package if any of the primary, data receiver, or data sender packages in the same recovery group is running or enabled. Will not re-enable a primary, data receiver, or data sender package if the recovery package in the same recovery group is running or enabled. |
| cmhaltcl -f | Halts daemons on all currently running systems | Will not re-enable switching on a recovery package if any of the primary, data receiver, or data sender packages in the same recovery group is running or enabled. Will not re-enable a primary, data receiver, or data sender package if the recovery package in the same recovery group is running or enabled. |
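The check that these commands perform before starting or enabling a package can be modeled as follows. This is an illustrative Python sketch, not Serviceguard code: the `Package` type, the role strings, the function name, and the recovery package name `salespkg_bak` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Package:
    name: str
    role: str              # "primary", "recovery", "data_sender", or "data_receiver"
    running: bool = False
    enabled: bool = False  # switching enabled


def may_start(candidate: Package, recovery_group: Sequence[Package]) -> bool:
    """Return True if starting or enabling the candidate would be allowed."""
    others = [p for p in recovery_group if p is not candidate]

    def active(p: Package) -> bool:
        return p.running or p.enabled

    if candidate.role == "recovery":
        # Blocked by any running or enabled primary, data sender, or data receiver.
        return not any(active(p) for p in others if p.role != "recovery")
    # Primary, data sender, and data receiver packages are blocked
    # by a running or enabled recovery package.
    return not any(active(p) for p in others if p.role == "recovery")


primary = Package("salespkg", "primary", running=True)
recovery = Package("salespkg_bak", "recovery")
group = [primary, recovery]
```

In this example the recovery package cannot be started while salespkg is running, which is the condition that keeps both copies of an application from operating on the same data at once.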

 

© Hewlett-Packard Development Company, L.P.