Designing Disaster Tolerant HA Clusters Using Metrocluster and Continentalclusters: Chapter 2 Designing a Continental Cluster

Understanding Continental Cluster Concepts


The Continentalclusters product provides the ability to monitor a high availability cluster and fail over mission critical applications to another cluster if the monitored cluster should become unavailable. In the following example, the Los Angeles cluster runs the mission critical application and replicates data to the New York cluster, which has another copy of the mission critical application ready to run in case of failover. In addition, Continentalclusters supports mutual recovery, which allows for different critical applications to be run on each cluster, with each cluster configured to recover the mission critical applications of the other.

Because clusters may be separated over wide geographical distances, and because they have independent function, the operation of clusters in a Continentalclusters configuration is somewhat different from that of typical Serviceguard clusters. A typical Continentalclusters recovery pair environment is shown in Figure 2-1 “Sample Continentalclusters Configuration”.

Figure 2-1 Sample Continentalclusters Configuration

Two packages are running on the cluster in Los Angeles, and their data is replicated to the cluster in New York. Physical data replication is carried out using ESCON (Enterprise Systems Connection) links between the disk array hardware in New York and Los Angeles via an ESCON/WAN converter at each end. The New York cluster is running a monitor that checks the status of the Los Angeles cluster. In this example, the Los Angeles cluster runs just like any Serviceguard cluster, with applications configured in packages that may fail over from node to node as necessary. The New York cluster is configured with a recovery version of the packages that are running on the Los Angeles cluster. These packages do not run under normal circumstances, but are set to start up when they are needed. In addition, either cluster may run other packages that are not involved in Continentalclusters operation.

Mutual Recovery Configuration

Bi-directional failover is supported in what is called a “mutual recovery configuration.” This allows recovery groups to be defined for primary packages running in both component clusters of a recovery pair in the Continentalclusters configuration. Figure 2-2 “Sample Mutual Recovery Configuration” shows a mutual recovery configuration.

Figure 2-2 Sample Mutual Recovery Configuration

In the above figure, the salespkg package is running on the New York cluster and can be recovered by the Los Angeles cluster. Similarly, the custpkg package running on the Los Angeles cluster can be recovered by the New York cluster. As stated previously, physical data replication is carried out using ESCON (Enterprise Systems Connection) links between the disk array hardware in New York and Los Angeles via an ESCON/WAN converter at each end. Each cluster is running a monitor that checks the status of the alternate cluster.

As depicted in the above example, each cluster runs just like any Serviceguard cluster, with applications configured in packages that may fail over from node to node as necessary. Each cluster is configured with a recovery version of the packages that are running on the alternate cluster. These packages do not run under normal circumstances, but are set to start up when they are needed. In addition, either cluster may run other packages that are not involved in Continentalclusters operation.

Application Recovery in a Continental Cluster

If a cluster in a recovery pair of a continental cluster becomes unavailable, Continentalclusters allows an administrator to issue a single command, cmrecovercl (described later), to transfer mission critical applications from that cluster to another cluster, making sure that the packages do not run on both clusters at the same time. Transfer is not automatic, although it is automated through a recovery command, which a root user must issue. The result after issuing the recovery command is shown in Figure 2-3 “Continental Cluster After Recovery”.

Figure 2-3 Continental Cluster After Recovery

The movement of an application from one cluster to another cluster does not replace local failover activity; packages are normally configured to fail over from node to node as they would on any high availability cluster. Cluster recovery, failover of packages to a different cluster, occurs only after the following events:

  • Continentalclusters detects the problem

  • Continentalclusters sends a notification of the problem

  • An administrator verifies that the monitored cluster has failed

  • A root user issues the cluster recovery command

Monitoring over a Network

A monitor package running on one cluster tracks the health of another cluster in the recovery pair and sends notification to configured destinations if the state of the monitored cluster changes. (If a cluster contains any packages to be recovered it must be monitored.) The monitor software polls the monitored cluster at a specific MONITOR_INTERVAL defined in an ASCII configuration file, which also indicates when and where to send messages if there is a state change.
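The polling behavior described above can be illustrated with a minimal Python sketch. It is not the monitor's actual implementation: the function name cluster_events is hypothetical, and each list entry simply stands for the state observed at one poll, that is, once per MONITOR_INTERVAL. The state names are the four the monitor reports (Up, Down, Unreachable, Error).

```python
from typing import Iterable, Iterator, Tuple

# The four states the monitor reports.
STATES = {"Up", "Down", "Unreachable", "Error"}


def cluster_events(polled_states: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (old_state, new_state) whenever two consecutive polls differ.

    Hypothetical helper: a notification would be considered only when
    the observed state of the monitored cluster changes.
    """
    previous = None
    for state in polled_states:
        if state not in STATES:
            raise ValueError(f"unknown cluster state: {state}")
        if previous is not None and state != previous:
            yield (previous, state)
        previous = state


# One entry per poll (one poll per MONITOR_INTERVAL).
polls = ["Up", "Up", "Unreachable", "Unreachable", "Down", "Up"]
events = list(cluster_events(polls))
for old, new in events:
    print(f"{old} -> {new}")
# prints:
# Up -> Unreachable
# Unreachable -> Down
# Down -> Up
```

Repeated polls that return the same state produce no event, which is why a long outage appears as a single transition followed by time-based notifications rather than a stream of events.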

The physical separation between clusters will require communication by way of a Local or Wide Area Network (LAN or WAN). Since the polling takes place across the network, interruptions of network service cannot always be differentiated from cluster failure states. This means that if the network is unreliable, the monitoring facility will often detect and report an unreachable state for the monitored cluster that is actually an interruption of the network service.

Because the monitoring is indeterminate in some instances, information from independent sources must be gathered to determine the need for proceeding with the recovery process. For these reasons, cluster recovery is not automatic, but must be initiated by a root user. Once initiated, however, the cluster recovery is automated to reduce the chance of human error that might occur if manual steps were needed. In Continentalclusters, a system of cluster events and notifications is provided so that events can be easily tracked, and users will know when to seek additional information before initiating recovery.

Cluster Events

A cluster event is a change of state in a monitored cluster. The four cluster states reported by the monitor are Unreachable, Down, Up, and Error. Table 2-1 “Monitored States and Possible Causes” summarizes possible causes for the cluster events with regard to both the monitored cluster and the network. However, in many cases the causes of cluster events are indeterminate without additional information that is not available to the software.

Table 2-1 Monitored States and Possible Causes

| Cluster Event (Old state -> New state) | Cluster-related Causes | Network-related Causes |
|---|---|---|
| Up -> Unreachable | Cluster went down; no nodes are responding to network inquiries | Network failure |
| Down -> Unreachable | Cluster was down and nodes are no longer responding | Network failure |
| Error -> Unreachable | Error resolved but cluster down and nodes not responding | Network failure |
| Up -> Down | Cluster has been halted, but at least one node is still responding to network inquiries | No network problems |
| Error -> Down | Error resolved, cluster is down | Network problem was fixed, cluster is down |
| Unreachable -> Down | Cluster nodes were rebooted but the cluster was not started | Network came up but the cluster was not running |
| Up -> Error | Serviceguard version or security file mismatch, software error | Network is misconfigured, or DNS server crashed or set up incorrectly |
| Down -> Error | Serviceguard version or security file mismatch, software error | Network is misconfigured, or DNS server crashed or set up incorrectly |
| Unreachable -> Error | Serviceguard version or security file mismatch, software error | Network problem was fixed, but the error condition still exists |
| Down -> Up | Cluster started | No network problems |
| Unreachable -> Up | Cluster nodes were rebooted and the cluster started | Network came up and the cluster was already running |
| Error -> Up | Error resolved, cluster is up | Network problem was fixed, cluster is up |

NOTE: There is only one condition under which cmclsentryd will determine that the cluster has Error status: all nodes are unreachable except those which have Serviceguard Error status. (If any nodes are Down or Up, then the cluster status will take one of those values, rather than Error.)
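The rule in the note above can be sketched as a small Python function that derives a cluster status from per-node statuses. This is an illustration only, not the cmclsentryd logic. In particular, the note does not say which value wins when some nodes are Up and others are Down, so the precedence of Up over Down here is an assumption, as is the function name cluster_status.

```python
from typing import Iterable


def cluster_status(node_statuses: Iterable[str]) -> str:
    """Derive a cluster status from node statuses (hypothetical helper)."""
    nodes = list(node_statuses)
    if not nodes:
        raise ValueError("at least one node status is required")
    # Assumption: Up takes precedence over Down when both are present.
    if "Up" in nodes:
        return "Up"
    if "Down" in nodes:
        return "Down"
    # No node is Up or Down. Error applies only when every node that is
    # not unreachable reports Error, i.e. at least one Error remains.
    if "Error" in nodes:
        return "Error"
    return "Unreachable"


print(cluster_status(["Unreachable", "Error"]))          # Error
print(cluster_status(["Unreachable", "Down", "Error"]))  # Down
print(cluster_status(["Unreachable", "Unreachable"]))    # Unreachable
```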

Interpreting the Significance of Cluster Events

Because some cluster events (for example, Up -> Unreachable) can be caused by changes in either a cluster state or a network state, additional independent information is required to achieve the primary objective of determining whether you need to recover a cluster’s applications. Sources of independent information include:

  • Contact with the network provider

  • Contact with the administrator of the monitored cluster

  • Contact with the local cluster administrator

  • Contact with company executives

When problematic cluster events persist, obtain as much information as possible, including authorization to recover, if your business practices require this, and then issue the Continentalclusters recovery command, cmrecovercl.

How Notifications Work

A central part of the operation of Continentalclusters is the transmission of notifications following the detection of a cluster event. Notifications occur at specifically coded times, and at two different levels:

  • Alert—when a cluster event should be considered noteworthy.

  • Alarm—when an event shows evidence of a cluster failure.

Notifications are typically sent as:

  • Email messages

  • SNMP traps

  • Text log files

  • OPC messages to OpenView IT/Operations

In addition, notifications are sent to the eventlog file located in the /var/opt/resmon/log/cc directory on the system where monitoring is taking place.

NOTE: An email message can be sent to an address supplied by a pager service that will forward the message to a specified pager system. (Contact your pager service provider for more information.)

Alerts

Alerts are intended as informational. Some typical uses of alerts include:

  • Notification that a cluster has been halted for a significant amount of time.

  • Notification that a cluster has come up after being down or unreachable.

  • Notification that a cluster came down for any reason.

  • Notification that a cluster has been in an unreachable state for a short period of time. An alert is sent in this case as a warning that an alarm might be issued later if the cluster’s state remains unreachable for a longer time.

The expected process in dealing with alerts is to continue watching for additional notifications and to contact individuals at the site of the monitored cluster to see whether problems exist.

Alarms

Alarms are intended to indicate that a cluster failure might have taken place. The most common example of an alarm is the following:

  • Notification that a specified cluster has been in an unreachable state for a significant amount of time.

The expected process in dealing with cluster events that persist at the alarm level is to obtain as much information as possible, including authorization to recover if your business practices require this, and then to issue the Continentalclusters recovery command, cmrecovercl.

Creating Notifications for Failure Events

For events that indicate potential cluster failure, express escalating concern about cluster health by defining alerts followed by one or more alarms. The following is a typical sequence:

  • cluster alert at 5 minutes

  • cluster alert at 10 minutes

  • cluster alarm at 15 minutes

This could be accomplished by entering two CLUSTER_ALERT lines in the configuration file, and one CLUSTER_ALARM line. A detailed example is provided in the comments in the ASCII configuration file template, shown in “Editing Section 3—Monitoring Definitions”.

Creating Notifications for Events that Indicate a Return of Service

For those events that indicate that the cluster is back online or that communication with the monitor has been restored, use cluster alerts to show the de-escalation of concern. In this case, use a CLUSTER_ALERT line in the configuration file with a time of zero (0), so that notifications are sent as soon as the return to service is detected.
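The escalation and de-escalation patterns described in the two sections above can be modeled with a short Python sketch. It does not reproduce the configuration file syntax. The Rule class and the helper names are hypothetical, and the thresholds are the example values from the text (alerts at 5 and 10 minutes, an alarm at 15 minutes, and a zero-minute alert for a return of service).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    level: str          # "ALERT" or "ALARM"
    after_minutes: int  # how long the cluster must remain in the state


# Escalation while the monitored cluster stays Unreachable.
UNREACHABLE_RULES = [Rule("ALERT", 5), Rule("ALERT", 10), Rule("ALARM", 15)]
# De-escalation: a zero-minute alert is due as soon as Up is detected.
UP_RULES = [Rule("ALERT", 0)]


def due_notifications(rules: List[Rule], minutes_in_state: int) -> List[Rule]:
    """Return every rule whose threshold has been reached."""
    return [r for r in rules if r.after_minutes <= minutes_in_state]


def alarm_issued(rules: List[Rule], minutes_in_state: int) -> bool:
    """True once any ALARM-level rule is due."""
    return any(r.level == "ALARM" for r in due_notifications(rules, minutes_in_state))


print(len(due_notifications(UNREACHABLE_RULES, 12)))  # 2 (both alerts)
print(alarm_issued(UNREACHABLE_RULES, 12))            # False
print(alarm_issued(UNREACHABLE_RULES, 15))            # True
print(due_notifications(UP_RULES, 0))                 # the zero-minute alert
```

Spacing the alerts ahead of the alarm gives administrators time to contact the monitored site before a recovery decision becomes possible.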

Performing Cluster Recovery

When a CLUSTER_ALARM is issued, cluster recovery may be needed. Recovery is carried out at the site of the recovery cluster by the root user, using the cmrecovercl command:

# cmrecovercl

Issuing this command will halt any configured data replication activity from the failed cluster to the recovery cluster, and will start all configured recovery packages on the recovery cluster that are pre-configured in recovery groups. A recovery group is the basic unit of recovery used in a continental cluster configuration. This command will fail if a cluster alarm has not been issued.

If option “-g RecoveryGroup” is specified with the recovery command, then the recovery process of halting data replication activity and starting of the recovery package will only be done for the specified recovery group.

After the cmrecovercl command is issued, there is a delay of at least 90 seconds (per recovery group) while the command ensures that the package is not active on another cluster.
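The preconditions and timing described above can be sketched in Python. This models the documented behavior only and does not invoke cmrecovercl. The function names and the recovery group names (salesgrp, custgrp) are hypothetical, and the total delay assumes that recovery groups are processed one after another, which the text does not state explicitly.

```python
from typing import List, Optional

MIN_DELAY_PER_GROUP_S = 90  # "at least 90 seconds (per recovery group)"


def plan_recovery(alarm_issued: bool,
                  configured_groups: List[str],
                  only_group: Optional[str] = None) -> List[str]:
    """Return the recovery groups a recovery run would act on."""
    if not alarm_issued:
        # The command fails if a cluster alarm has not been issued.
        raise RuntimeError("no cluster alarm has been issued; recovery refused")
    if only_group is None:
        return list(configured_groups)
    if only_group not in configured_groups:
        raise ValueError(f"unknown recovery group: {only_group}")
    # Corresponds to the "-g RecoveryGroup" option.
    return [only_group]


def minimum_delay_seconds(groups: List[str]) -> int:
    """Lower bound on the delay, assuming groups are handled in sequence."""
    return MIN_DELAY_PER_GROUP_S * len(groups)


groups = plan_recovery(True, ["salesgrp", "custgrp"])
print(groups, minimum_delay_seconds(groups))  # ['salesgrp', 'custgrp'] 180

single = plan_recovery(True, ["salesgrp", "custgrp"], only_group="custgrp")
print(single, minimum_delay_seconds(single))  # ['custgrp'] 90
```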

Cluster recovery is done as a last resort, after all other approaches to restore the unavailable cluster have been exhausted. It is important to remember that cluster recovery sets in motion a process that cannot be easily reversed. Unlike the failover of a package from one node to another, failing a package from one cluster to another normally involves a significant quantity of data that is being accessed from a new set of disks. Returning control to the original cluster will involve resynchronizing this data and resetting the roles of the clusters in a process that is easier for some data replication techniques than others.

NOTE: After a recovery, it is not possible to reverse directions and return a package to its original cluster without first reconfiguring the data replication hardware and/or software and synchronizing data. Therefore, be very cautious when deciding to use the cmrecovercl command. For this reason, HP recommends that stringent procedures and processes be in place to aid in making the decision to complete a recovery process.

Notes on Packages in a Continental Cluster

Packages behave somewhat differently in a continental cluster than in a normal Serviceguard environment. There are specific differences in:

  • Startup and Switching Characteristics

  • Network Attributes

With Serviceguard A.11.17 and later, you can configure the following package types in a recovery group:

  • Failover

  • Oracle RAC Multi-node packages

In the case of a multi-node package, a recovery process recovers all instances of the package in a recovery cluster.

NOTE: System multi-node packages cannot be configured in Continentalclusters recovery groups. Multi-node packages are supported only for Oracle with CFS or CVM environments.

Startup and Switching Characteristics

Normally, an application (package) can run on only one node at a time in a cluster. However, in a continental cluster, there are two clusters in which an application (the primary package or the recovery package) could operate on the same data. The primary package and the recovery package must never run at the same time. To prevent this, it is important to ensure that packages are not allowed to start automatically and are not started at inappropriate times.

To keep packages from starting automatically when a cluster starts, set the AUTO_RUN parameter (PKG_SWITCHING_ENABLED prior to Serviceguard A.11.12) to NO for all primary and recovery packages. Then use the cmmodpkg command with the -e <packagename> option to start up only the primary packages and enable switching. The cmrecovercl command, when run, will start up the recovery packages and enable switching during the cluster recovery operation.

CAUTION: After initial testing is complete, the cmrunpkg and cmmodpkg commands or the equivalent options in SAM should never be used to start a recovery package unless cluster recovery has already taken place.

To prevent packages from being started at the wrong time and in the wrong place, use the following strategies:

  • Set the AUTO_RUN parameter (PKG_SWITCHING_ENABLED prior to Serviceguard A.11.12) to NO for all primary and recovery packages.

  • Ensure that recovery package names are well known, and that personnel understand they should never be started with a cmrunpkg or cmmodpkg command unless the cmrecovercl command has been invoked first.

  • If a cluster has no packages to run before recovery, then do not allow packages to be run on that cluster with Serviceguard Manager.
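The first strategy above lends itself to an automated check. The following Python sketch flags packages whose AUTO_RUN value is not NO. It assumes that each package configuration is available as text containing a whitespace-separated line of the form "AUTO_RUN <value>" with "#" starting a comment. That layout, the helper names, and the package name salespkg_bak are assumptions made for illustration, so adapt the parsing to your actual package configuration files.

```python
from typing import Dict, List


def auto_run_value(config_text: str) -> str:
    """Return the AUTO_RUN value found in the configuration text, uppercased."""
    for line in config_text.splitlines():
        stripped = line.split("#", 1)[0].strip()  # drop trailing comments
        parts = stripped.split()
        if len(parts) == 2 and parts[0] == "AUTO_RUN":
            return parts[1].upper()
    raise ValueError("AUTO_RUN parameter not found")


def packages_with_auto_run_enabled(configs: Dict[str, str]) -> List[str]:
    """List packages (primary or recovery) whose AUTO_RUN is not NO."""
    return sorted(name for name, text in configs.items()
                  if auto_run_value(text) != "NO")


configs = {
    "salespkg": "PACKAGE_NAME salespkg\nAUTO_RUN NO\n",
    # Hypothetical recovery package that was left enabled by mistake.
    "salespkg_bak": "PACKAGE_NAME salespkg_bak\nAUTO_RUN YES  # should be NO\n",
}
print(packages_with_auto_run_enabled(configs))  # ['salespkg_bak']
```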

Network Attributes

Another important difference between the packages configured in a continental cluster and the packages configured in a standard Serviceguard cluster is that the same or different subnets can be used for primary cluster and recovery cluster configurations. In addition, the same or different relocatable IP addresses can be used for the primary package and its corresponding recovery package. The client application must be designed properly to connect to the appropriate IP address following a recovery operation.

How Serviceguard Commands Work in Continentalclusters

Continentalclusters packages are manipulated manually by the user via Serviceguard commands and by cmcld automatically in the same way as any other packages.

In a continental cluster, a recovery package is not allowed to run at the same time as the primary, data sender, or data receiver packages. To enforce this, several Serviceguard commands behave in a slightly different manner when used in a continental cluster.

Table 2-2 “Serviceguard and Continentalclusters Commands” describes the Serviceguard commands whose behavior is different in a continental cluster environment. Specifically, when one of the commands listed in Table 2-2 “Serviceguard and Continentalclusters Commands” attempts to start or enable switching of a package, it first checks the status of the other packages in the recovery group. Based on the status, the operation is either allowed or disallowed.

The check assumes that both clusters are stable and that network communication between them is working properly. If network communication between the clusters cannot be established, or the cluster or package status cannot be determined, you must check manually that the operation to be performed on the target package will not conflict with other packages configured in the same recovery group.

Table 2-2 Serviceguard and Continentalclusters Commands

| Commands | How the commands work in Serviceguard | How the commands work in Continentalclusters |
|---|---|---|
| cmrunpkg | Runs a package | Will not start a recovery package if any of the primary, data receiver, or data sender packages in the same recovery group is running or enabled. Will not start a primary, data receiver, or data sender package if the recovery package in the same recovery group is running or enabled. |
| cmmodpkg -e | Enables the switching attribute for a highly available package | Will not enable switching on a recovery package if any of the primary, data receiver, or data sender packages in the same recovery group is running or enabled. Will not enable a primary, data receiver, or data sender package if the recovery package in the same recovery group is running or enabled. |
| cmhaltnode -f | Halts a node in a highly available cluster | Will not re-enable switching on a recovery package if any of the primary, data receiver, or data sender packages in the same recovery group is running or enabled. Will not re-enable a primary, data receiver, or data sender package if the recovery package in the same recovery group is running or enabled. |
| cmhaltcl -f | Halts daemons on all currently running systems | Will not re-enable switching on a recovery package if any of the primary, data receiver, or data sender packages in the same recovery group is running or enabled. Will not re-enable a primary, data receiver, or data sender package if the recovery package in the same recovery group is running or enabled. |
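The rule shared by the commands in Table 2-2 can be sketched as a Python predicate. This illustrates the documented check and is not Serviceguard code. The PackageStatus class, the role strings, and the function names are hypothetical, and the sketch covers only the case where the status of every package in the recovery group is known.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PackageStatus:
    name: str
    role: str  # "primary", "recovery", "data_sender", or "data_receiver"
    running: bool
    switching_enabled: bool


def _active(pkg: PackageStatus) -> bool:
    """A package counts as active if it is running or enabled."""
    return pkg.running or pkg.switching_enabled


def may_start(target: PackageStatus, group: Iterable[PackageStatus]) -> bool:
    """Decide whether starting or enabling `target` is allowed."""
    others = [p for p in group if p.name != target.name]
    if target.role == "recovery":
        # Blocked while any primary, data sender, or data receiver is active.
        return not any(_active(p) for p in others if p.role != "recovery")
    # Blocked while the recovery package in the same group is active.
    return not any(_active(p) for p in others if p.role == "recovery")


primary = PackageStatus("salespkg", "primary", running=True, switching_enabled=True)
recovery = PackageStatus("salespkg_bak", "recovery", running=False, switching_enabled=False)
print(may_start(recovery, [primary, recovery]))  # False: the primary is running
print(may_start(primary, [primary, recovery]))   # True: the recovery package is idle
```

When a package status cannot be determined, for example because the clusters cannot communicate, no such predicate can give a safe answer, which is why the text calls for a manual check in that case.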

 

© Hewlett-Packard Development Company, L.P.