These release notes cover the December 2000 (IPR 0012) release of Support Plus for HP-UX 11.00/10.20 running on S800/S700 systems.
- Overview
- Configuring Hardware Monitoring
- Documentation
- Changes
- Known Problems
- Monitors Provided
- Monitor Dependencies
- Defect Reporting
- SD Product Structure
NOTE: As of the September 1999 release, the name of the Diagnostic/IPR Media has been changed to Support Plus. In addition, the format has changed so that there is a separate CD-ROM for each version of the operating system (HP-UX 10.20 and HP-UX 11.0).
Included on the Support Plus CD-ROM are the EMS Hardware Monitors - an important tool for maintaining system availability. The EMS hardware monitors allow you to monitor the operation of a wide variety of hardware products and be alerted immediately if any failure or other unusual event occurs. Hardware event monitoring is available to users running HP-UX 10.20 or 11.X (IPR 9902 and later).
Hardware event monitoring provides a high level of protection against system hardware failure. By using hardware event monitoring, you can virtually eliminate undetected hardware failures that could interrupt system operation or cause data loss.
Configuring Hardware Monitoring
The EMS Hardware Monitors are installed at the same time as the Support Tools Manager. Once the monitoring software is installed, monitoring is automatically enabled.
By default, messages regarding major warning, serious and critical events that occur on hardware being monitored will be:
- Written to /var/adm/syslog/syslog.log
- Sent to EMAIL address root
All events will be stored in /var/opt/resmon/log/event.log.
To configure, enable, or disable hardware event monitoring, run the monitoring request manager: /etc/opt/resmon/lbin/monconfig .
The Peripheral Status Monitor (PSM) and the Kernel Resource Monitor (krmond) are configured differently: they use the EMS GUI. See: http://docs.hp.com/hpux/onlinedocs/diag/ems/ems_gui.htm
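The default routing described above can be pictured with a short sketch. This is only an illustration, not how the monitors are implemented: the function name and the "email:" prefix are made up for the example, and the ascending order of the severity names is an assumption (the notes name the levels but do not rank them explicitly).

```python
# Severity names in ascending order. The ordering is an assumption; the
# release notes only name the levels.
SEVERITIES = ["INFORMATION", "MINOR_WARNING", "MAJOR_WARNING", "SERIOUS", "CRITICAL"]

# Default destinations as stated in the release notes.
EVENT_LOG = "/var/opt/resmon/log/event.log"
SYSLOG = "/var/adm/syslog/syslog.log"
EMAIL = "root"


def default_destinations(severity):
    """Return where an event of the given severity goes under the defaults.

    Hypothetical helper for illustration only.
    """
    rank = SEVERITIES.index(severity)  # raises ValueError for unknown names
    destinations = [EVENT_LOG]  # every event is stored in event.log
    if rank >= SEVERITIES.index("MAJOR_WARNING"):
        # Major warning, serious and critical events are also written to
        # syslog and mailed to root.
        destinations += [SYSLOG, "email:" + EMAIL]
    return destinations
```

With these defaults, an INFORMATION event only reaches event.log, while a SERIOUS event reaches all three destinations. Any change made through monconfig would replace this routing.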
For the latest and most complete information on EMS Hardware Monitors and the Support Tools Manager (STM), see the Web page "Diagnostics":
http://docs.hp.com/hpux/diag/
At this site, you will find Overviews, Tutorials, Quick Reference Cards, Frequently Asked Questions (FAQs), and much other material.
For complete information on installing and using EMS hardware monitors, as well as a list of supported hardware, refer to the "EMS Hardware Monitors User's Guide" available at the above site. An electronic copy of this book is also included on the Support Plus CD-ROM in the <mount_point>/DIAGNOSTICS directory.
Changes in the EMS Hardware Monitors for the December 2000 (IPR 0012) release include:
- Changes to Multiple Monitors
- Changes to Individual Monitors
- Changes to Platform and Interface
- Customer-Visible Interface Changes
- Added Multiple-view ("Predictive-enabled") support to the following monitors:
- Disk Monitor (disk_em)
- High Availability Disk Array Monitor (ha_disk_array)
- Updated some events reported by the following monitors so that their severity levels now match the equivalent events in the SCSI library.
- Disk Monitor (disk_em)
- Fibre Channel Adapters Monitor (dm_FCMS_adapter)
- Fibre Channel Adapter Model A5158 Monitor (dm_TL_adapter)
- High Availability Disk Array Monitor (ha_disk_array)
- (JAGad25572) Renumbered events for the following monitors:
- SCSI Tape Devices (dm_stape)
- High Availability Disk Array (ha_disk_array)
- Fibre Channel Adapters (dm_FCMS_adapter)
- Fibre Channel Adapter Model A5158 (dm_TL_adapter)
In the client configuration files (*.clcfg) for these monitors, renumbered events 100055, 100088 and 100299 to 100050, 100168 and 100999. Previously, clcfg files for these monitors were missing entries for events 100050, 100168 and 100999. With this omission, these events would not be properly qualified when they occurred.
- Updated the SCSI default events so that events 100091 and 100772 ( POWER ON / BUS RESET ) now have a Severity Level of MAJOR WARNING rather than SERIOUS. The versions of these events for TAPE devices remain with a Severity Level of INFORMATION.
- Fixed two problems that may occur on multiple monitors:
- (JAGad30207) When a monitor request is unregistered, random processes may be killed, if the diaglogd process cannot be found.
- (JAGad30648) Monitors don't log errors and behave poorly when boot time is not available. The symptoms of this problem are that none of the EMS HW monitors with hardware paths as instances will show any resource instances. In addition, they will log errors in the api.log indicating "EMS error in message loop" at startup and any time they are enabled.
Changes to Individual Monitors
- Disk Monitor (disk_em):
- (JAGad23856) Fixed a problem whereby disk_em aborts with SIGSEGV after the hardware is changed and 'ioscan' is performed.
- (JAGad26216) Disk monitor event #6 is now generated for SMART devices. Previously, this event was just generated for devices with no SMART implementation.
- (JAGad23856) Fixed a problem whereby the monitor aborts with SIGSEGV on a system that has a TOSHIBA XM-5401 CD-ROM.
- (JAGad26216) The disk monitor now checks for defect data regardless of whether the drive supports the log sense command. Previously, the disk monitor didn't check for defect data if the drive supported the log_sense command.
- (JAGad26177) Fixed a problem whereby, after a drive is pulled out from a A3311A/A3312A High Availability Storage System ("Jamaica" box), the disk monitor marks all devices down, even though 'ioscan -kf' shows the other devices to be accessible.
- FC60 Disk Array Monitor (fc60mon): (JAGad30655) Enhanced monitor to provide additional information in the MEL event log output. The MEL event log output now includes the Field Replaceable Unit (FRU) and LUN information if it is available (for some events this information is not available). The help files were also changed to reflect this change.
- Fibre Channel Adapter Model A5158 (dm_TL_adapter). Updated for changes in the driver.
- High Availability Disk Array (ha_disk_array): changed the default severity level for several events. Events 3, 20, 27-30, 35-39, 44-51, 60, 64, and 67 are increased in severity level from INFORMATION to MAJOR_WARNING.
- LPMC Monitor (lpmc_em): added support for A5191A ("Rhapsody Wave 2"): L1000/540, L2000/540.
- Kernel Resource Monitor (krmond): updated to version A.11.00.02 to fix several install problems which occurred only if the nflock or ncallout monitors are enabled. Errors were reported through EMS. There have been no customer reports about these problems. The workaround has been to run swconfig on the EMS-KRMonitor package.
- RemoteMonitor:
- Added events 243-245 and 2001-2026 to the Remote Monitor. Removed a conflict with test event 103.
- (JAGad24380) Allowed for manual enablement on a per-device-type basis. If all device types are disabled the monitor will shut down.
- SCSI Tape Devices Monitor (dm_stape):
- Modified the monitor so that it reads the dm_stape.cfg file to determine which, if any, errors to insert while the program is running.
- Added Product IDs for C7369A Tape drive ("Ultrium LTO") and C7483A Tape drive ("Benchmark") to list of supported devices. The Product IDs are "Ultrium 1-SCSI" and "DLT1" respectively.
- (JAGad29068) Increased the suppression time for Event 201 from 1440 minutes (1 day) to 10080 minutes (1 week). Changed severity for Event 201 from CRITICAL to MAJOR_WARNING.
- (JAGad29066) Reduced the severity of several events to comply with requests from Predictive Support:
Event:  IPR0009 Severity:  IPR0012 Severity:
======  =================  =================
20      SERIOUS            MAJOR_WARNING
22      SERIOUS            MAJOR_WARNING
23      SERIOUS            MAJOR_WARNING
30      CRITICAL           SERIOUS
31      CRITICAL           SERIOUS
33      CRITICAL           MAJOR_WARNING
38      SERIOUS            MAJOR_WARNING
40      CRITICAL           MINOR_WARNING
42      MAJOR_WARNING      MINOR_WARNING
43      SERIOUS            MAJOR_WARNING
44      SERIOUS            MINOR_WARNING
45      SERIOUS            MAJOR_WARNING
201     CRITICAL           MAJOR_WARNING
203     CRITICAL           MAJOR_WARNING
204     CRITICAL           MAJOR_WARNING
209     SERIOUS            MAJOR_WARNING
210     SERIOUS            MAJOR_WARNING
216     CRITICAL           SERIOUS
217     CRITICAL           MAJOR_WARNING
218     CRITICAL           MAJOR_WARNING
230     SERIOUS            MAJOR_WARNING
901     SERIOUS            MAJOR_WARNING
- UPS Monitor (dm_ups): (JAGad26654) The severity level for the following events was too high, and was lowered from either CRITICAL or SERIOUS to INFORMATION: 3, 4, 13, 15, 17, 19, 22, 25, 27, 29, 31, 33, 35, 38, 41.
Background: Many of the events for the dm_ups monitor occur in pairs. For a given catastrophic event such as "high UPS ambient temperature shutdown", another event such as "high ambient temperature shutdown CLEAR" occurs when the problem is resolved. All the clear events were originally classified the same as the pair. If the "high ambient temperature shutdown" event was classified as "CRITICAL", then the clear event was also classified as CRITICAL. This was incorrect and all clear events were changed to INFORMATION severity.
- Core Hardware Monitor (dm_core_hw). Events 33 and 34 (for overtemp) are no longer suppressed. Whenever the hardware reports a transition to one of the two overtemp states that can be detected, event 33 or 34 will be generated.
Detailed description: When the core hardware monitor detects an overtemp situation, it generates an event 33. If things get warmer still, it generates an event 34. (Typically event 34 is never actually generated because by default, the envd config file is set up to shut down the system when this condition occurs.)
If either of these conditions occur, and then the temperature returns to normal, the system will wait for 15 minutes (programmable in the core hardware monitor config file) before generating an event 35, saying that the temperature has returned to normal. The idea was to "maintain a state of wariness," so to speak, until we had a way to know that the temperature was normal for long enough that we felt that the danger had subsided.
This is further complicated by the fact that, on N-Class and newer systems, we don't get a status which tells us that the temperature has returned to normal, and therefore never generate event 35.
If event 33 and/or 34 are generated, and then the temperature returns to normal, and then the system gets warm again, the hardware would tell us that we should generate another event 33. The way things were, we didn't generate an event 33 because it was within the suppression time. Since it seems likely that the hardware is reporting a new condition, and since time to fix an overtemp situation is critical, we decided to report all overtemp conditions.
- Core Hardware Monitor (dm_core_hw). Events indicating fan/power supply failure have been separated from events which indicate whether or not sufficient fans/power supplies are functioning. These changes were made in the 11i release and have been carried over into the Dec 00 release.
The following dm_core_hw events are no longer used; they have been replaced by other events:
Error description                      Cabinet fans  I/O fans  Cabinet power  I/O power
                                       (blowers)               supplies       supplies
=====================================  ============  ========  =============  =========
Component failed; sufficient left      5             10        15             20
Component failed; NOT sufficient left  6             11        16             21
Sufficient # of components installed   9             14        19             24
The dm_core_hw events in the following table replace the events in the previous table.
Error description      Severity  Cabinet fans  I/O fans  Cabinet power  I/O power  Backplane
                                 (blowers)               supplies       supplies   power boards
=====================  ========  ============  ========  =============  =========  ============
Component failed       Serious   39            43        47             51         55
All working            Info      08            13        18             23         59
N-1                    Critical  07            12        17             22         60
N, configured for N+1  Serious   40            44        48             52         56
N, configured for N    Info      41            45        49             53         57
N+1                    Info      42            46        50             54         58
In the table, "N" stands for the number of a component (like a fan) which is required for the system to operate normally. N+ means that you can lose one and still be OK. Some configurations of some systems also have N-, which technically means that you don't have enough to run normally. For example, some systems can lose a power supply and go to N-. It will keep on running (unless the power supply overheats), but it won't boot in this configuration.
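For scripts that act on dm_core_hw event numbers, the replacement table above can be expressed as a lookup. The following is a minimal sketch under the assumption that a script receives the bare event number; the names COMPONENTS, ROWS, RETIRED and describe are hypothetical and exist only for this example. The numbers and severities are copied from the two tables.

```python
# Column order of the replacement table.
COMPONENTS = [
    "Cabinet fans (blowers)",
    "I/O fans",
    "Cabinet power supplies",
    "I/O power supplies",
    "Backplane power boards",
]

# (error description, severity, event numbers in COMPONENTS order)
ROWS = [
    ("Component failed", "Serious", [39, 43, 47, 51, 55]),
    ("All working", "Info", [8, 13, 18, 23, 59]),
    ("N-1", "Critical", [7, 12, 17, 22, 60]),
    ("N, configured for N+1", "Serious", [40, 44, 48, 52, 56]),
    ("N, configured for N", "Info", [41, 45, 49, 53, 57]),
    ("N+1", "Info", [42, 46, 50, 54, 58]),
]

# Event numbers listed in the notes as no longer used.
RETIRED = {5, 6, 9, 10, 11, 14, 15, 16, 19, 20, 21, 24}

# Flatten the table into: event number -> (component, description, severity)
EVENTS = {
    number: (component, description, severity)
    for description, severity, numbers in ROWS
    for component, number in zip(COMPONENTS, numbers)
}


def describe(event):
    """Return the table entry for an event, "retired", or None if unlisted."""
    if event in RETIRED:
        return "retired"
    return EVENTS.get(event)
```

For example, describe(47) returns the cabinet power supply "Component failed" entry, and describe(6) reports that the event number is retired. Event numbers outside both tables (such as the overtemp events 33 to 35) return None.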
Changes to Platform and Interface
- Fixed a problem with diaglogd. Symptoms: a failure to get events from the monitor for problems logged from the OS and a continuously growing diaglogd_hold_list file in /var/stm/data .
- (JAGad25912) Fixed problem whereby monconfig would core-dump when the user requested help at the prompt for entering a monitor request entry number. Monconfig would core-dump if the user selected delete or modify a monitoring request and help was requested at the prompt requesting the user to select the entry to delete or modify. Monconfig was modified to return to the prompt for the user to select the entry, rather than core dump.
- Fixed problem whereby the toggle_switch process would hang during startup or shutdown for monitors which did not behave properly. The problem is rare; only one case has been reported. It occurred when a monitor started properly and accepted configuration, but then, for some reason, stopped responding to EMS. Some monitors behave this way if one shuts down diagnostics without shutting down monitoring first. What users see depends on how they performed the "shutdown".
- If they did the "shutdown" from monconfig, then monconfig would hang at the "This might take a while..." display. If they exited monconfig (by pressing Ctrl-C), they would see toggle_switch running forever.
- If they did the "shutdown" by calling toggle_switch directly, they would hang forever.
- In both cases, depending on how far toggle_switch got before it hung, they would see some monitors still running as well.
- Fixes to behavior of psmctd when monitor instances change and hardware configuration doesn't, and when monitor requests the state of a device to be set to UP when it doesn't exist.
- Fixed diagmond to send a signal to psmctd daemon whenever it runs startmon_client program to re-create monitoring requests to cause psmctd to also re-create its list of state instances and monitoring requests. When hardware was added or removed, diagmond would signal psmctd. However, when startmon_client was run due to user request from monconfig, or from a monitor that determined that its instances changed (sysstat_em, fc60mon, armmon), diagmond was not signaling psmctd, it just tried to start psmctd if it wasn't started. Thus, the list of state instances displayed from set_fixed -l would not match those displayed from monconfig.
- Fixed psmctd to not allow a monitor to set the state of a resource instance to UP if that resource instance is one that was removed from the system. The set_fixed command was previously modified to not allow the user to set the state to UP. This change is an enhancement to ensure that monitors that control their own state, rather than depending on the set_fixed command, behave in the same way.
- Fixed psmctd to leave a resource instance that was removed from the system marked as an instance, so when it is added back in, it is still considered an instance. If a resource instance was removed, psmctd would mark it DOWN and indicate it no longer exists and clear the instance flag. At this point, the instance would no longer be displayed in set_fixed -l. When the instance was restored, psmctd would mark the instance as existing on the system, but failed to set the flag indicating it was an instance, so it still would not be displayed in set_fixed -l, and thus could not be set to the UP state.
- Fixed two problems with startmon_client:
- Fixed time window where there would be no active monitoring requests. Improved error logging.
- Fixed startmon_client to not create a monitoring request for a dummy entry in the sapcfg file.
Complete explanation: startmon_client used to remove all the EMS HW Monitor monitoring requests and then re-create the new set. This left a window where a monitor could generate an event and it would not be forwarded on to the user. The code was modified to leave the old monitoring requests active, while it created new ones and then remove any monitoring requests that were not recreated.
Some of the error messages generated for communication problems with EMS were unclear and some were completely wrong. These error messages were corrected.
startmon_client used to create monitoring requests for dummy entries added to the sapcfg files by monconfig as a placeholder when all the monitoring requests were removed. startmon_client was modified to ignore these entries.
See "Customer-Visible Interface Changes" below.
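The startmon_client change described above amounts to creating the new requests before removing the old ones, so that at least the unchanged requests stay active throughout. The following is a minimal sketch of that ordering, not the startmon_client code itself: the function name and the use of plain sets of request identifiers are assumptions made for illustration.

```python
def recreate_requests(active, desired):
    """Replace monitoring requests without leaving a gap in coverage.

    active  -- set of request identifiers currently registered (updated in place)
    desired -- set of request identifiers that should exist afterwards
    Returns the list of (action, request) steps taken, in order.
    Hypothetical helper for illustration only.
    """
    steps = []
    # 1. Create the new requests while the old ones remain active.
    for request in sorted(desired - active):
        active.add(request)
        steps.append(("create", request))
    # 2. Only then remove the requests that were not re-created.
    for request in sorted(active - desired):
        active.remove(request)
        steps.append(("remove", request))
    return steps
```

The earlier behavior corresponds to running step 2 against every request first and step 1 afterwards, which is what left the window in which an event could be generated with no request to forward it.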
Customer-Visible Interface Changes
This section reports changes to the customer-visible interface in this release. This information is provided for the benefit of customers who use scripts to drive hardware support tools or to examine their output.
CHANGES: The following error messages, which would be logged into /etc/opt/resmon/log/api.log, were modified:
monitor 'XXX; times out waiting for resource list reply: No such file or directory
changed to
Timed out waiting for resource list reply for monitor XXXX
Error in get_first_config function: No such file or directory
changed to
Error in get_first_config function.
Error in send_monitor_request function: No such file or directory
changed to
Error in send_monitor_request function.
Error in receive_monitor_reply function: No such file or directory
changed to
Error in receive_monitor_reply function for instance XXXX.
Error in set_monitor_request function: No such file or directory
changed to
Error in set_monitor_request function.
monitor timed out waiting for monitor reply
changed to
Timed out waiting for monitor reply.
Added:
Error in rm_get RmRequestID for monitor reply: XXXXX
When all the monitoring requests for a monitor were removed using monconfig, it would add a dummy entry. This entry would be displayed by the C)heck monitoring requests option as:
Events <= 1 (INFORMATION) Goto TEXTLOG; file=//dev///null
This dummy entry is no longer displayed. When all the monitoring requests for a monitor are removed, the C)heck monitoring requests option now displays:
There are no monitoring requests.
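Scripts that scan api.log may need to recognize both the old and the new wording of the messages listed above. The following is a rough sketch of one way to do that; the labels and the function name classify are invented for this example, and the patterns are derived only from the message texts quoted in these notes.

```python
import re

# Matches the four "Error in ... function" messages in old or new form.
FUNCTION_ERROR = re.compile(
    r"Error in (get_first_config|send_monitor_request|"
    r"receive_monitor_reply|set_monitor_request) function"
)

# (pattern covering old and new wording, hypothetical label)
SIMPLE = [
    (re.compile(r"[Tt]ime[sd] out waiting for resource list reply"),
     "resource_list_timeout"),
    (re.compile(r"[Tt]imed out waiting for monitor reply"),
     "monitor_reply_timeout"),
    (re.compile(r"Error in rm_get RmRequestID for monitor reply"),
     "rm_get_error"),
]


def classify(line):
    """Return a label for a known api.log message, or None if unrecognized."""
    match = FUNCTION_ERROR.search(line)
    if match:
        return "error_in_" + match.group(1)
    for pattern, label in SIMPLE:
        if pattern.search(line):
            return label
    return None
```

Matching on the stable part of each message, rather than on the full line, keeps such a script working across releases where only the trailing text (such as ": No such file or directory") changed.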
CAUTION: Kernel Resource Monitor (krmond) Not Correctly Installed Over Network (HP-UX 11.00 Only)
The Kernel Resource Monitor (krmond) will not be correctly installed if diagnostics are installed using Ignite-UX when booted over the network and installing from a depot. However, the process will work to Ignite the KRM product from an archive.
If you do try to install the EMS-KRMonitor product using Ignite-UX and see errors, the KRM product will not run, but nothing else will be affected.
(Within the install process, the Kernel Resource Monitor is known as the EMS-KRMonitor product.)
Affected Configurations: This problem only occurs on the Dec 2000 release of the diagnostics for HP-UX 11.00. It only occurs using Ignite-UX when booted over the network. The problem does NOT occur if the diagnostics are installed directly from a Support Plus CD-ROM or from an OnlineDiag depot downloaded from the HP Software Depot website.
Symptoms: Two errors will probably appear in the install log (swagent.log):
ERROR: Cannot install a dlkm driver.
and
ERROR: Cannot configure a dlkm driver.
Additionally, the Kernel Resource Monitor will not run.
Workaround: Due to these install problems, the EMS-KRMonitor product should be excluded from any depots that are constructed for the purpose of igniting other systems.
Reinstalling EMS-KRMonitor, outside of an Ignite-UX session, is the simplest way to get the KRM product in a usable state.
The swinstall command for installing this product from the 11.00 Support Plus depot must include the correct options and depot reference:
swinstall -x reinstall=true \
    -s /cdrom/DIAGNOSTICS/B.11.00 EMS-KRMonitor
(The \ character permits cut-and-paste of the command line.) The depot location assumes the 11.00 Support Plus CD mounted to the /cdrom directory.
Background: The Kernel Resource Monitor is designed to monitor a variety of HP-UX resources (e.g., nproc or nfile), so that system administrators are informed of problems before the system panics or performance is affected. For more information, see the man page on krmond(1M).
Normally, the Kernel Resource Monitor is automatically installed when the diagnostics are installed (that is, when the OnlineDiag bundle is installed via swinstall).
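To check an install log for the two symptoms listed above, a script could search the log text for the quoted messages. The following is a minimal sketch: the function name is hypothetical, the caller is assumed to supply the contents of swagent.log (these notes do not give its location), and matching on the message text without the "ERROR:" prefix is a choice made here to tolerate spacing differences.

```python
# Message texts quoted in the Symptoms paragraph, without the "ERROR:" prefix.
SYMPTOMS = (
    "Cannot install a dlkm driver.",
    "Cannot configure a dlkm driver.",
)


def krm_install_errors(log_text):
    """Return the symptom messages found in the given swagent.log text.

    Hypothetical helper for illustration only.
    """
    return [message for message in SYMPTOMS if message in log_text]
```

An empty result does not prove that the monitor is running; it only means that neither message appears in the text that was checked.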
CAUTION: Monitoring Changes for disc30, sdisk and disk array devices
As of IPR 9902 (Feb 99 release), there has been a change to the way that monitoring is done for disc30, sdisk and the HA Disk Array Models 10, 20, and 30FC.
Formerly, the "diaglogd exec" programs (pdisc30_exec, pharaymon_exec, and psdisk_exec) handled driver error entries for these devices.
As of IPR 9902, these programs have been deleted and their functionality is now provided by the EMS Hardware Monitors.
If you had customized the configuration files for the diaglogd exec programs (disk30_exec.cfg, sdisk_exec.cfg, and haraymon_exec.cfg) you may wish to re-configure the EMS Hardware Monitors to achieve the same results.
CAUTION: Compatibility Problem with EMS-Related Products (ServiceGuard, HA Monitors, etc.)
If you install the OnlineDiag bundle (Dec 99 or later) onto a computer running older revisions of EMS-related products, these products may experience compatibility problems. Affected products include MC/ServiceGuard, ServiceGuard OPS Edition and High Availability Monitors. The only critical problems occur with the following versions:
MC/ServiceGuard           A.10.10, A.11.01, A.11.03
ServiceGuard OPS Edition  A.11.02, A.11.03
Support Tools and the EMS hardware monitors are not affected. For complete information, see EMS Incompatibility Problem.
Monitors are provided to support the following:
- AutoRAID Disk Array (armmon)
- Core Hardware (dm_core_hw)
- Disk (disk_em)
- Disk Array FC60 (fc60mon)
- Fast Wide SCSI Disk Array (fw_disk_array)
- Fibre Channel Adapters (dm_FCMS_adapter)
- Fibre Channel Adapter Model A5158 (dm_TL_adapter)
- Fibre Channel Arbitrated Loop Hub (dm_fc_hub)
- Fibre Channel SCSI Multiplexer (dm_fc_scsi_mux)
- Fibre Channel Switch (dm_fc_sw)
- High Availability Disk Array (ha_disk_array)
- High Availability Storage System (dm_ses_enclosure)
- Kernel Resource (krmond)
- LPMC (lpmc_em)
- Memory (dm_memory)
- Remote (RemoteMonitor)
- SCSI Card (scsi123_em)
- SCSI Tape Devices (dm_stape)
- System Status (sysstat_em)
- UPS (dm_ups)
In addition, a Hardware status monitor is provided to monitor the current status of the products supported by the above list.
For detailed information concerning which products are supported by which monitors and additional dependencies, check the "Diagnostics" section of Hewlett-Packard's online documentation web site: http://docs.hp.com/hpux/diag/ .
Several of the monitors have special requirements, such as patches or certain versions of firmware. In particular:
- The Fibre Channel Arbitrated Loop Hub Monitor and the Fibre Channel Switch Monitor require special configuration which is described in their data sheets in the "EMS Hardware Monitors User's Guide" (chapter 6). A patch is also required.
- A patch is required if your system includes an HP SureStore E Disk Array FC60. This patch is required to run the EMS hardware monitor (fc60mon) or STM tools for this device.
For a list of the current required patches, see the DIAGNOSTIC.readme file for this release.
Current monitor requirements are described in the "Supported Products" page under "EMS Hardware Monitors" at http://docs.hp.com/hpux/diag . Requirements are also listed in chapter 2 of the manual "EMS Hardware Monitors User's Guide".
Use CHART to report defects in the EMS Hardware monitors. The project name is diag.hw_mon.hpux. If you don't have access to CHART, contact an HP representative to enter a defect for you.
The EMS hardware monitors are installed as part of the OnlineDiag bundle (product number B4708AA). In addition, they utilize the EMS framework, product number B7609BA.
Note: EMS Hardware Monitors are installed as part of the STM-UUT-RUN Fileset. However, the EMS Hardware Monitors are dependent on the EMS-Core and EMS-Config products and additional filesets in the Sup-Tool-Mgr Product.
For information on the STM product, refer to the STM release notes file /usr/sbin/stm/Rel_NOTES.STM.
SD Bundle: OnlineDiag  Description: On-line Diagnostic System (Series 800/700)
  SD PRODUCT: Sup-Tool-Mgr  Description: Support Tools Manager for HP-UX Systems
    SD SUB-PRODUCT: Manuals  Description: Support Tools Manager Manual Pages
      FILESET: RELEASE_NOTES  Description: HPUX STM Release Notes
      FILESET: STM-MAN  Description: HPUX STM Manual Pages
    SD SUB-PRODUCT: Runtime  Description: STM Manual Runtime
      FILESET: STM-CATALOGS  Description: HPUX STM Shared Libraries
      FILESET: STM-SHLIBS  Description: HPUX STM Shared Libraries
      FILESET: STM-UI-RUN  Description: HPUX STM User Interface
      FILESET: STM-UUT-RUN  Description: HPUX STM Unit Under Test Runtime
  SD PRODUCT: EMS-Config  Description: EMS Config
      FILESET: EMS-GUI  Description: Event Monitoring Service Graphical User Interface
  SD PRODUCT: EMS-Core  Description: EMS Core Product
      FILESET: EMS-CORE  Description: Event Monitoring Service Core Files