These release notes cover the September 2000 (IPR 0009) release of Support Plus for HP-UX 11.00/10.20 running on S800/S700 systems.
- Overview
- Configuring Hardware Monitoring
- Documentation
- Changes
- Customer-Visible Interface Changes
- Known Problems
- Monitors Provided
- Monitor Dependencies
- Defect Reporting
- SD Product Structure
NOTE: As of the September 1999 release, the name of the Diagnostic/IPR Media has been changed to Support Plus. In addition, the format has changed so that there is a separate CD-ROM for each version of the operating system (HP-UX 10.20 and HP-UX 11.0).
Included on the Support Plus CD-ROM are the EMS Hardware Monitors - an important tool for maintaining system availability. The EMS hardware monitors allow you to monitor the operation of a wide variety of hardware products and be alerted immediately if any failure or other unusual event occurs. Hardware event monitoring is available to users running HP-UX 10.20 or 11.X (IPR 9902 and later).
Hardware event monitoring provides a high level of protection against system hardware failure. By using hardware event monitoring, you can virtually eliminate undetected hardware failures that could interrupt system operation or cause data loss.
Configuring Hardware Monitoring
The EMS Hardware Monitors are installed at the same time as the Support Tools Manager. Once the monitoring software is installed, monitoring is automatically enabled.
By default, messages regarding major warning, serious, and critical events that occur on monitored hardware will be:
- Written to /var/adm/syslog/syslog.log
- Sent to EMAIL address root
All events will be stored in /var/opt/resmon/log/event.log.
To configure, enable, or disable hardware event monitoring, run the monitoring request manager: /etc/opt/resmon/lbin/monconfig .
The Peripheral Status Monitor (PSM) and the Kernel Resource Monitor (krmond) are configured differently: they use the EMS GUI. See: http://docs.hp.com/hpux/onlinedocs/diag/ems/ems_gui.htm
Documentation
For the latest and most complete information on EMS Hardware Monitors and the Support Tools Manager (STM), see the Web page "Diagnostics":
http://docs.hp.com/hpux/diag/
At this site, you will find Overviews, Tutorials, Quick Reference Cards, Frequently Asked Questions (FAQs), and much other material.
For complete information on installing and using EMS hardware monitors, as well as a list of supported hardware, refer to the "EMS Hardware Monitors User's Guide" available at the above site. An electronic copy of this book is also included on the Support Plus CD-ROM in the <mount_point>/DIAGNOSTICS directory.
Changes in the EMS Hardware Monitors for the September 2000 (IPR 0009) release include:
- New monitor: the UPS Monitor (dm_ups). This monitor supports HP Uninterruptible Power Systems (UPSs), including HP PowerTrust, HP PowerTrust II, and Explorer UPS. It is supported only on S800 systems, because it requires the HP-UX monitoring daemon, ups_mond, which is supplied on Series 800 systems but not on Series 700 systems (workstations).
- New monitor: the Remote Monitor (RemoteMonitor). This monitor supports all devices managed by HP device management software, currently HP A6188A storage array (codename Low-end "Cassini") and HP A6189A storage array (codename High-end "Cassini"). Current plans are for many different types of devices to be supported, including disk drives, disk arrays, disk jbods, tape drives, tape libraries, FC hubs, switches and bridges.
- Enhanced A5158A Fibre Channel Adapter Monitor (dm_TL_adapter) to be multiple-view (Predictive-enabled). This dm_TL_adapter monitor is only supported on HP-UX 11.x.
- Enhanced Fibre Channel Adapters Monitor (dm_FCMS_adapter) to be multiple-view (Predictive-enabled).
- Fixed a problem that prevented some EMS monitors from being able to register, in particular the Disk Monitor (disk_em). Symptoms of this problem include entries in the diaglogd activity log showing the unsuccessful attempts of a monitor trying to register. Excessive CPU usage by diaglogd is another symptom.
- Fixed a problem with the System Status monitor (sysstat_em), whereby if the diagmond daemon was not running on one system being monitored, the monitor would from then on report that diagmond was not running on all systems.
- Fixed a problem with the Disk Monitor (disk_em), whereby disk_em was using excessive amounts of CPU time while monitoring and generating events for XP256/XP512 devices.
- Fixed "ILLEGAL REQUEST" problems that occurred in the SCSI Tape Devices Monitor (dm_stape).
- Made several enhancements to the SCSI Tape monitor (dm_stape):
- Modified perform_polling.c to add Serial Number to list of component data items logged in event.log file.
- Modified default_dm_stape.clcfg to enable logging Serial Number.
- Added test code to perform_polling.c file to force events to occur.
- Added self installer script file dm_stape.install
- Added file update program dm_stape_wizard.c
- Added support to Core Hardware Monitor (dm_core_hw) for L-Class, Models L3000 (codename "Marcato").
- Fixed a problem in the EMS Hardware Monitors that occurs if the customer installs the EMS p-client patch, which causes p-client to hold off all other clients until it has finished processing the persistence files. If the patch is installed without this fix, all monitoring requests for the hardware monitors listed below disappear after a reboot, and no instances are listed for those monitors. The monitoring requests can be recovered only by disabling and then re-enabling monitoring after the reboot. Affected monitors:
- scsi123
- ses_enclosure
- sysstat_em
- scsi_cascade
- fc_scsi_mux
- ha_disk_array
- scsi_disk
- fw_disk_array
- Added STM and EMS versions to monconfig header. See Customer-Visible Interface Changes below. (UPDATE: Sept 12. The STM version that appears in the monconfig header is incorrect. This should be fixed in an upcoming version of diagnostics.)
- Added more possible exit error messages for psmmon. See Customer-Visible Interface Changes below.
- Changes to the wording and numbering of 17 SCSI event messages. See Customer-Visible Interface Changes below.
- Modified Peripheral Status monitor functionality to set state of removed resource instances to DOWN and not allow the customer to set them back to UP until they are restored to the system.
The Peripheral Status monitor depends on monitors to indicate when devices should be set to the DOWN state. Unfortunately, there were race conditions: when a device was removed from the system, the OS could remove it from the I/O table (which monitors use to determine what devices they should be watching) before the monitor could determine that the device should be DOWN. Thus, the state of a device could be UP even though the device had been removed from the system. In addition, even if the monitor did notice that the device was removed and set the state to DOWN, it would soon stop monitoring the device. The customer would then set the state back to UP, assuming that the device was still being monitored, and be confused when the state didn't change back to DOWN again. So, the Peripheral Status Monitor was modified to set the state of any device that is no longer recognized by any Sentinel monitor to the DOWN state. The set_fixed command was also modified so that a customer cannot set the state of a device to UP when that device was set to DOWN due to removal from the system. The device must be returned to the system, and the monitor must again recognize the device and start monitoring it, before the state can be set to UP.
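The rule described above can be pictured with a small toy model. This is only an illustrative sketch, not the Peripheral Status Monitor's actual implementation; the class and method names are hypothetical, and only the UP/DOWN states and the set_fixed behavior come from the description above.

```python
class PeripheralStatus:
    """Toy model of one device's state (hypothetical names throughout)."""

    def __init__(self):
        self.state = "UP"
        self.removed = False  # True while no monitor recognizes the device

    def device_unrecognized(self):
        # No monitor sees the device any more: force DOWN and remember why.
        self.state = "DOWN"
        self.removed = True

    def device_recognized(self):
        # A monitor is watching the device again; the state is left unchanged.
        self.removed = False

    def set_fixed(self):
        """Try to set the state back to UP; refuse while the device is removed."""
        if self.removed:
            return False
        self.state = "UP"
        return True


dev = PeripheralStatus()
dev.device_unrecognized()          # device pulled from the system
refused = not dev.set_fixed()      # cannot be set to UP while removed
dev.device_recognized()            # device restored and monitored again
accepted = dev.set_fixed()         # now the state may be set to UP
```

The point of the model is the ordering: the request to return to UP is accepted only after the device has been recognized again.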
- Added functionality to allow the user to disable monitoring of a particular instance. A detailed description follows:
Added functionality to the startmon_client binary to check a list of instances in the /var/stm/data/tools/monitor/disabled_instances file and not create monitoring requests from the *.sapcfg files for those instances.
The disabled_instances file is a text file that lists fully qualified instances, one instance per line. Wildcards can be used in the instance names to specify more than one instance. For example: /storage/events/disks/default/* could be used to specify all the instances associated with the default disk resource names, or /storage/events/* could be used to disable all the instances for all storage.
For those instances listed in the disabled instance file, no monitoring requests, with the exception of those created by psmctd (shown with a TCP as target), will show up in the list displayed by the monconfig "C)heck monitoring" command.
NOTE: This does not mean that the monitor will stop polling the device, it just means that any events will not be forwarded to the user based on information in the *.sapcfg files. In addition, it does NOT mean that psmctd/psmmon will stop watching the device either, so the device could go into the DOWN state even though the user didn't see any events.
In order to use the disabled_instances file, the user must perform the following:
- Add/delete/modify instances in the disabled_instances file
- Run monconfig
- Select the "E)nable Monitoring" command
- Wait for monitoring to be re-enabled
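To check which instances a disabled_instances entry would cover before re-enabling monitoring, the wildcard rule can be approximated in a few lines of Python. This is a rough sketch: it assumes that "*" behaves like a shell-style wildcard that also spans "/" (consistent with the /storage/events/* example above), which may not match startmon_client exactly. The instance names below are made up for illustration.

```python
from fnmatch import fnmatchcase


def load_patterns(text):
    """Return the non-empty lines of a disabled_instances file as patterns."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def is_disabled(instance, patterns):
    """True if the instance name matches any pattern (shell-style '*')."""
    return any(fnmatchcase(instance, pattern) for pattern in patterns)


# Hypothetical file contents and instance names, for illustration only.
FILE_TEXT = "/storage/events/disks/default/*\n"
DISK = "/storage/events/disks/default/10_12_5.0.0"
TAPE = "/storage/events/tapes/SCSI_tape/10_12_6.0.0"

patterns = load_patterns(FILE_TEXT)
disk_disabled = is_disabled(DISK, patterns)   # matches the disk pattern
tape_disabled = is_disabled(TAPE, patterns)   # not covered by the pattern
```

Note that fnmatchcase also treats "?" and "[...]" as wildcards; whether the disabled_instances file does the same is not stated here.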
- Modified psmctd to remove monitor requests from previous instances of psmctd when it receives a SIGUSR1 signal.
Psmctd removes monitor requests when it starts and when it exits. However, there are corner cases where all the monitor requests from previous instances of psmctd are not available from EMS to be removed when psmctd starts or exits. This is especially likely at reboot or when monitors are slow to come up or if diaglogd is slow to come up. So, old monitor requests are left hanging around.
Psmctd has been modified to remove the monitor requests that it has not created whenever it receives a SIGUSR1. Psmctd receives a SIGUSR1 whenever monitoring is enabled/disabled or when the ioscan data changes. This fix plugs the last hole where psmctd could leave old monitoring requests active.
- Fixed a problem that had two symptoms:
- Failure to get events from the monitor for problems logged from the OS
- Continuously growing diaglogd_hold_list file in /var/stm/data
The tldecmon library had active debug statements that were not being logged but were causing performance problems. diaglogd would stop sending OS errors to any monitor until monitoring was stopped and restarted, if the monitor's FIFO was ever filled. The FIFO could fill if there were many OS errors for that monitor in a short amount of time. (NOTE: As of Aug 30, this problem is still under investigation. It may not be completely fixed in the 09 release. As more information becomes available, it will be posted here.)
Customer-Visible Interface Changes
This section reports changes to the customer-visible interface in this release. This information is provided for the benefit of customers who use scripts to drive hardware support tools or to examine their output.
CHANGE: In the IPR 0009 and HP-UX 11i release, the header displayed when "monconfig" is executed has been changed to include the STM and EMS version numbers.
BEFORE:
============================================================================
=================== Event Monitoring Service ===================
=================== Monitoring Request Manager ===================
============================================================================
EVENT MONITORING IS CURRENTLY ENABLED.
AFTER:
============================================================================
=================== Event Monitoring Service ===================
=================== Monitoring Request Manager ===================
============================================================================
EVENT MONITORING IS CURRENTLY ENABLED.
EMS Version : A.03.10 STM Version : A.22.10
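Scripts that read the monconfig header may need to tolerate both the old and the new form. The following is a minimal sketch of one way to pick out the version fields; it assumes the captured header text contains the labels "EMS Version :" and "STM Version :" exactly as shown above, and the function name is hypothetical.

```python
import re

# Matches "EMS Version : A.03.10" or "STM Version : A.22.10" anywhere in the text.
VERSION_FIELD = re.compile(r"(EMS|STM) Version\s*:\s*(\S+)")


def parse_versions(header_text):
    """Return {'EMS': ..., 'STM': ...} for the fields present in the header."""
    return dict(VERSION_FIELD.findall(header_text))


new_header = (
    "EVENT MONITORING IS CURRENTLY ENABLED.\n"
    "EMS Version : A.03.10 STM Version : A.22.10\n"
)
old_header = "EVENT MONITORING IS CURRENTLY ENABLED.\n"

new_versions = parse_versions(new_header)
old_versions = parse_versions(old_header)   # empty: older headers carry no versions
```

Returning an empty dictionary for the older header lets a script treat the version fields as optional rather than failing on pre-IPR 0009 output.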
CHANGE: Added more possible exit error messages for psmmon. These messages may be logged into the /etc/opt/resmon/log/api.log file when the monitor exits abnormally:
------------------------Start Event--------------------------------
User event occurred at Tue May 23 14:14:38.789544 2000
Process ID: 1246 (/usr/sbin/stm/uut/bin/tools/.../psmmon)
Log Level: Error
/usr/sbin/stm/uut/bin/tools/monitor/psmmon: Exiting due to receipt of signal 11.
------------------------Start Event--------------------------------
------------------------Start Event--------------------------------
User event occurred at Tue May 23 14:14:38.789544 2000
Process ID: 1246 (/usr/sbin/stm/uut/bin/tools/.../psmmon)
Log Level: Error
/usr/sbin/stm/uut/bin/tools/monitor/psmmon: Exiting due to SIGINT signal.
------------------------Start Event--------------------------------
------------------------Start Event--------------------------------
User event occurred at Tue May 23 14:14:38.789544 2000
Process ID: 1246 (/usr/sbin/stm/uut/bin/tools/.../psmmon)
Log Level: Error
/usr/sbin/stm/uut/bin/tools/monitor/psmmon: Exiting due to error with exit value 0xXX.
------------------------Start Event--------------------------------
------------------------Start Event--------------------------------
User event occurred at Tue May 23 14:14:38.789544 2000
Process ID: 1246 (/usr/sbin/stm/uut/bin/tools/.../psmmon)
Log Level: Info
/usr/sbin/stm/uut/bin/tools/monitor/psmmon: Exiting normally.
------------------------Start Event--------------------------------
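A script that watches api.log for these messages could extract them roughly as follows. This is a sketch only: it assumes each psmmon exit message directly follows its "Log Level:" field and ends with a period, as in the samples above, and it is indifferent to how the fields are split across lines. The function name is hypothetical.

```python
import re

# "Log Level: <level>" followed by "<path>psmmon: Exiting ... ."
EXIT_EVENT = re.compile(r"Log Level:\s*(\w+)\s+\S+psmmon:\s*(Exiting [^.]*)\.")


def psmmon_exits(log_text):
    """Return a list of (log_level, message) pairs for psmmon exit events."""
    return EXIT_EVENT.findall(log_text)


sample = (
    "Log Level: Error\n"
    "/usr/sbin/stm/uut/bin/tools/monitor/psmmon: Exiting due to SIGINT signal.\n"
    "Log Level: Info\n"
    "/usr/sbin/stm/uut/bin/tools/monitor/psmmon: Exiting normally.\n"
)

events = psmmon_exits(sample)
abnormal = [message for level, message in events if level == "Error"]
```

Filtering on the log level rather than on the message text keeps the script working if further exit messages are added later.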
In the IPR 0009 and HP-UX 11i release, about 17 changes were made to text under the "Description/Cause Action" and "Details" headings in the Default SCSI events generated and decoded by SCSI Device monitors/decoders. These events may be reported by any hardware monitor for SCSI devices.
-----------
For Event #100837, 100937, 101826, and 101726, the "Details" text did not display the Additional Sense Code and Additional Sense Qualifier description text. The following text was added:
100837
The combination of Additional Sense Code and Sense Qualifier (0x110b) indicates: Unrecovered read error. Recommend reassignment.
100937
The combination of Additional Sense Code and Sense Qualifier (0x110c) indicates: Unrecovered read error. Recommend rewrite.
101726
The combination of Additional Sense Code and Sense Qualifier (0x1805) indicates: Recovered data. Recommend reassign.
101826
The combination of Additional Sense Code and Sense Qualifier (0x1806) indicates: Recovered data. Recommend rewrite.
---------
For Event #100068, the "Details" text decoding of the Additional Sense Code and Additional Sense Qualifier was incorrect. It indicated "Ram Failure". The correct decoding is:
The combination of Additional Sense Code and Sense Qualifier (0x4000) indicates: Power-on or self-test failure for FRU indicated by sense code qualifier.
NOTE: Valid values for the Additional Sense Code and Sense Qualifier for event #100068 range from 0x4000 to 0x40FF, where 0x40 is the Additional Sense Code.
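The four-digit values quoted in these messages combine two one-byte fields: the high byte is the Additional Sense Code and the low byte is the Sense Qualifier. With the Additional Sense Code fixed at 0x40, the combined value for event #100068 therefore runs from 0x4000 through 0x40FF. A short sketch of that arithmetic, with hypothetical function names:

```python
def split_sense(combined):
    """Split a 16-bit combined value into (sense_code, sense_qualifier)."""
    if not 0 <= combined <= 0xFFFF:
        raise ValueError("combined value must fit in 16 bits")
    return combined >> 8, combined & 0xFF


def is_event_100068_value(combined):
    """True when the Additional Sense Code byte is 0x40 (0x4000 to 0x40FF)."""
    sense_code, _qualifier = split_sense(combined)
    return sense_code == 0x40


code, qualifier = split_sense(0x110B)   # value quoted for event #100837
in_range = is_event_100068_value(0x40FF)
out_of_range = is_event_100068_value(0x4100)
```

For 0x110b this gives an Additional Sense Code of 0x11 and a Sense Qualifier of 0x0b.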
-------------
For Event #100837, 100937, 100208, 101126, 101026, 100271, and 100872, the Description of the Error changed:
100837
The device was unsuccessful in reading the data for the current I/O request. Reassignment to a spare area on the medium is recommended.
100937
The device was unsuccessful in reading the data for the current I/O request. Rewriting the data is recommended.
100208
The medium in the device is incompatible with the device.
101126
The device was unsuccessful in its first attempt at reading the data requested in an I/O request, but was able to recover it. The requested data was successfully returned. Rewriting the data is recommended.
101026
The device was unsuccessful in its first attempt at reading the data requested in an I/O request, but was able to recover it. The requested data was successfully returned. Reassignment to a spare area on the medium is recommended.
100271
The device aborted the command. The initiator may be able to recover by retrying the command.
100872
The device aborted the command. The initiator may be able to recover by retrying the command.
-----------
For Event #100208, 101126, 101826, and 100999 (formerly 100299), the Cause/Action text was changed:
100208
Replace the medium with one that is compatible with the device.
101126
Rewrite of the data on the medium is recommended.
101826
Rewrite of the data on the medium is recommended.
100999
The error most likely indicates that the device is not fully supported by the current driver. This may or may not cause a problem in the operation of the device.
------------
Event #100999 replaces Event #100299.
CAUTION: Monitoring Changes for disc30, sdisk and disk array devices
As of IPR 9902 (Feb 99 release), there has been a change to the way that monitoring is done for disc30, sdisk and the HA Disk Array Models 10, 20, and 30FC.
Formerly, the "diaglogd exec" programs (pdisc30_exec, pharaymon_exec, and psdisk_exec) handled driver error entries for these devices.
As of IPR 9902, these programs have been deleted and their functionality is now provided by the EMS Hardware Monitors.
If you had customized the configuration files for the diaglogd exec programs (disk30_exec.cfg, sdisk_exec.cfg, and haraymon_exec.cfg) you may wish to re-configure the EMS Hardware Monitors to achieve the same results.
CAUTION: Compatibility Problem with EMS-Related Products (ServiceGuard, HA Monitors, etc.)
If you install the OnlineDiag bundle (Dec 99 or later) onto a computer running older revisions of EMS-related products, these products may experience compatibility problems. Affected products include MC/ServiceGuard, ServiceGuard OPS Edition and High Availability Monitors. The only critical problems occur with the following versions:
MC/ServiceGuard A.10.10, A.11.01, A.11.03
ServiceGuard OPS Edition A.11.02, A.11.03
Support Tools and the EMS hardware monitors are not affected. For complete information, see EMS Incompatibility Problem.
Monitors are provided to support the following:
- AutoRAID Disk Array (armmon)
- Core Hardware (dm_core_hw)
- Disk (disk_em)
- Disk Array FC60 (fc60mon)
- Fast Wide SCSI Disk Array (fw_disk_array)
- Fibre Channel Adapters (dm_FCMS_adapter)
- Fibre Channel Adapter Model A5158 (dm_TL_adapter)
- Fibre Channel Arbitrated Loop Hub (dm_fc_hub)
- Fibre Channel SCSI Multiplexer (dm_fc_scsi_mux)
- Fibre Channel Switch (dm_fc_sw)
- High Availability Disk Array (ha_disk_array)
- High Availability Storage System (dm_ses_enclosure)
- Kernel Resource (krmond)
- LPMC (lpmc_em)
- Memory (dm_memory)
- Remote (RemoteMonitor) NEW for Sept 2000 release
- SCSI Card (scsi123_em)
- SCSI Tape Devices (dm_stape)
- System Status (sysstat_em)
- UPS (dm_ups) NEW for Sept 2000 release
In addition, a Hardware status monitor is provided to monitor the current status of the products supported by the above list.
For detailed information concerning which products are supported by which monitors and additional dependencies, check the "Diagnostics" section of Hewlett-Packard's online documentation web site: http://docs.hp.com/hpux/diag/ .
Several of the monitors have special requirements, such as patches or certain versions of firmware. In particular:
- The Fibre Channel Arbitrated Loop Hub Monitor and the Fibre Channel Switch Monitor require special configuration, which is described in their data sheets in the "EMS Hardware Monitors User's Guide" (chapter 6). A patch is also required.
- A patch is required if your system includes an HP SureStore E Disk Array FC60. This patch is required to run the EMS hardware monitor (fc60mon) or STM tools for this device.
For a list of the current required patches, see the DIAGNOSTIC.readme file for this release.
Current monitor requirements are described in the "Supported Products" page under "EMS Hardware Monitors" at http://docs.hp.com/hpux/diag . Requirements are also listed in chapter 2 of the manual "EMS Hardware Monitors User's Guide".
Use CHART to report defects in the EMS Hardware monitors. The project name is diag.hw_mon.hpux. If you don't have access to CHART, contact an HP representative to enter a defect for you.
The EMS hardware monitors are installed as part of the OnlineDiag bundle (product number B4708AA). In addition, they utilize the EMS framework, product number B7609BA.
Note: EMS Hardware Monitors are installed as part of the STM-UUT-RUN Fileset. However, the EMS Hardware Monitors are dependent on the EMS-Core and EMS-Config products and additional filesets in the Sup-Tool-Mgr Product.
For information on the STM product, refer to the STM release notes file /usr/sbin/stm/Rel_NOTES.STM.
SD Bundle: OnlineDiag
  Description: On-line Diagnostic System (Series 800/700)
  SD PRODUCT: Sup-Tool-Mgr
    Description: Support Tools Manager for HP-UX Systems
    SD SUB-PRODUCT: Manuals
      Description: Support Tools Manager Manual Pages
      FILESET: RELEASE_NOTES
        Description: HPUX STM Release Notes
      FILESET: STM-MAN
        Description: HPUX STM Manual Pages
    SD SUB-PRODUCT: Runtime
      Description: STM Manual Runtime
      FILESET: STM-CATALOGS
        Description: HPUX STM Shared Libraries
      FILESET: STM-SHLIBS
        Description: HPUX STM Shared Libraries
      FILESET: STM-UI-RUN
        Description: HPUX STM User Interface
      FILESET: STM-UUT-RUN
        Description: HPUX STM Unit Under Test Runtime
  SD PRODUCT: EMS-Config
    Description: EMS Config
    FILESET: EMS-GUI
      Description: Event Monitoring Service Graphical User Interface
  SD PRODUCT: EMS-Core
    Description: EMS Core Product
    FILESET: EMS-CORE
      Description: Event Monitoring Service Core Files