[openib-general] [ANNOUCEv2] OpenIB OpenSM 1.1.0: trunk now supports 1.8.0 features

Hal Rosenstock halr at voltaire.com
Tue Sep 13 19:12:24 PDT 2005


Hi Troy,

On Tue, 2005-09-13 at 20:12, Troy Benjegerdes wrote:

Here is my analysis of the log you provided. I need to do a little more
digging. I am curious as to the switch type and firmware versions of
that switch and the failed HCA.

> At the log entry 'Sep 13 12:06:55', I plugged in the node that is hung/crashed
> .. which caused a bunch of opensm errors.. I have since unplugged that
> node, and can put it back in tomorrow if you want more debug info.

At that point in time, we see the following: 

Sep 13 12:06:55 936933 [417FF970] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E TID:0x0000000000000013
Sep 13 12:06:55 937087 [417FF970] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0000 GID:0xfe80000000000000,0x0002c90200402915
Sep 13 12:06:56 354422 [42FFF970] -> umad_receiver: ERR 5409: send completed with error (method=1 attr=11) -- dropping.
Sep 13 12:06:56 354439 [42FFF970] -> umad_receiver: ERR 5411: DR SMP hop ptr 0 hop count 3 DR SLID 0x0 DR DLID 0x0
Sep 13 12:06:56 354449 [42FFF970] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT).

Trap 128 is the urgent "link state of switch port changed" trap.
It looks like a solicited send failed (SubnGet NodeInfo). We had an
exchange on the list a while ago about this kind of failure with an
unresponsive port.

Sep 13 12:06:56 363771 [40FFF970] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c90200402915 port 12. Adding to light sweep sampling list.
Sep 13 12:06:56 363815 [40FFF970] -> Directed Path Dump of 2 hop path:
                                Path = [0][1][D]

The DR display is showing the path to the switch. The dump of the SMP shows:
                                hop_ptr.................0x0
                                hop_count...............0x3
                                Initial path: [0][1][D][C]
Also, the GUID cited is an HCA GUID rather than a switch GUID so I doubt
it has 12 ports. I think these are just problems with the debug
messages.

Earlier in the log:

Sep 13 12:03:51 959970 [417FF970] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x0001 GID:0xfe80000000000000,0x0002c90200402781
Sep 13 12:03:51 959986 [417FF970] -> Discovered new port with GUID:0x0002c90200402915 LID range [0xE,0xE] of node:MT47396 Infiniscale-III Mellanox Technologies 
Sep 13 12:03:51 959996 [417FF970] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x0001 GID:0xfe80000000000000,0x0002c90200402781

It appears that the failed node is an MT47396 off switch 0x0002c90200402781.
What firmware version is running in both of these? What is switch 0x0002c90200402781?

A minor issue, but the DR display above is not correct. The dump of the SMP shows:
                                hop_ptr.................0x0
                                hop_count...............0x3
                                Initial path: [0][1][D][C]

It seems to repeat this over and over again every few seconds until
things break, I presume at 12:07:57.

The key to me is that OpenSM continues to receive: 
Sep 13 12:07:23 542642 [40FFF970] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E TID:0x000000000000002c
Sep 13 12:07:23 542771 [40FFF970] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0000 GID:0xfe80000000000000,0x0002c90200402915

Either OpenSM never shuts this off or it keeps bouncing the port in the
light sweep. I need to investigate this further.
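
To see how often these traps arrive, the "Received Generic Notice" lines
can be pulled out of the log and grouped by trap number and source LID.
The following is only a sketch: the regular expression is written against
the line format quoted in this mail, and the function and variable names
are made up for illustration; none of this is part of OpenSM.

```python
import re

# Matches "Received Generic Notice" lines in the format quoted in this mail.
NOTICE_RE = re.compile(
    r"^(?P<date>\w{3} +\d+ \d\d:\d\d:\d\d) \d+ \[[0-9A-Fa-f]+\] -> "
    r"__osm_trap_rcv_process_request: Received Generic Notice "
    r"type:0x(?P<type>[0-9A-Fa-f]+) num:(?P<num>\d+) Producer:\d+ "
    r"from LID:0x(?P<lid>[0-9A-Fa-f]+) TID:0x(?P<tid>[0-9A-Fa-f]+)"
)


def summarize_traps(lines):
    """Return {(trap_num, lid): [(timestamp, tid), ...]} for received notices."""
    seen = {}
    for line in lines:
        m = NOTICE_RE.match(line)
        if m:
            key = (int(m.group("num")), int(m.group("lid"), 16))
            entry = (m.group("date"), int(m.group("tid"), 16))
            seen.setdefault(key, []).append(entry)
    return seen


# Three of the lines quoted in this mail (hypothetical variable name).
SAMPLE = [
    "Sep 13 12:06:55 936933 [417FF970] -> __osm_trap_rcv_process_request: "
    "Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E "
    "TID:0x0000000000000013",
    "Sep 13 12:07:23 542642 [40FFF970] -> __osm_trap_rcv_process_request: "
    "Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E "
    "TID:0x000000000000002c",
    "Sep 13 12:07:56 574831 [40FFF970] -> __osm_trap_rcv_process_request: "
    "Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E "
    "TID:0x0000000000000057",
]

if __name__ == "__main__":
    for (num, lid), hits in summarize_traps(SAMPLE).items():
        print("trap %d from LID 0x%04X: %d notices" % (num, lid, len(hits)))
        for when, tid in hits:
            print("  %s TID 0x%x" % (when, tid))
```

Run over the full log, the spacing of the timestamps and the growth of the
TID values should give a rough idea of the trap rate, assuming the sender
increments the TID for each trap it sends.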

It all ends when: 
Sep 13 12:07:56 574831 [40FFF970] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E TID:0x0000000000000057
Sep 13 12:07:56 574961 [40FFF970] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0000 GID:0xfe80000000000000,0x0002c90200402915
Sep 13 12:07:56 719968 [417FF970] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E TID:0x0000000000000058
Sep 13 12:07:56 720052 [417FF970] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0000 GID:0xfe80000000000000,0x0002c90200402915

and then that switch returns a bad status in a SM GetResp PortInfo (in
response to a SM Set PortInfo):

Sep 13 12:07:57 005832 [42FFF970] -> SMP dump:
                                base_ver................0x1
                                mgmt_class..............0x81
                                class_ver...............0x1
                                method..................0x81 (SubnGetResp)
                                D bit...................0x1
                                status..................0x1C00
                                hop_ptr.................0x0
                                hop_count...............0x2
                                trans_id................0x455a
                                attr_id.................0x15 (PortInfo)
                                resv....................0x0
                                attr_mod................0xC
                                m_key...................0x0000000000000000
                                dr_slid.................0xFFFF
                                dr_dlid.................0xFFFF

                                Initial path: [0][1][D]
                                Return path:  [0][1][18]
                                Reserved:     [0][0][0][0][0][0][0]

                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                                00 00 00 00 00 00 00 00   00 00 00 00 18 03 03 02

                                31 22 00 13 40 40 00 08   08 04 F2 40 00 00 00 00

                                00 00 00 00 00 88 00 00   00 00 00 00 00 00 00 00

Sep 13 12:07:57 005891 [40FFF970] -> osm_pi_rcv_process_set: ERR 0F10: Received Error Status for SetResp()
Sep 13 12:07:57 005908 [40FFF970] -> PortInfo dump:
                                port number.............0xC
                                node_guid...............0x0002c90200402915
                                port_guid...............0x0002c90200402915
                                m_key...................0x0000000000000000
                                subnet_prefix...........0x0000000000000000
                                base_lid................0x0
                                master_sm_base_lid......0x0
                                capability_mask.........0x0
                                diag_code...............0x0
                                m_key_lease_period......0x0
                                local_port_num..........0x18
                                link_width_enabled......0x3
                                link_width_supported....0x3
                                link_width_active.......0x2
                                link_speed_supported....0x3
                                port_state..............DOWN
                                state_info2.............0x22
                                m_key_protect_bits......0x0
                                lmc.....................0x0
                                link_speed..............0x13
                                mtu_smsl................0x40
                                vl_cap_init_type........0x40
                                vl_high_limit...........0x0
                                vl_arb_high_cap.........0x8
                                vl_arb_low_cap..........0x8
                                init_rep_mtu_cap........0x4
                                vl_stall_life...........0xF2
                                vl_enforce..............0x40
                                m_key_violations........0x0
                                p_key_violations........0x0
                                q_key_violations........0x0
                                guid_cap................0x0
                                subnet_timeout..........0x0
                                resp_time_value.........0x0
                                error_threshold.........0x88
Sep 13 12:07:57 005951 [40FFF970] -> Capabilities Mask:
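
The PortInfo dump above is a field-by-field reading of the hex bytes in
the SMP dump. As a cross-check, a few single-byte fields can be picked
out of those bytes directly. This is only a sketch: the byte offsets are
inferred by lining the hex dump up with the decoded values shown here,
not taken from the OpenSM headers, and the port state names assume the
usual encoding (1 = Down, 2 = Init, 3 = Armed, 4 = Active).

```python
# The four 16-byte rows from the SMP dump in this mail.
DUMP = """
00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00   00 00 00 00 18 03 03 02
31 22 00 13 40 40 00 08   08 04 F2 40 00 00 00 00
00 00 00 00 00 88 00 00   00 00 00 00 00 00 00 00
"""

# Offsets inferred from the dump itself (an assumption, see above).
OFFSETS = {
    "local_port_num": 28,
    "link_width_enabled": 29,
    "link_width_supported": 30,
    "link_width_active": 31,
    "state_info2": 33,
    "link_speed": 35,
    "mtu_smsl": 36,
    "vl_stall_life": 42,
    "vl_enforce": 43,
    "error_threshold": 53,
}

PORT_STATES = {1: "DOWN", 2: "INIT", 3: "ARMED", 4: "ACTIVE"}


def decode_port_info(dump_text):
    """Decode selected one-byte PortInfo fields from a hex dump string."""
    data = bytes.fromhex("".join(dump_text.split()))
    fields = {name: data[offset] for name, offset in OFFSETS.items()}
    # Byte 32 packs two values: speed supported (high nibble) and
    # port state (low nibble).
    state_info1 = data[32]
    fields["link_speed_supported"] = state_info1 >> 4
    fields["port_state"] = PORT_STATES.get(state_info1 & 0x0F, "UNKNOWN")
    return fields


if __name__ == "__main__":
    for name, value in decode_port_info(DUMP).items():
        if isinstance(value, int):
            print("%-22s 0x%X" % (name, value))
        else:
            print("%-22s %s" % (name, value))
```

The values this produces agree with the decoded dump (local_port_num 0x18,
port_state DOWN, error_threshold 0x88), so the decoded dump and the raw
bytes are at least consistent with each other.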

That is when things stop working. Likely multicast in that switch is not
working. I'd be curious whether the multicast setup in that switch is
trashed or not. That can be determined with the diag tools. Let me know
if you would like me to document the procedure for this.

There is a pending issue with Sets of PortInfo getting this status back
which has been discussed on this list. Not sure whether this is a related
problem or not.
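
For reference, here is a rough sketch of how the 0x1C00 status might be
read using the common MAD status bit layout (bit 0 busy, bit 1 redirect,
bits 2-4 invalid-field code, upper byte class specific). The helper names
are hypothetical. Which byte order the dump printed the field in is not
something the log settles, so the sketch decodes both readings rather
than assuming one.

```python
# Invalid-field codes carried in bits 2-4 of the common MAD status.
INVALID_FIELD_CODES = {
    0: "no invalid fields",
    1: "bad base or class version",
    2: "method not supported",
    3: "method/attribute combination not supported",
    7: "one or more attribute or attribute-modifier fields invalid",
}


def decode_common_status(status):
    """Decode a 16-bit MAD status value given in host byte order."""
    code = (status >> 2) & 0x7
    return {
        "busy": bool(status & 0x1),
        "redirect": bool(status & 0x2),
        "invalid_field": INVALID_FIELD_CODES.get(code, "reserved"),
        "class_specific": (status >> 8) & 0xFF,
    }


def swap16(value):
    """Swap the two bytes of a 16-bit value."""
    return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)


if __name__ == "__main__":
    printed = 0x1C00  # value as it appears in the SMP dump
    print("as printed:  ", decode_common_status(printed))
    print("byte-swapped:", decode_common_status(swap16(printed)))
```

Read as printed, the value has only class-specific bits set. Read
byte-swapped (0x001C), it is invalid-field code 7, meaning the switch
objected to one or more field values in the Set. The second reading would
fit a rejected Set of PortInfo, but that is a guess until the byte order
of the dump is confirmed.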

-- Hal



