[openib-general] [ANNOUCEv2] OpenIB OpenSM 1.1.0: trunk now supports 1.8.0 features
Hal Rosenstock
halr at voltaire.com
Tue Sep 13 19:12:24 PDT 2005
Hi Troy,
On Tue, 2005-09-13 at 20:12, Troy Benjegerdes wrote:
Here is my analysis of the log you provided. I need to do a little more
digging. I am curious as to the switch type and firmware versions of
that switch and the failed HCA.
> At the log entry 'Sep 13 12:06:55', I plugged in the node that is hung/crashed
> .. which caused a bunch of opensm errors.. I have since unplugged that
> node, and can put it back in tomorrow if you want more debug info.
At that point in time, we see the following:
Sep 13 12:06:55 936933 [417FF970] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E TID:0x0000000000000013
Sep 13 12:06:55 937087 [417FF970] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0000 GID:0xfe80000000000000,0x0002c90200402915
Sep 13 12:06:56 354422 [42FFF970] -> umad_receiver: ERR 5409: send completed with error (method=1 attr=11) -- dropping.
Sep 13 12:06:56 354439 [42FFF970] -> umad_receiver: ERR 5411: DR SMP hop ptr 0 hop count 3 DR SLID 0x0 DR DLID 0x0
Sep 13 12:06:56 354449 [42FFF970] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT).
Trap 128 is the urgent trap a switch sends when the link state of one of
its ports changes.
It looks like a solicited send (SubnGet NodeInfo) failed. We had an
exchange on the list a while ago about this in the context of an
unresponsive port.
Sep 13 12:06:56 363771 [40FFF970] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c90200402915 port 12. Adding to light sweep sampling list.
Sep 13 12:06:56 363815 [40FFF970] -> Directed Path Dump of 2 hop path:
Path = [0][1][D]
The DR display is showing the path to the switch. The dump of the SMP shows:
hop_ptr.................0x0
hop_count...............0x3
Initial path: [0][1][D][C]
Also, the GUID cited is an HCA GUID rather than a switch GUID so I doubt
it has 12 ports. I think these are just problems with the debug
messages.
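
To make the mismatch between the two displays concrete, here is a minimal
sketch that parses the bracketed path notation used in the dumps above and
counts hops. The helper names (parse_dr_path, hop_count) are hypothetical and
not part of OpenSM; the sketch assumes entry 0 of a directed-route path is
unused, so the number of hops is the number of entries minus one.

```python
import re

def parse_dr_path(text):
    """Parse a dump string such as '[0][1][D][C]' into port numbers.

    Each bracketed token is read as a hexadecimal port number.
    """
    return [int(tok, 16) for tok in re.findall(r"\[([0-9A-Fa-f]+)\]", text)]

def hop_count(path):
    """Entry 0 is assumed unused, so hops = entries - 1."""
    return len(path) - 1

displayed = parse_dr_path("[0][1][D]")      # from the "Directed Path Dump" line
from_smp = parse_dr_path("[0][1][D][C]")    # from the SMP dump, hop_count 0x3

print(displayed, hop_count(displayed))
print(from_smp, hop_count(from_smp))
```

Under that assumption the displayed path has 2 hops while the path in the SMP
has 3, which agrees with the hop_count of 0x3 in the dump.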
Earlier in the log:
Sep 13 12:03:51 959970 [417FF970] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x0001 GID:0xfe80000000000000,0x0002c90200402781
Sep 13 12:03:51 959986 [417FF970] -> Discovered new port with GUID:0x0002c90200402915 LID range [0xE,0xE] of node:MT47396 Infiniscale-III Mellanox Technologies
Sep 13 12:03:51 959996 [417FF970] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x0001 GID:0xfe80000000000000,0x0002c90200402781
It appears that the failed node is a MT47396 off switch 0x0002c90200402781.
What firmware version is running on each of these? What is switch 0x0002c90200402781?
A minor issue, but the DR display above is not correct. The dump of the SMP shows:
hop_ptr.................0x0
hop_count...............0x3
Initial path: [0][1][D][C]
This sequence seems to repeat every few seconds until things break,
I presume at 12:07:57.
The key to me is that OpenSM continues to receive:
Sep 13 12:07:23 542642 [40FFF970] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E TID:0x000000000000002c
Sep 13 12:07:23 542771 [40FFF970] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0000 GID:0xfe80000000000000,0x0002c90200402915
Either OpenSM never shuts this off or it keeps bouncing the port in the
light sweep. I need to investigate this further.
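
One quick way to see how often the trap repeats is to count the "Received
Generic Notice" lines per trap number and source LID. The following is a rough
sketch, not an OpenSM tool: the regular expression is modeled only on the log
lines quoted in this thread, and the function name count_traps is made up for
illustration.

```python
import re
from collections import Counter

# Pattern modeled on the "Received Generic Notice" lines quoted in this thread.
TRAP_RE = re.compile(
    r"Received Generic Notice type:0x(?P<type>[0-9a-fA-F]+) "
    r"num:(?P<num>\d+) Producer:\d+ "
    r"from LID:0x(?P<lid>[0-9a-fA-F]+) TID:0x(?P<tid>[0-9a-fA-F]+)"
)

def count_traps(lines):
    """Return a Counter keyed by (trap number, source LID)."""
    counts = Counter()
    for line in lines:
        m = TRAP_RE.search(line)
        if m:
            key = (int(m.group("num")), int(m.group("lid"), 16))
            counts[key] += 1
    return counts

sample = [
    "Sep 13 12:06:55 936933 [417FF970] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E TID:0x0000000000000013",
    "Sep 13 12:06:55 937087 [417FF970] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0000 GID:0xfe80000000000000,0x0002c90200402915",
    "Sep 13 12:07:23 542642 [40FFF970] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E TID:0x000000000000002c",
]

for (num, lid), n in count_traps(sample).items():
    print(f"trap {num} from LID 0x{lid:04X}: {n} time(s)")
```

Run against the full log (for example by passing an open file object instead
of the sample list), this would show how many times trap 128 arrived from
LID 0x000E between 12:06:55 and 12:07:57.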
It all ends when:
Sep 13 12:07:56 574831 [40FFF970] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E TID:0x0000000000000057
Sep 13 12:07:56 574961 [40FFF970] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0000 GID:0xfe80000000000000,0x0002c90200402915
Sep 13 12:07:56 719968 [417FF970] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E TID:0x0000000000000058
Sep 13 12:07:56 720052 [417FF970] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0000 GID:0xfe80000000000000,0x0002c90200402915
and then that switch returns a bad status in a SM GetResp PortInfo (in
response to a SM Set PortInfo):
Sep 13 12:07:57 005832 [42FFF970] -> SMP dump:
base_ver................0x1
mgmt_class..............0x81
class_ver...............0x1
method..................0x81 (SubnGetResp)
D bit...................0x1
status..................0x1C00
hop_ptr.................0x0
hop_count...............0x2
trans_id................0x455a
attr_id.................0x15 (PortInfo)
resv....................0x0
attr_mod................0xC
m_key...................0x0000000000000000
dr_slid.................0xFFFF
dr_dlid.................0xFFFF
Initial path: [0][1][D]
Return path: [0][1][18]
Reserved: [0][0][0][0][0][0][0]
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 18 03 03 02
31 22 00 13 40 40 00 08 08 04 F2 40 00 00 00 00
00 00 00 00 00 88 00 00 00 00 00 00 00 00 00 00
Sep 13 12:07:57 005891 [40FFF970] -> osm_pi_rcv_process_set: ERR 0F10: Received Error Status for SetResp()
Sep 13 12:07:57 005908 [40FFF970] -> PortInfo dump:
port number.............0xC
node_guid...............0x0002c90200402915
port_guid...............0x0002c90200402915
m_key...................0x0000000000000000
subnet_prefix...........0x0000000000000000
base_lid................0x0
master_sm_base_lid......0x0
capability_mask.........0x0
diag_code...............0x0
m_key_lease_period......0x0
local_port_num..........0x18
link_width_enabled......0x3
link_width_supported....0x3
link_width_active.......0x2
link_speed_supported....0x3
port_state..............DOWN
state_info2.............0x22
m_key_protect_bits......0x0
lmc.....................0x0
link_speed..............0x13
mtu_smsl................0x40
vl_cap_init_type........0x40
vl_high_limit...........0x0
vl_arb_high_cap.........0x8
vl_arb_low_cap..........0x8
init_rep_mtu_cap........0x4
vl_stall_life...........0xF2
vl_enforce..............0x40
m_key_violations........0x0
p_key_violations........0x0
q_key_violations........0x0
guid_cap................0x0
subnet_timeout..........0x0
resp_time_value.........0x0
error_threshold.........0x88
Sep 13 12:07:57 005951 [40FFF970] -> Capabilities Mask:
That is when things stop working. Likely multicast in that switch is not
working. I'd be curious whether the multicast setup in that switch is
trashed or not. That can be determined with the diag tools. Let me know
if you would like me to document the procedure for this.
There is a pending issue with Sets of PortInfo getting this status back,
which has been discussed on this list. I am not sure whether this is a
related problem.
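
For reference, here is a small sketch of how the status value in the SMP dump
could be unpacked using the common MAD status layout (bit 0 busy, bit 1
redirect, bits 2-4 invalid-field code, upper bits class specific). It rests on
an assumption that is not confirmed anywhere in this thread: that the dump
prints the 16-bit field without byte swapping on a little-endian host, so the
printed 0x1C00 stands for 0x001C. The function names are hypothetical.

```python
# Meaning of the invalid-field code (bits 2-4 of the common MAD status).
INVALID_FIELD_CODES = {
    0: "no invalid fields",
    1: "bad base or class version",
    2: "method not supported",
    3: "method/attribute combination not supported",
    7: "invalid value in attribute or attribute modifier",
}

def swap16(value):
    """Swap the two bytes of a 16-bit value."""
    return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)

def decode_mad_status(status):
    """Split a host-order status value into its common fields."""
    code = (status >> 2) & 0x7
    return {
        "busy": bool(status & 0x1),
        "redirect": bool(status & 0x2),
        "invalid_field_code": code,
        "meaning": INVALID_FIELD_CODES.get(code, "reserved"),
        "class_specific": (status >> 8) & 0x7F,
    }

# ASSUMPTION: the value printed in the dump (0x1C00) is byte-swapped.
decoded = decode_mad_status(swap16(0x1C00))
print(decoded)
```

If the byte-order assumption is wrong and 0x1C00 is already in host order, the
same bits would land in the class-specific part of the field instead, so the
interpretation should be checked against the dump code before relying on it.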
-- Hal