[openib-general] [ANNOUCEv2] OpenIB OpenSM 1.1.0: trunk now supports 1.8.0 features
Brett Bode
brett at scl.ameslab.gov
Wed Sep 14 07:01:13 PDT 2005
Hal,
Let's see how many of these I can tackle... There are two switches
in the setup, and both are brand new 24-port DDR2 switches from Mellanox
(sorry, I don't know the switch part number off the top of my head). Most
of the NICs are rev a1 parts with fairly recent firmware on them. OpenSM
itself, though, is running on an MT25208 InfiniHost III Ex in a
dual-Opteron box. The node that failed is an 8-way IBM pSeries p655 with:
[ 176.575945] ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23,
2005)
[ 176.674674] ib_mthca: Initializing Mellanox Technologies MT23108
InfiniHost (0001:62:00.0)
[ 176.800432] PCI: Enabling device: (0001:62:00.0), cmd 142
lspci on a matching node gives:
0001:62:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1)
Subsystem: Mellanox Technology MT23108 InfiniHost
Flags: bus master, 66MHz, medium devsel, latency 144, IRQ 201
Memory at d8800000 (64-bit, non-prefetchable) [size=1M]
Memory at d8000000 (64-bit, prefetchable) [size=8M]
Capabilities: <available only to root>
The pSeries nodes have firmware that hides the onboard memory, due to some
issues with Open Firmware... At this point the node that had the issue
is in a weird state: I can still log in and run some commands, but
others fail (lspci hangs), and its routing is a bit screwed up as it
can't see hosts over the Ethernet properly either. I think that is
because it crashed in the ipoib code. Here is the kernel oops:
[2507694.118336] Oops: Kernel access of bad area, sig: 11 [#1]
[2507694.131400] SMP NR_CPUS=8 PSERIES
[2507694.139788] Modules linked in: pvfs2 ib_ipoib ib_sa ib_mthca
ib_mad ib_core
[2507694.156537] NIP: D0000000006216A0 XER: 20000000 LR:
D000000000621680 CTR: C0000000001C5CD8
[2507694.176313] REGS: c0000007fe73f8a0 TRAP: 0300 Not tainted
(2.6.12.3-power4)
[2507694.193613] MSR: 9000000000001032 EE: 0 PR: 0 FP: 0 ME: 1 IR/DR:
11 CR: 24000084
[2507694.211341] DAR: 0000000000000008 DSISR: 0000000042000000
[2507694.224396] TASK: c0000000081e1000[23] 'events/5' THREAD:
c0000007fe73c000 CPU: 5
[2507694.241849] GPR00: D000000000621680 C0000007FE73FB20
D000000000636A98 C00000079D14A2C0
[2507694.261054] GPR04: D00000000062E648 C00000003FA4B080
0000000000000000 0000000000000000
[2507694.280223] GPR08: 0000000000000000 0000000000000000
C00000003FA557D0 C00000077CA81E80
[2507694.299362] GPR12: D0000000006289D0 C00000000052BC00
0000000000000000 0000000000000000
[2507694.318568] GPR16: 0000000000000000 0000000000000000
0000000003A10000 000000000291FE84
[2507694.337756] GPR20: 0000000000000038 0000000003EB8E28
0000000800000000 C0000000041B1AC0
[2507694.356925] GPR24: C00000002CE34380 9000000000009032
C00000003FA557E8 C00000003FA55780
[2507694.376130] GPR28: 0000000000000000 C0000007918252C0
D000000000635F80 C0000007918252C0
[2507694.395650] NIP [d0000000006216a0] .path_free+0x1a8/0x26c
[ib_ipoib]
[2507694.410972] LR [d000000000621680] .path_free+0x188/0x26c [ib_ipoib]
[2507694.426015] Call Trace:
[2507694.432189] [c0000007fe73fb20] [d000000000621680]
.path_free+0x188/0x26c [ib_ipoib] (unreliable)
[2507694.453198] [c0000007fe73fbd0] [d000000000621864]
.ipoib_flush_paths+0x100/0x148 [ib_ipoib]
[2507694.473184] [c0000007fe73fc80] [d0000000006249c0]
.ipoib_ib_dev_down+0x13c/0x194 [ib_ipoib]
[2507694.493149] [c0000007fe73fd20] [d000000000625004]
.ipoib_ib_dev_flush+0x44/0xac [ib_ipoib]
[2507694.512946] [c0000007fe73fdb0] [c00000000005ca0c]
.worker_thread+0x244/0x318
[2507694.529880] [c0000007fe73fee0] [c0000000000630e4]
.kthread+0x154/0x1a4
[2507694.545601] [c0000007fe73ff90] [c000000000013508]
.kernel_thread+0x4c/0x6c
[2507694.562225] Instruction dump:
[2507694.569557] 38630020 419effb8 e89e8008 48007355 e8410028 e93d0020
7fa3eb78 fb890058
[2507694.588139] 60000000 e97d0020 7ffdfb78 e92b00d8 <fb890008>
480072fd e8410028 381f0028
[2507694.607156]
What is the procedure for determining whether the multicast setup on the
switch is trashed?
I suspect that if it is, the crashed node is the cause: I power cycled
the switch yesterday, which seemed to get things working, right up until
I plugged the crashed node back in.
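(For reference, here is a rough sketch of how I would imagine checking the
switch's multicast setup with the infiniband-diags tools. This is only a
guess at the procedure, not the documented one; the LID below is just the
0xE from the log, and option names may vary by tool version.)

    # List the switches on the fabric, with their GUIDs and LIDs,
    # to confirm which LID the suspect switch is using.
    ibswitches

    # Dump that switch's multicast forwarding table (MFT).
    # Assumes the suspect switch is the one at LID 0xE (decimal 14).
    ibroute -M 14

    # Check the state of the port that kept trapping (port 12 in the log).
    smpquery portinfo 14 12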
Brett
On Sep 13, 2005, at 9:12 PM, Hal Rosenstock wrote:
> Hi Troy,
>
> On Tue, 2005-09-13 at 20:12, Troy Benjegerdes wrote:
>
> Here is my analysis of the log you provided. I need to do a little more
> digging. I am curious as to the switch type and firmware versions of
> that switch and the failed HCA.
>
>> At the log entry 'Sep 13 12:06:55', I plugged in the node that is
>> hung/crashed... which caused a bunch of opensm errors. I have since
>> unplugged that node, and can put it back in tomorrow if you want more
>> debug info.
>
> At that point in time, we see the following:
>
> Sep 13 12:06:55 936933 [417FF970] -> __osm_trap_rcv_process_request:
> Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E
> TID:0x0000000000000013
> Sep 13 12:06:55 937087 [417FF970] -> osm_report_notice: Reporting
> Generic Notice type:1 num:128 from LID:0x0000
> GID:0xfe80000000000000,0x0002c90200402915
> Sep 13 12:06:56 354422 [42FFF970] -> umad_receiver: ERR 5409: send
> completed with error (method=1 attr=11) -- dropping.
> Sep 13 12:06:56 354439 [42FFF970] -> umad_receiver: ERR 5411: DR SMP
> hop ptr 0 hop count 3 DR SLID 0x0 DR DLID 0x0
> Sep 13 12:06:56 354449 [42FFF970] -> __osm_sm_mad_ctrl_send_err_cb:
> ERR 3113: MAD completed in error (IB_TIMEOUT).
>
> Trap 128 is an urgent "link state of switch port changed" trap.
> It looks like a solicited send failed (SubnGet NodeInfo). We had an
> exchange about this a while ago on the list regarding an unresponsive
> port.
>
> Sep 13 12:06:56 363771 [40FFF970] -> osm_drop_mgr_process: ERR 0108:
> Unknown remote side for node 0x0002c90200402915 port 12. Adding to
> light sweep sampling list.
> Sep 13 12:06:56 363815 [40FFF970] -> Directed Path Dump of 2 hop path:
> Path = [0][1][D]
>
> The DR display is showing the path to the switch. The dump of the SMP
> shows:
> hop_ptr.................0x0
> hop_count...............0x3
> Initial path: [0][1][D][C]
> Also, the GUID cited is an HCA GUID rather than a switch GUID so I
> doubt
> it has 12 ports. I think these are just problems with the debug
> messages.
>
> Earlier in the log:
>
> Sep 13 12:03:51 959970 [417FF970] -> osm_report_notice: Reporting
> Generic Notice type:3 num:64 from LID:0x0001
> GID:0xfe80000000000000,0x0002c90200402781
> Sep 13 12:03:51 959986 [417FF970] -> Discovered new port with
> GUID:0x0002c90200402915 LID range [0xE,0xE] of node:MT47396
> Infiniscale-III Mellanox Technologies
> Sep 13 12:03:51 959996 [417FF970] -> osm_report_notice: Reporting
> Generic Notice type:3 num:64 from LID:0x0001
> GID:0xfe80000000000000,0x0002c90200402781
>
> It appears that the failed node is an MT47396 off switch
> 0x0002c90200402781.
> What firmware version is running in both of these? What is switch
> 0x0002c90200402781?
>
> A minor issue but the DR display above is not correct. The dump of the
> SMP shows:
> hop_ptr.................0x0
> hop_count...............0x3
> Initial path: [0][1][D][C]
>
> It seems to repeat this over and over again, every few seconds, until
> things break (I presume at 12:07:57).
>
> The key to me is that OpenSM continues to receive:
> Sep 13 12:07:23 542642 [40FFF970] -> __osm_trap_rcv_process_request:
> Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E
> TID:0x000000000000002c
> Sep 13 12:07:23 542771 [40FFF970] -> osm_report_notice: Reporting
> Generic Notice type:1 num:128 from LID:0x0000
> GID:0xfe80000000000000,0x0002c90200402915
>
> Either OpenSM never shuts this off or it keeps bouncing the port in the
> light sweep. I need to investigate this further.
>
> It all ends when:
> Sep 13 12:07:56 574831 [40FFF970] -> __osm_trap_rcv_process_request:
> Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E
> TID:0x0000000000000057
> Sep 13 12:07:56 574961 [40FFF970] -> osm_report_notice: Reporting
> Generic Notice type:1 num:128 from LID:0x0000
> GID:0xfe80000000000000,0x0002c90200402915
> Sep 13 12:07:56 719968 [417FF970] -> __osm_trap_rcv_process_request:
> Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E
> TID:0x0000000000000058
> Sep 13 12:07:56 720052 [417FF970] -> osm_report_notice: Reporting
> Generic Notice type:1 num:128 from LID:0x0000
> GID:0xfe80000000000000,0x0002c90200402915
>
> and then that switch returns a bad status in an SM GetResp PortInfo (in
> response to an SM Set PortInfo):
>
> Sep 13 12:07:57 005832 [42FFF970] -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x81
> (SubnGetResp)
> D bit...................0x1
> status..................0x1C00
> hop_ptr.................0x0
> hop_count...............0x2
> trans_id................0x455a
> attr_id.................0x15 (PortInfo)
> resv....................0x0
> attr_mod................0xC
>
> m_key...................0x0000000000000000
> dr_slid.................0xFFFF
> dr_dlid.................0xFFFF
>
> Initial path: [0][1][D]
> Return path: [0][1][18]
> Reserved: [0][0][0][0][0][0][0]
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 18 03 03 02
>
> 31 22 00 13 40 40 00 08 08 04 F2 40
> 00 00 00 00
>
> 00 00 00 00 00 88 00 00 00 00 00 00
> 00 00 00 00
>
> Sep 13 12:07:57 005891 [40FFF970] -> osm_pi_rcv_process_set: ERR 0F10:
> Received Error Status for SetResp()
> Sep 13 12:07:57 005908 [40FFF970] -> PortInfo dump:
> port number.............0xC
>
> node_guid...............0x0002c90200402915
>
> port_guid...............0x0002c90200402915
>
> m_key...................0x0000000000000000
>
> subnet_prefix...........0x0000000000000000
> base_lid................0x0
> master_sm_base_lid......0x0
> capability_mask.........0x0
> diag_code...............0x0
> m_key_lease_period......0x0
> local_port_num..........0x18
> link_width_enabled......0x3
> link_width_supported....0x3
> link_width_active.......0x2
> link_speed_supported....0x3
> port_state..............DOWN
> state_info2.............0x22
> m_key_protect_bits......0x0
> lmc.....................0x0
> link_speed..............0x13
> mtu_smsl................0x40
> vl_cap_init_type........0x40
> vl_high_limit...........0x0
> vl_arb_high_cap.........0x8
> vl_arb_low_cap..........0x8
> init_rep_mtu_cap........0x4
> vl_stall_life...........0xF2
> vl_enforce..............0x40
> m_key_violations........0x0
> p_key_violations........0x0
> q_key_violations........0x0
> guid_cap................0x0
> subnet_timeout..........0x0
> resp_time_value.........0x0
> error_threshold.........0x88
> Sep 13 12:07:57 005951 [40FFF970] -> Capabilities Mask:
>
> That is when things stop working. Likely multicast in that switch is
> not
> working. I'd be curious whether the multicast setup in that switch is
> trashed or not. That can be determined with the diag tools. Let me know
> if you would like me to document the procedure for this.
>
> There is a pending issue, which has come up on this list, with Sets of
> PortInfo getting this status back. I'm not sure whether this is a
> related problem or not.
>
> -- Hal
>
>