[openib-general] [ANNOUCEv2] OpenIB OpenSM 1.1.0: trunk now supports 1.8.0 features
Brett Bode
brett at scl.ameslab.gov
Wed Sep 14 07:01:13 PDT 2005
Hal,
Let's see how many of these I can tackle... There are two switches
in the setup, and both are brand new 24-port DDR2 switches from Mellanox
(sorry, I don't know the switch part number off the top of my head). Most
of the NICs are rev a1 parts with fairly recent firmware on them. OpenSM
itself, though, is running on an MT25208 InfiniHost III Ex in a
dual-Opteron box. The node that failed is an 8-way IBM pSeries p655 with:
[ 176.575945] ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23,
2005)
[ 176.674674] ib_mthca: Initializing Mellanox Technologies MT23108
InfiniHost (0001:62:00.0)
[ 176.800432] PCI: Enabling device: (0001:62:00.0), cmd 142
lspci on a matching node gives:
0001:62:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1)
Subsystem: Mellanox Technology MT23108 InfiniHost
Flags: bus master, 66MHz, medium devsel, latency 144, IRQ 201
Memory at d8800000 (64-bit, non-prefetchable) [size=1M]
Memory at d8000000 (64-bit, prefetchable) [size=8M]
Capabilities: <available only to root>
The pSeries nodes have firmware that hides the onboard memory, due to some
issues with Open Firmware... At this point the node that had the issue
is in a weird state: I can still log in and run some commands, but
others fail (lspci hangs), and its routing is a bit screwed up as it
can't see hosts over the Ethernet properly either. I think that is
because it crashed in the ipoib code. Here is the kernel oops:
[2507694.118336] Oops: Kernel access of bad area, sig: 11 [#1]
[2507694.131400] SMP NR_CPUS=8 PSERIES
[2507694.139788] Modules linked in: pvfs2 ib_ipoib ib_sa ib_mthca
ib_mad ib_core
[2507694.156537] NIP: D0000000006216A0 XER: 20000000 LR:
D000000000621680 CTR: C0000000001C5CD8
[2507694.176313] REGS: c0000007fe73f8a0 TRAP: 0300 Not tainted
(2.6.12.3-power4)
[2507694.193613] MSR: 9000000000001032 EE: 0 PR: 0 FP: 0 ME: 1 IR/DR:
11 CR: 24000084
[2507694.211341] DAR: 0000000000000008 DSISR: 0000000042000000
[2507694.224396] TASK: c0000000081e1000[23] 'events/5' THREAD:
c0000007fe73c000 CPU: 5
[2507694.241849] GPR00: D000000000621680 C0000007FE73FB20
D000000000636A98 C00000079D14A2C0
[2507694.261054] GPR04: D00000000062E648 C00000003FA4B080
0000000000000000 0000000000000000
[2507694.280223] GPR08: 0000000000000000 0000000000000000
C00000003FA557D0 C00000077CA81E80
[2507694.299362] GPR12: D0000000006289D0 C00000000052BC00
0000000000000000 0000000000000000
[2507694.318568] GPR16: 0000000000000000 0000000000000000
0000000003A10000 000000000291FE84
[2507694.337756] GPR20: 0000000000000038 0000000003EB8E28
0000000800000000 C0000000041B1AC0
[2507694.356925] GPR24: C00000002CE34380 9000000000009032
C00000003FA557E8 C00000003FA55780
[2507694.376130] GPR28: 0000000000000000 C0000007918252C0
D000000000635F80 C0000007918252C0
[2507694.395650] NIP [d0000000006216a0] .path_free+0x1a8/0x26c
[ib_ipoib]
[2507694.410972] LR [d000000000621680] .path_free+0x188/0x26c [ib_ipoib]
[2507694.426015] Call Trace:
[2507694.432189] [c0000007fe73fb20] [d000000000621680]
.path_free+0x188/0x26c [ib_ipoib] (unreliable)
[2507694.453198] [c0000007fe73fbd0] [d000000000621864]
.ipoib_flush_paths+0x100/0x148 [ib_ipoib]
[2507694.473184] [c0000007fe73fc80] [d0000000006249c0]
.ipoib_ib_dev_down+0x13c/0x194 [ib_ipoib]
[2507694.493149] [c0000007fe73fd20] [d000000000625004]
.ipoib_ib_dev_flush+0x44/0xac [ib_ipoib]
[2507694.512946] [c0000007fe73fdb0] [c00000000005ca0c]
.worker_thread+0x244/0x318
[2507694.529880] [c0000007fe73fee0] [c0000000000630e4]
.kthread+0x154/0x1a4
[2507694.545601] [c0000007fe73ff90] [c000000000013508]
.kernel_thread+0x4c/0x6c
[2507694.562225] Instruction dump:
[2507694.569557] 38630020 419effb8 e89e8008 48007355 e8410028 e93d0020
7fa3eb78 fb890058
[2507694.588139] 60000000 e97d0020 7ffdfb78 e92b00d8 <fb890008>
480072fd e8410028 381f0028
[2507694.607156]
What is the procedure for determining whether the multicast setup on the
switch is trashed?
I suspect that if it is, the crashed node is the cause: I power cycled
the switch yesterday, which seemed to get things working, right up until
I plugged the crashed node back in.
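(For reference, here is a rough sketch of how I would imagine checking the
switch's multicast setup with the infiniband-diags tools. This is only a
guess at the procedure, not the documented one; the LID below is just the
0xE from the log, and option names may vary by tool version.)

    # List the switches on the fabric, with their GUIDs and LIDs,
    # to confirm which LID the suspect switch is using.
    ibswitches

    # Dump that switch's multicast forwarding table (MFT).
    # Assumes the suspect switch is the one at LID 0xE (decimal 14).
    ibroute -M 14

    # Check the state of the port that kept trapping (port 12 in the log).
    smpquery portinfo 14 12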
Brett
On Sep 13, 2005, at 9:12 PM, Hal Rosenstock wrote:
> Hi Troy,
>
> On Tue, 2005-09-13 at 20:12, Troy Benjegerdes wrote:
>
> Here is my analysis of the log you provided. I need to do a little more
> digging. I am curious as to the switch type and firmware versions of
> that switch and the failed HCA.
>
>> At the log entry 'Sep 13 12:06:55', I plugged in the node that is
>> hung/crashed... which caused a bunch of opensm errors. I have since
>> unplugged that node, and can put it back in tomorrow if you want more
>> debug info.
>
> At that point in time, we see the following:
>
> Sep 13 12:06:55 936933 [417FF970] -> __osm_trap_rcv_process_request:
> Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E
> TID:0x0000000000000013
> Sep 13 12:06:55 937087 [417FF970] -> osm_report_notice: Reporting
> Generic Notice type:1 num:128 from LID:0x0000
> GID:0xfe80000000000000,0x0002c90200402915
> Sep 13 12:06:56 354422 [42FFF970] -> umad_receiver: ERR 5409: send
> completed with error (method=1 attr=11) -- dropping.
> Sep 13 12:06:56 354439 [42FFF970] -> umad_receiver: ERR 5411: DR SMP
> hop ptr 0 hop count 3 DR SLID 0x0 DR DLID 0x0
> Sep 13 12:06:56 354449 [42FFF970] -> __osm_sm_mad_ctrl_send_err_cb:
> ERR 3113: MAD completed in error (IB_TIMEOUT).
>
> Trap 128 is an urgent "link state of switch port changed" trap.
> It looks like a solicited send failed (SubnGet NodeInfo). We had an
> exchange about this a while ago on the list regarding an unresponsive
> port.
>
> Sep 13 12:06:56 363771 [40FFF970] -> osm_drop_mgr_process: ERR 0108:
> Unknown remote side for node 0x0002c90200402915 port 12. Adding to
> light sweep sampling list.
> Sep 13 12:06:56 363815 [40FFF970] -> Directed Path Dump of 2 hop path:
> Path = [0][1][D]
>
> The DR display is showing the path to the switch. The dump of the SMP
> shows:
> hop_ptr.................0x0
> hop_count...............0x3
> Initial path: [0][1][D][C]
> Also, the GUID cited is an HCA GUID rather than a switch GUID so I
> doubt
> it has 12 ports. I think these are just problems with the debug
> messages.
>
> Earlier in the log:
>
> Sep 13 12:03:51 959970 [417FF970] -> osm_report_notice: Reporting
> Generic Notice type:3 num:64 from LID:0x0001
> GID:0xfe80000000000000,0x0002c90200402781
> Sep 13 12:03:51 959986 [417FF970] -> Discovered new port with
> GUID:0x0002c90200402915 LID range [0xE,0xE] of node:MT47396
> Infiniscale-III Mellanox Technologies
> Sep 13 12:03:51 959996 [417FF970] -> osm_report_notice: Reporting
> Generic Notice type:3 num:64 from LID:0x0001
> GID:0xfe80000000000000,0x0002c90200402781
>
> It appears that the failed node is an MT47396 off switch
> 0x0002c90200402781.
> What firmware version is running in both of these? What is switch
> 0x0002c90200402781?
>
> A minor issue but the DR display above is not correct. The dump of the
> SMP shows:
> hop_ptr.................0x0
> hop_count...............0x3
> Initial path: [0][1][D][C]
>
> It seems to repeat this over and over again, every few seconds, until
> things break (I presume at 12:07:57).
>
> The key to me is that OpenSM continues to receive:
> Sep 13 12:07:23 542642 [40FFF970] -> __osm_trap_rcv_process_request:
> Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E
> TID:0x000000000000002c
> Sep 13 12:07:23 542771 [40FFF970] -> osm_report_notice: Reporting
> Generic Notice type:1 num:128 from LID:0x0000
> GID:0xfe80000000000000,0x0002c90200402915
>
> Either OpenSM never shuts this off or it keeps bouncing the port in the
> light sweep. I need to investigate this further.
>
> It all ends when:
> Sep 13 12:07:56 574831 [40FFF970] -> __osm_trap_rcv_process_request:
> Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E
> TID:0x0000000000000057
> Sep 13 12:07:56 574961 [40FFF970] -> osm_report_notice: Reporting
> Generic Notice type:1 num:128 from LID:0x0000
> GID:0xfe80000000000000,0x0002c90200402915
> Sep 13 12:07:56 719968 [417FF970] -> __osm_trap_rcv_process_request:
> Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000E
> TID:0x0000000000000058
> Sep 13 12:07:56 720052 [417FF970] -> osm_report_notice: Reporting
> Generic Notice type:1 num:128 from LID:0x0000
> GID:0xfe80000000000000,0x0002c90200402915
>
> and then that switch returns a bad status in an SM GetResp PortInfo (in
> response to an SM Set PortInfo):
>
> Sep 13 12:07:57 005832 [42FFF970] -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x81
> (SubnGetResp)
> D bit...................0x1
> status..................0x1C00
> hop_ptr.................0x0
> hop_count...............0x2
> trans_id................0x455a
> attr_id.................0x15 (PortInfo)
> resv....................0x0
> attr_mod................0xC
>
> m_key...................0x0000000000000000
> dr_slid.................0xFFFF
> dr_dlid.................0xFFFF
>
> Initial path: [0][1][D]
> Return path: [0][1][18]
> Reserved: [0][0][0][0][0][0][0]
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 18 03 03 02
>
> 31 22 00 13 40 40 00 08 08 04 F2 40
> 00 00 00 00
>
> 00 00 00 00 00 88 00 00 00 00 00 00
> 00 00 00 00
>
> Sep 13 12:07:57 005891 [40FFF970] -> osm_pi_rcv_process_set: ERR 0F10:
> Received Error Status for SetResp()
> Sep 13 12:07:57 005908 [40FFF970] -> PortInfo dump:
> port number.............0xC
>
> node_guid...............0x0002c90200402915
>
> port_guid...............0x0002c90200402915
>
> m_key...................0x0000000000000000
>
> subnet_prefix...........0x0000000000000000
> base_lid................0x0
> master_sm_base_lid......0x0
> capability_mask.........0x0
> diag_code...............0x0
> m_key_lease_period......0x0
> local_port_num..........0x18
> link_width_enabled......0x3
> link_width_supported....0x3
> link_width_active.......0x2
> link_speed_supported....0x3
> port_state..............DOWN
> state_info2.............0x22
> m_key_protect_bits......0x0
> lmc.....................0x0
> link_speed..............0x13
> mtu_smsl................0x40
> vl_cap_init_type........0x40
> vl_high_limit...........0x0
> vl_arb_high_cap.........0x8
> vl_arb_low_cap..........0x8
> init_rep_mtu_cap........0x4
> vl_stall_life...........0xF2
> vl_enforce..............0x40
> m_key_violations........0x0
> p_key_violations........0x0
> q_key_violations........0x0
> guid_cap................0x0
> subnet_timeout..........0x0
> resp_time_value.........0x0
> error_threshold.........0x88
> Sep 13 12:07:57 005951 [40FFF970] -> Capabilities Mask:
>
> That is when things stop working. Likely multicast in that switch is
> not
> working. I'd be curious whether the multicast setup in that switch is
> trashed or not. That can be determined with the diag tools. Let me know
> if you would like me to document the procedure for this.
>
> There is a pending issue, which has come up on this list, with Sets of
> PortInfo getting this status back. I'm not sure whether this is a
> related problem or not.
>
> -- Hal
>
>