[ofa-general] opensm logoutput
Yevgeny Kliteynik
kliteyn at dev.mellanox.co.il
Thu Feb 19 05:28:06 PST 2009
Bert,
Line.Holen at Sun.COM wrote:
> Hi Bert,
>
> most of these messages indicate that you have unstable links in your
> system.
> But there is one message that can indicate that you've hit a newly
> discovered SM bug:
>
> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for
> node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies)
This message is probably also related to the unstable links (or nodes).
Some port didn't answer a query from the SM (see below), so the SM warns
that there is a port that is not physically down, but whose remote side
of the link couldn't be probed.
> If you do have NEM switches in your system, then you are exposed to this
> bug.
> I hit it quite easily.
>
> Yevgeny Kliteynik posted a patch for this bug just a few minutes after
> you sent your email. (If you are interested, look for the email thread
> "create physp for the newly discovered port of the known node").
Of course, using the patch wouldn't hurt :)
> Line
>
> On 02/17/09 01:23 PM, Wiegers, Bert wrote:
>> Hi,
>>
>> We are using OFED 1.4 with OpenSM 3.2.5_20081207 and a switch from
>> Sun.
>> While debugging our system I'm trying to understand the
>> opensm.log files.
>> (Where can I find documentation on them?)
>>
>>
>> We see frequent messages as follows:
>>
>> Feb 17 10:25:34 134964 [41802940] 0x01 ->
>> __osm_trap_rcv_process_request: Received Generic Notice type:1 num:128
>> (Link state change) Producer:2 (Switch) from LID:111
>> TID:0x000000000000006e
>> Feb 17 10:25:34 169578 [41802940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:1 num:128 (Link state change) from LID:111
>> GID:fe80::14:4fa4:cff8:50
Generic notice num. 128 (trap 128) is issued by the switch (LID 111)
because it detected a port state change on one of its ports. This could
be caused by an unstable link, or by something else. The SM logs that it
received this trap from the switch.
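
To see which switches report link state changes most often, the trap
lines can be tallied per source LID. Here is a minimal Python sketch. It
assumes that each log entry sits on a single line in opensm.log (the
lines quoted here are wrapped by the mail client) and that the message
text matches the lines above. The function name and the sample list are
made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the "Received Generic Notice" lines quoted above.
# Assumption: one log entry per line in the real opensm.log.
TRAP_RE = re.compile(
    r"Received Generic Notice type:(\d+) num:(\d+) .*? from LID:(\d+)"
)


def count_traps_by_lid(lines, trap_num=128):
    """Count received notices with the given number, keyed by source LID."""
    counts = Counter()
    for line in lines:
        m = TRAP_RE.search(line)
        if m and int(m.group(2)) == trap_num:
            counts[int(m.group(3))] += 1
    return counts


# Sample entries copied from the log excerpt, joined onto single lines.
sample = [
    "Feb 17 10:25:34 134964 [41802940] 0x01 -> __osm_trap_rcv_process_request: "
    "Received Generic Notice type:1 num:128 (Link state change) "
    "Producer:2 (Switch) from LID:111 TID:0x000000000000006e",
    "Feb 17 10:25:46 662611 [41802940] 0x01 -> __osm_trap_rcv_process_request: "
    "Received Generic Notice type:1 num:128 (Link state change) "
    "Producer:2 (Switch) from LID:111 TID:0x000000000000006f",
    "Feb 17 10:25:44 013836 [43806940] 0x02 -> SUBNET UP",
]

for lid, n in count_traps_by_lid(sample).most_common():
    print(f"LID {lid}: {n} trap(s) 128")
```

A LID that shows up here again and again is a good first candidate for
checking cables and ports.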
>> Feb 17 10:25:39 088014 [43806940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:3 num:65 (GID out of service) from LID:336
>> GID:fe80::3:ba00:100:3341
The SM can no longer find some port, so it informs the fabric that
this GID is "out of service" by sending notice num. 65.
>> Feb 17 10:25:39 088030 [43806940] 0x02 -> __osm_drop_mgr_remove_port:
>> Removed port with GUID:0x00144fa4cff8000d LID range [1047, 1047] of
>> node:MT25408 ConnectX Mellanox Technologies
LID 1047 is no longer reachable and has been removed from the SM's DB.
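
If ports keep dropping out, it helps to collect every removal the SM
logged. The following is only a sketch: the pattern is taken from the
"Removed port" line quoted above and assumes one entry per line in the
real log. The helper name is hypothetical.

```python
import re

# Pattern derived from the __osm_drop_mgr_remove_port line quoted above.
# Assumption: one log entry per line in the real opensm.log.
REMOVED_RE = re.compile(
    r"Removed port with GUID:(0x[0-9a-fA-F]+) "
    r"LID range \[(\d+), (\d+)\] of node:(.*)$"
)


def removed_ports(lines):
    """Return (guid, first_lid, last_lid, node_description) per removed port."""
    found = []
    for line in lines:
        m = REMOVED_RE.search(line.rstrip())
        if m:
            found.append(
                (m.group(1), int(m.group(2)), int(m.group(3)), m.group(4))
            )
    return found


# Sample entries copied from the log excerpt, joined onto single lines.
sample = [
    "Feb 17 10:25:39 088030 [43806940] 0x02 -> __osm_drop_mgr_remove_port: "
    "Removed port with GUID:0x00144fa4cff8000d LID range [1047, 1047] "
    "of node:MT25408 ConnectX Mellanox Technologies",
    "Feb 17 10:25:44 013836 [43806940] 0x02 -> SUBNET UP",
]

for guid, first, last, desc in removed_ports(sample):
    print(f"{guid} LIDs {first}-{last} ({desc})")
```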
>> Feb 17 10:25:39 614565 [43806940] 0x02 -> osm_ucast_mgr_process: minhop
>> tables configured on all switches
>> Feb 17 10:25:44 013836 [43806940] 0x02 -> SUBNET UP
>> Feb 17 10:25:46 662611 [41802940] 0x01 ->
>> __osm_trap_rcv_process_request: Received Generic Notice type:1 num:128
>> (Link state change) Producer:2 (Switch) from LID:111
>> TID:0x000000000000006f
>> Feb 17 10:25:46 662703 [41802940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:1 num:128 (Link state change) from LID:111
>> GID:fe80::14:4fa4:cff8:50
>> Feb 17 10:25:48 097096 [43806940] 0x02 -> osm_ucast_mgr_process: minhop
>> tables configured on all switches
>> Feb 17 10:25:52 476653 [44007940] 0x01 ->
>> __osm_sm_mad_ctrl_rcv_callback: ERR 3111: Error status = 0x1C00
>> Feb 17 10:25:52 476729 [44007940] 0x01 -> SMP dump:
>> base_ver................0x1
>> mgmt_class..............0x81
>> class_ver...............0x1
>> method..................0x81 (SubnGetResp)
>> D bit...................0x1
>> status..................0x1C00
>> hop_ptr.................0x0
>> hop_count...............0x4
>> trans_id................0x18c08de
>> attr_id.................0x15 (PortInfo)
>> resv....................0x0
>> attr_mod................0x6
>>
>> m_key...................0x0000000000000000
>> dr_slid.................65535
>> dr_dlid.................65535
>>
>> Initial path: 0,1,10,15,23
>> Return path: 0,23,20,12,17
>> Reserved: [0][0][0][0][0][0][0]
>>
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>
>> 00 00 00 00 00 00 00 00 00 00 00 00 11 03 03 02
>>
>> 34 52 00 23 40 40 00 08 08 04 F0 4C 00 00 00 00
>>
>> 00 00 00 00 00 88 00 00 00 00 00 00 00 00 00 00
>>
>>
>>
>>
>> I also see other issues, with messages similar to the following:
>>
>> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for
>> node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies)
>> po
>>
>> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
>> (IB_TIMEOUT)
The above two messages are related. The IB_TIMEOUT says that some MAD
was sent, but no response was received. This, in turn, would cause the
"unknown remote side" message.
Bottom line - there might be unstable ports/links in the fabric.
Check all the links that are reported by the SM as having an unknown
remote side.
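
To get the list of nodes to check, the ERR 3315 lines can be pulled out
of the log and grouped by node GUID. This is a rough Python sketch, not
a polished tool. It assumes one entry per line in the real opensm.log
and relies only on the part of the message that is visible above (node
GUID followed by the node description in parentheses). The function
name is invented for the example.

```python
import re
from collections import Counter

# Pattern derived from the ERR 3315 message quoted above.
# Assumption: one log entry per line in the real opensm.log.
ERR3315_RE = re.compile(
    r"ERR 3315: Unknown remote side for node (0x[0-9a-fA-F]+)\(([^)]*)\)"
)


def unknown_remote_nodes(lines):
    """Count ERR 3315 occurrences per (node GUID, node description)."""
    counts = Counter()
    for line in lines:
        m = ERR3315_RE.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


# Sample entries built from the messages quoted in this thread.
sample = [
    "__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for "
    "node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies)",
    "__osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error "
    "(IB_TIMEOUT)",
]

for (guid, desc), n in unknown_remote_nodes(sample).most_common():
    print(f"{guid} ({desc}): {n} time(s)")
```

To run it on a real log, pass an open file object instead of the sample
list. The location of opensm.log depends on your installation.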
-- Yevgeny
>> osm_vendor_send: ERR 5430: Send p_madw = 0x116d320 of size 256 failed -5
>> (Invalid argument)
>>
>> I'm still googling, but hopefully someone can give me some answers.
>>
>>
>>
>> Thanks and best regards
>> Bert
>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general
>>
>