[ofa-general] opensm logoutput

Line.Holen at Sun.COM
Thu Feb 19 03:42:00 PST 2009


Hi Bert,

Most of these messages indicate that you have unstable links in your
system.
But one message may indicate that you've hit a newly discovered SM
bug:

__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for
node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies)
po

If you do have NEM switches in your system, then you are exposed to this 
bug.
I hit it quite easily.

Yevgeny Kliteynik posted a patch for this bug just a few minutes after
you sent your email. (If you are interested, look for the email thread
"create physp for the newly discovered port of the known node".)
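
To narrow down which links are unstable, the "Link state change" traps
(notice num:128) in opensm.log can be tallied per reporting LID. The
following is only a minimal sketch and not part of OpenSM: the function
name count_link_state_traps and the regular expression are assumptions
based on the message format shown in the quoted log below. Note that
trap 128 is sent by the switch, so the LID identifies the switch that
saw a port change state, not the flapping port itself.

```python
import re
from collections import Counter

# Assumed message format, taken from the quoted log in this thread:
#   ... Received Generic Notice type:1 num:128 (Link state change)
#   Producer:2 (Switch) from LID:111 TID:0x...
# Only "Received" lines are matched, so the follow-up
# "osm_report_notice: Reporting ..." lines are not counted twice.
TRAP_RE = re.compile(
    r"Received Generic Notice type:(\d+) num:(\d+).*?from LID:(\d+)"
)


def count_link_state_traps(log_text):
    """Return a Counter mapping reporting LID -> number of trap 128 notices."""
    # Collapse all whitespace so that entries wrapped over several
    # lines (as in this email) are matched the same way as the
    # single-line entries of a real log file.
    joined = " ".join(log_text.split())
    counts = Counter()
    for _notice_type, num, lid in TRAP_RE.findall(joined):
        if num == "128":  # 128 = link state change
            counts[int(lid)] += 1
    return counts
```

Called as count_link_state_traps(open("opensm.log").read()) (the path
is whatever your installation uses), it returns the switch LIDs that
report the most link state changes, which is a reasonable place to
start checking cables and ports.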

Line

On 02/17/09 01:23 PM, Wiegers, Bert wrote:
> Hi,
>
> we are using OFED 1.4 with OpenSM 3.2.5_20081207 and a switch from
> Sun.
> While debugging our system, I'm trying to understand the opensm.log
> files.
> (Where can I find documentation for them?)
>
>
> We see frequent messages as follows:
>
> Feb 17 10:25:34 134964 [41802940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:1 num:128
> (Link state change) Producer:2 (Switch) from LID:111
> TID:0x000000000000006e
> Feb 17 10:25:34 169578 [41802940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:1 num:128 (Link state change) from LID:111
> GID:fe80::14:4fa4:cff8:50
> Feb 17 10:25:39 088014 [43806940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:65 (GID out of service) from LID:336
> GID:fe80::3:ba00:100:3341
> Feb 17 10:25:39 088030 [43806940] 0x02 -> __osm_drop_mgr_remove_port:
> Removed port with GUID:0x00144fa4cff8000d LID range [1047, 1047] of
> node:MT25408 ConnectX Mellanox Technologies
> Feb 17 10:25:39 614565 [43806940] 0x02 -> osm_ucast_mgr_process: minhop
> tables configured on all switches
> Feb 17 10:25:44 013836 [43806940] 0x02 -> SUBNET UP
> Feb 17 10:25:46 662611 [41802940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:1 num:128
> (Link state change) Producer:2 (Switch) from LID:111
> TID:0x000000000000006f
> Feb 17 10:25:46 662703 [41802940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:1 num:128 (Link state change) from LID:111
> GID:fe80::14:4fa4:cff8:50
> Feb 17 10:25:48 097096 [43806940] 0x02 -> osm_ucast_mgr_process: minhop
> tables configured on all switches
> Feb 17 10:25:52 476653 [44007940] 0x01 ->
> __osm_sm_mad_ctrl_rcv_callback: ERR 3111: Error status = 0x1C00
> Feb 17 10:25:52 476729 [44007940] 0x01 -> SMP dump:
>                                 base_ver................0x1
>                                 mgmt_class..............0x81
>                                 class_ver...............0x1
>                                 method..................0x81 (SubnGetResp)
>                                 D bit...................0x1
>                                 status..................0x1C00
>                                 hop_ptr.................0x0
>                                 hop_count...............0x4
>                                 trans_id................0x18c08de
>                                 attr_id.................0x15 (PortInfo)
>                                 resv....................0x0
>                                 attr_mod................0x6
>                                 m_key...................0x0000000000000000
>                                 dr_slid.................65535
>                                 dr_dlid.................65535
>
>                                 Initial path: 0,1,10,15,23
>                                 Return path:  0,23,20,12,17
>                                 Reserved:     [0][0][0][0][0][0][0]
>
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
>                                 00 00 00 00 00 00 00 00   00 00 00 00 11 03 03 02
>
>                                 34 52 00 23 40 40 00 08   08 04 F0 4C 00 00 00 00
>
>                                 00 00 00 00 00 88 00 00   00 00 00 00 00 00 00 00
>
>
>
>
> I also see other issues, with messages similar to the following:
>
> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for
> node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies)
> po
>
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
>
> osm_vendor_send: ERR 5430: Send p_madw = 0x116d320 of size 256 failed -5
> (Invalid argument)
>
>
> I'm still googling, but hopefully someone can give me some answers.
>
>
>
> Thanks and best regards
> Bert



