[ofa-general] Problems with OpenSM from ofed 1.4.1 and MESH topology.

Hal Rosenstock hal.rosenstock at gmail.com
Wed Aug 26 08:23:53 PDT 2009


Hi Rafael,

On 8/25/09, Rafael David Tinoco <Rafael.Tinoco at sun.com> wrote:
>
> Hello Hal,
>
> Below...
>
> Hal Rosenstock wrote:
>
>
>
> On 8/24/09, Rafael David Tinoco <Rafael.Tinoco at sun.com> wrote:
>>
>> Hello,
>>
>> I'm installing an HPC cluster using 2 Sun Blade 6048 chassis with QNEMs
>> (2 ASICs each, 8 QNEMs).
>> They are configured in a MESH topology.
>> I'm using CentOS 5.3, OFED 1.4.1 and kernel 2.6.18-128.el5.
>>
>> I'm PXE booting over IB; my initrd image brings up the ib0 interface,
>> fetches the squashfs image and mounts it with aufs.
>>
>> The problem is that when booting more than 60 nodes, I start to get the
>> errors below on the subnet manager.
>> The problem seems to be intermittent, because each time the errors appear
>> on a different path.
>>
>> Any ideas ?
>>
>> Aug 24 15:36:19 713836 [48D7D940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:3 num:64 (GID in service) from LID:1
>> GID:fe80::5080:200:8d:9931
>> Aug 24 15:36:19 713838 [48D7D940] 0x02 ->
>> __osm_state_mgr_report_new_ports: Discovered new port with
>> GUID:0x50800200008d9381 LID range [78,78] of node:b03n06 HCA-1
>> Aug 24 15:36:19 713840 [48D7D940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:3 num:64 (GID in service) from LID:1
>> GID:fe80::5080:200:8d:9931
>> Aug 24 15:36:19 713842 [48D7D940] 0x02 ->
>> __osm_state_mgr_report_new_ports: Discovered new port with
>> GUID:0x50800200008d4689 LID range [76,76] of node:b03n04 HCA-1
>> Aug 24 15:36:19 713845 [48D7D940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:3 num:64 (GID in service) from LID:1
>> GID:fe80::5080:200:8d:9931
>> Aug 24 15:36:19 713847 [48D7D940] 0x02 ->
>> __osm_state_mgr_report_new_ports: Discovered new port with
>> GUID:0x50800200008e5191 LID range [82,82] of node:b03n11 HCA-1
>> Aug 24 15:36:19 713849 [48D7D940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:3 num:64 (GID in service) from LID:1
>> GID:fe80::5080:200:8d:9931
>> Aug 24 15:36:19 713866 [48D7D940] 0x02 ->
>> __osm_state_mgr_report_new_ports: Discovered new port with
>> GUID:0x50800200008d94c9 LID range [80,80] of node:b03n08 HCA-1
>> Aug 24 15:36:19 713869 [48D7D940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:3 num:64 (GID in service) from LID:1
>> GID:fe80::5080:200:8d:9931
>> Aug 24 15:36:19 713871 [48D7D940] 0x02 ->
>> __osm_state_mgr_report_new_ports: Discovered new port with
>> GUID:0x50800200008daedd LID range [83,83] of node:b03n12 HCA-1
>> Aug 24 15:36:19 714782 [48D7D940] 0x02 -> SUBNET UP
>> Aug 24 15:36:19 714805 [48D7D940] 0x01 ->
>> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node
>> 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 19.
>> Adding to light sweep sampling list
>> Aug 24 15:36:19 714812 [48D7D940] 0x01 -> Directed Path Dump of 4 hop
>> path:
>>                 Path = 0,1,15,15,15
>> Aug 24 15:36:19 714822 [48D7D940] 0x01 ->
>> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node
>> 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 21.
>> Adding to light sweep sampling list
>> Aug 24 15:36:19 714827 [48D7D940] 0x01 -> Directed Path Dump of 4 hop
>> path:
>>                 Path = 0,1,15,15,15
>> Aug 24 15:36:19 714831 [48D7D940] 0x01 ->
>> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node
>> 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 25.
>> Adding to light sweep sampling list
>> Aug 24 15:36:19 714835 [48D7D940] 0x01 -> Directed Path Dump of 4 hop
>> path:
>>                 Path = 0,1,15,15,15
>> Aug 24 15:36:20 514302 [4977E940] 0x01 -> umad_receiver: ERR 5409: send
>> completed with error (method=0x1 attr=0x15 trans_id=0x4700036595) --
>> dropping
>> Aug 24 15:36:20 514321 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP
>> Hop Ptr: 0x0
>> Aug 24 15:36:20 514328 [4977E940] 0x01 -> Received SMP on a 5 hop path:
>>                 Initial path = 0,0,0,0,0,0
>>                 Return path  = 0,0,0,0,0,0
>> Aug 24 15:36:20 514333 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb:
>> ERR 3113: MAD completed in error (IB_TIMEOUT)
>> Aug 24 15:36:20 514352 [4977E940] 0x01 -> SMP dump:
>>                 base_ver................0x1
>>                 mgmt_class..............0x81
>>                 class_ver...............0x1
>>                 method..................0x1 (SubnGet)
>>                 D bit...................0x0
>>                 status..................0x0
>>                 hop_ptr.................0x0
>>                 hop_count...............0x5
>>                 trans_id................0x36595
>>                 attr_id.................0x15 (PortInfo)
>>                 resv....................0x0
>>                 attr_mod................0x0
>>                 m_key...................0x0000000000000000
>>                 dr_slid.................65535
>>                 dr_dlid.................65535
>>
>>                 Initial path: 0,1,15,15,15,19
>>                 Return path:  0,0,0,0,0,0
>>                 Reserved:     [0][0][0][0][0][0][0]
>>
>>                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>>
>>                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>>
>>                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>>
>>                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>>
>> Aug 24 15:36:20 514364 [4977E940] 0x01 -> umad_receiver: ERR 5409: send
>> completed with error (method=0x1 attr=0x15 trans_id=0x4700036596) --
>> dropping
>> Aug 24 15:36:20 514367 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP
>> Hop Ptr: 0x0
>> Aug 24 15:36:20 514372 [4977E940] 0x01 -> Received SMP on a 5 hop path:
>>                 Initial path = 0,0,0,0,0,0
>>                 Return path  = 0,0,0,0,0,0
>> Aug 24 15:36:20 514375 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb:
>> ERR 3113: MAD completed in error (IB_TIMEOUT)
>> Aug 24 15:36:20 514391 [4977E940] 0x01 -> SMP dump:
>>                 base_ver................0x1
>>                 mgmt_class..............0x81
>>                 class_ver...............0x1
>>                 method..................0x1 (SubnGet)
>>                 D bit...................0x0
>>                 status..................0x0
>>                 hop_ptr.................0x0
>>                 hop_count...............0x5
>>                 trans_id................0x36596
>>                 attr_id.................0x15 (PortInfo)
>>                 resv....................0x0
>> ....
>>
>
> These errors are transient as you indicate. They mean that some node has
> brought the link physically up but there is no SMA at the remote side of the
> link. The different paths are paths to the HCAs. This occurs during PXE boot
> as the node transitions from the boot ROM to the Linux environment.
>
>
> They are transient, but sometimes opensm hangs with the same message and
> keeps looping these error messages.
>

Are you sure OpenSM hangs ? If so, any idea where ?
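
One way to tell a real hang from transient errors that simply recur while nodes are still in the boot ROM is to tally the ERR 3315 lines per switch port and look at when each was last reported. Below is a minimal sketch in Python. The regular expression is an assumption based only on the log lines quoted above, and the function name tally_unknown_remote is made up for illustration; it is not part of OpenSM.

```python
import re
from collections import Counter

# Pattern derived from the ERR 3315 lines quoted in this thread. Other
# OpenSM versions may format these lines differently (assumption).
ERR_3315 = re.compile(
    r"(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) .*ERR 3315: Unknown remote side "
    r"for node (?P<guid>0x[0-9a-fA-F]+)\(.*?\) port (?P<port>\d+)"
)


def tally_unknown_remote(lines):
    """Count ERR 3315 reports per (switch GUID, port).

    Returns (counts, seen_at): counts maps (guid, port) to the number of
    reports, seen_at maps (guid, port) to the list of timestamps seen.
    Each log record must be on a single line, as in the real log file
    (not wrapped as in this email).
    """
    counts = Counter()
    seen_at = {}
    for line in lines:
        m = ERR_3315.search(line)
        if m is None:
            continue  # not an "Unknown remote side" record
        key = (m.group("guid"), int(m.group("port")))
        counts[key] += 1
        seen_at.setdefault(key, []).append(m.group("ts"))
    return counts, seen_at
```

To use it, pass an open log file (for example the file OpenSM is configured to log to) as the lines argument. If the same ports keep being reported long after the nodes should have finished booting, that suggests those links never got a working SMA; if the set of ports changes from sweep to sweep, the errors are more likely the transient kind.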

> First I was using the CentOS 5.3 kernel with updates, and IPoIB stopped
> working after these messages.
>

Any specifics ?

> Using the "vanilla" CentOS 5.3 kernel solved this issue.
> But SOMETIMES, when booting the nodes, these messages appear and don't go away.
>

In those cases, do the nodes successfully boot up ?


> Other than these messages, do things seem to work in terms of the end
> nodes ?
>
> They seem to work with vanilla kernel. Even with the messages, no problems
> reaching the nodes so far.
>

Do your ULPs work (like IPoIB, etc.) ?

-- Hal

> Tks
>
> Rafael Tinoco
>
>
> -- Hal
>