[ofa-general] Problems with OpenSM from ofed 1.4.1 and MESH topology.

Hal Rosenstock hal.rosenstock at gmail.com
Tue Aug 25 15:04:55 PDT 2009


On 8/24/09, Rafael David Tinoco <Rafael.Tinoco at sun.com> wrote:
>
> Hello,
>
> I'm installing an HPC cluster using 2 Sun Blades 6048 with QNEMs (2 asics
> each, 8 qnems).
> They are configured in a MESH topology.
> I'm using Centos 5.3, OFED 1.4.1 and kernel 2.6.18-128.el5.
>
> I'm booting PXE from IB, my initrd image is bringing up the ib0 interface,
> getting the squashfs image and mounting it with aufs.
>
> The problem is: when booting more than 60 nodes, I start to get the errors
> below on the subnet manager.
> The problem seems to be intermittent, because each time it gives errors
> on a different path.
>
> Any ideas ?
>
> Aug 24 15:36:19 713836 [48D7D940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:64 (GID in service) from LID:1
> GID:fe80::5080:200:8d:9931
> Aug 24 15:36:19 713838 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports:
> Discovered new port with GUID:0x50800200008d9381 LID range [78,78] of
> node:b03n06 HCA-1
> Aug 24 15:36:19 713840 [48D7D940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:64 (GID in service) from LID:1
> GID:fe80::5080:200:8d:9931
> Aug 24 15:36:19 713842 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports:
> Discovered new port with GUID:0x50800200008d4689 LID range [76,76] of
> node:b03n04 HCA-1
> Aug 24 15:36:19 713845 [48D7D940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:64 (GID in service) from LID:1
> GID:fe80::5080:200:8d:9931
> Aug 24 15:36:19 713847 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports:
> Discovered new port with GUID:0x50800200008e5191 LID range [82,82] of
> node:b03n11 HCA-1
> Aug 24 15:36:19 713849 [48D7D940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:64 (GID in service) from LID:1
> GID:fe80::5080:200:8d:9931
> Aug 24 15:36:19 713866 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports:
> Discovered new port with GUID:0x50800200008d94c9 LID range [80,80] of
> node:b03n08 HCA-1
> Aug 24 15:36:19 713869 [48D7D940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:64 (GID in service) from LID:1
> GID:fe80::5080:200:8d:9931
> Aug 24 15:36:19 713871 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports:
> Discovered new port with GUID:0x50800200008daedd LID range [83,83] of
> node:b03n12 HCA-1
> Aug 24 15:36:19 714782 [48D7D940] 0x02 -> SUBNET UP
> Aug 24 15:36:19 714805 [48D7D940] 0x01 ->
> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node
> 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 19.
> Adding to light sweep sampling list
> Aug 24 15:36:19 714812 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
>                 Path = 0,1,15,15,15
> Aug 24 15:36:19 714822 [48D7D940] 0x01 ->
> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node
> 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 21.
> Adding to light sweep sampling list
> Aug 24 15:36:19 714827 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
>                 Path = 0,1,15,15,15
> Aug 24 15:36:19 714831 [48D7D940] 0x01 ->
> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node
> 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 25.
> Adding to light sweep sampling list
> Aug 24 15:36:19 714835 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
>                 Path = 0,1,15,15,15
> Aug 24 15:36:20 514302 [4977E940] 0x01 -> umad_receiver: ERR 5409: send
> completed with error (method=0x1 attr=0x15 trans_id=0x4700036595) --
> dropping
> Aug 24 15:36:20 514321 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP
> Hop Ptr: 0x0
> Aug 24 15:36:20 514328 [4977E940] 0x01 -> Received SMP on a 5 hop path:
>                 Initial path = 0,0,0,0,0,0
>                 Return path  = 0,0,0,0,0,0
> Aug 24 15:36:20 514333 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb:
> ERR 3113: MAD completed in error (IB_TIMEOUT)
> Aug 24 15:36:20 514352 [4977E940] 0x01 -> SMP dump:
>                 base_ver................0x1
>                 mgmt_class..............0x81
>                 class_ver...............0x1
>                 method..................0x1 (SubnGet)
>                 D bit...................0x0
>                 status..................0x0
>                 hop_ptr.................0x0
>                 hop_count...............0x5
>                 trans_id................0x36595
>                 attr_id.................0x15 (PortInfo)
>                 resv....................0x0
>                 attr_mod................0x0
>                 m_key...................0x0000000000000000
>                 dr_slid.................65535
>                 dr_dlid.................65535
>
>                 Initial path: 0,1,15,15,15,19
>                 Return path:  0,0,0,0,0,0
>                 Reserved:     [0][0][0][0][0][0][0]
>
>                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
>                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
>                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
>                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
> Aug 24 15:36:20 514364 [4977E940] 0x01 -> umad_receiver: ERR 5409: send
> completed with error (method=0x1 attr=0x15 trans_id=0x4700036596) --
> dropping
> Aug 24 15:36:20 514367 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP
> Hop Ptr: 0x0
> Aug 24 15:36:20 514372 [4977E940] 0x01 -> Received SMP on a 5 hop path:
>                 Initial path = 0,0,0,0,0,0
>                 Return path  = 0,0,0,0,0,0
> Aug 24 15:36:20 514375 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb:
> ERR 3113: MAD completed in error (IB_TIMEOUT)
> Aug 24 15:36:20 514391 [4977E940] 0x01 -> SMP dump:
>                 base_ver................0x1
>                 mgmt_class..............0x81
>                 class_ver...............0x1
>                 method..................0x1 (SubnGet)
>                 D bit...................0x0
>                 status..................0x0
>                 hop_ptr.................0x0
>                 hop_count...............0x5
>                 trans_id................0x36596
>                 attr_id.................0x15 (PortInfo)
>                 resv....................0x0
> ....
>

These errors are transient, as you indicate. They mean that some node has
brought the link physically up but there is no SMA at the remote side of the
link, so the SM's PortInfo queries over that link time out. The different
paths are the paths to the HCAs. This occurs during PXE boot as the node
transitions from the boot ROM to the Linux environment.
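
If you want to check this against the log itself, here is a rough sketch in
Python. It is not part of OpenSM: the function name summarize and the two
regular expressions are my own assumptions, written against the log format
quoted above. It counts ERR 3315 reports per switch port and the directed
routes of the SMPs that timed out.

```python
import re
from collections import Counter

# ERR 3315 entries name the switch node GUID and the port whose peer is unknown.
ERR_3315 = re.compile(
    r"ERR 3315: Unknown remote side for node\s+(0x[0-9a-fA-F]+)\(.*?\)\s+port\s+(\d+)"
)
# The SMP dump of a failed MAD prints the directed route that was tried
# as "Initial path: ..." (with a colon).
INITIAL_PATH = re.compile(r"Initial path:\s*([\d,]+)")


def summarize(log_text):
    """Return (Counter of (node GUID, port), Counter of directed routes)."""
    # Drop mail-style "> " quoting and rejoin wrapped lines, so that an entry
    # split across several lines is matched as one string.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    unknown_peers = Counter(
        (guid, int(port)) for guid, port in ERR_3315.findall(flat)
    )
    timed_out_paths = Counter(INITIAL_PATH.findall(flat))
    return unknown_peers, timed_out_paths
```

Feed it the text of the OpenSM log, for example
summarize(open(path_to_log).read()), once while the nodes are booting and
again after they are all up. If the same ports and paths stop showing up
once the nodes have finished booting, that would be consistent with the
explanation above.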

Other than these messages, do the end nodes seem to work?

-- Hal
