[ofa-general] Problems with OpenSM from ofed 1.4.1 and MESH topology.

Rafael David Tinoco Rafael.Tinoco at Sun.COM
Mon Aug 24 11:46:04 PDT 2009


Hello,

I'm installing an HPC cluster using 2 Sun Blade 6048 systems with QNEMs 
(2 ASICs each, 8 QNEMs).
They are configured in a mesh topology.
I'm using CentOS 5.3, OFED 1.4.1 and kernel 2.6.18-128.el5.

I'm booting via PXE over IB; my initrd image brings up the ib0 interface, 
fetches the squashfs image and mounts it with aufs.

The problem is that when booting more than 60 nodes, I start to get the 
errors below on the subnet manager.
The problem seems to be intermittent, because each time the errors are 
reported on a different path.

Any ideas?

Aug 24 15:36:19 713836 [48D7D940] 0x02 -> osm_report_notice: Reporting 
Generic Notice type:3 num:64 (GID in service) from LID:1 
GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713838 [48D7D940] 0x02 -> 
__osm_state_mgr_report_new_ports: Discovered new port with 
GUID:0x50800200008d9381 LID range [78,78] of node:b03n06 HCA-1
Aug 24 15:36:19 713840 [48D7D940] 0x02 -> osm_report_notice: Reporting 
Generic Notice type:3 num:64 (GID in service) from LID:1 
GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713842 [48D7D940] 0x02 -> 
__osm_state_mgr_report_new_ports: Discovered new port with 
GUID:0x50800200008d4689 LID range [76,76] of node:b03n04 HCA-1
Aug 24 15:36:19 713845 [48D7D940] 0x02 -> osm_report_notice: Reporting 
Generic Notice type:3 num:64 (GID in service) from LID:1 
GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713847 [48D7D940] 0x02 -> 
__osm_state_mgr_report_new_ports: Discovered new port with 
GUID:0x50800200008e5191 LID range [82,82] of node:b03n11 HCA-1
Aug 24 15:36:19 713849 [48D7D940] 0x02 -> osm_report_notice: Reporting 
Generic Notice type:3 num:64 (GID in service) from LID:1 
GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713866 [48D7D940] 0x02 -> 
__osm_state_mgr_report_new_ports: Discovered new port with 
GUID:0x50800200008d94c9 LID range [80,80] of node:b03n08 HCA-1
Aug 24 15:36:19 713869 [48D7D940] 0x02 -> osm_report_notice: Reporting 
Generic Notice type:3 num:64 (GID in service) from LID:1 
GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713871 [48D7D940] 0x02 -> 
__osm_state_mgr_report_new_ports: Discovered new port with 
GUID:0x50800200008daedd LID range [83,83] of node:b03n12 HCA-1
Aug 24 15:36:19 714782 [48D7D940] 0x02 -> SUBNET UP
Aug 24 15:36:19 714805 [48D7D940] 0x01 -> 
__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for 
node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) 
port 19. Adding to light sweep sampling list
Aug 24 15:36:19 714812 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
                Path = 0,1,15,15,15
Aug 24 15:36:19 714822 [48D7D940] 0x01 -> 
__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for 
node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) 
port 21. Adding to light sweep sampling list
Aug 24 15:36:19 714827 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
                Path = 0,1,15,15,15
Aug 24 15:36:19 714831 [48D7D940] 0x01 -> 
__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for 
node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) 
port 25. Adding to light sweep sampling list
Aug 24 15:36:19 714835 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
                Path = 0,1,15,15,15
Aug 24 15:36:20 514302 [4977E940] 0x01 -> umad_receiver: ERR 5409: send 
completed with error (method=0x1 attr=0x15 trans_id=0x4700036595) -- 
dropping
Aug 24 15:36:20 514321 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR 
SMP Hop Ptr: 0x0
Aug 24 15:36:20 514328 [4977E940] 0x01 -> Received SMP on a 5 hop path:
                Initial path = 0,0,0,0,0,0
                Return path  = 0,0,0,0,0,0
Aug 24 15:36:20 514333 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: 
ERR 3113: MAD completed in error (IB_TIMEOUT)
Aug 24 15:36:20 514352 [4977E940] 0x01 -> SMP dump:
                base_ver................0x1
                mgmt_class..............0x81
                class_ver...............0x1
                method..................0x1 (SubnGet)
                D bit...................0x0
                status..................0x0
                hop_ptr.................0x0
                hop_count...............0x5
                trans_id................0x36595
                attr_id.................0x15 (PortInfo)
                resv....................0x0
                attr_mod................0x0
                m_key...................0x0000000000000000
                dr_slid.................65535
                dr_dlid.................65535

                Initial path: 0,1,15,15,15,19
                Return path:  0,0,0,0,0,0
                Reserved:     [0][0][0][0][0][0][0]

                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Aug 24 15:36:20 514364 [4977E940] 0x01 -> umad_receiver: ERR 5409: send 
completed with error (method=0x1 attr=0x15 trans_id=0x4700036596) -- 
dropping
Aug 24 15:36:20 514367 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR 
SMP Hop Ptr: 0x0
Aug 24 15:36:20 514372 [4977E940] 0x01 -> Received SMP on a 5 hop path:
                Initial path = 0,0,0,0,0,0
                Return path  = 0,0,0,0,0,0
Aug 24 15:36:20 514375 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: 
ERR 3113: MAD completed in error (IB_TIMEOUT)
Aug 24 15:36:20 514391 [4977E940] 0x01 -> SMP dump:
                base_ver................0x1
                mgmt_class..............0x81
                class_ver...............0x1
                method..................0x1 (SubnGet)
                D bit...................0x0
                status..................0x0
                hop_ptr.................0x0
                hop_count...............0x5
                trans_id................0x36596
                attr_id.................0x15 (PortInfo)
                resv....................0x0
....
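
To check whether the failing paths really change from one boot to the next, 
a small script along the following lines could tally the error codes and 
directed paths in an OpenSM log. This is only a sketch: the function name 
summarize and both regular expressions are my own assumptions based on the 
log format shown above, not anything provided by OpenSM, and all-zero paths 
are skipped on the assumption that they are placeholders.

```python
import re
from collections import Counter

# Assumed pattern: OpenSM error lines carry "ERR <4-character code>:".
ERR_RE = re.compile(r"ERR\s+([0-9A-Fa-f]{4}):")

# Assumed pattern: directed paths appear after "Path =" or "Initial path:".
PATH_RE = re.compile(r"(?:Initial path:|Path =)\s*([0-9]+(?:,[0-9]+)*)")


def summarize(log_text):
    """Return (error-code counts, directed-path counts) for an OpenSM log.

    The whole log is searched as one string, so entries that were wrapped
    across several lines are still matched.
    """
    errors = Counter(ERR_RE.findall(log_text))
    paths = Counter(
        path
        for path in PATH_RE.findall(log_text)
        if set(path.split(",")) != {"0"}  # skip all-zero paths
    )
    return errors, paths
```

Calling summarize(open(logfile).read()) on the log from each boot, where 
logfile is the path of your OpenSM log, and comparing the two counters 
should show whether the same ports and paths fail every time.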

