[ofa-general] Problems with OpenSM from ofed 1.4.1 and MESH topology.
Rafael David Tinoco
Rafael.Tinoco at Sun.COM
Mon Aug 24 11:46:04 PDT 2009
Hello,
I'm installing an HPC cluster using two Sun Blade 6048 chassis with QNEMs (2
ASICs each, 8 QNEMs).
They are configured in a MESH topology.
I'm using CentOS 5.3, OFED 1.4.1 and kernel 2.6.18-128.el5.
I'm booting via PXE over IB; my initrd image brings up the ib0 interface,
fetches the squashfs image and mounts it with aufs.
The problem is that when booting more than 60 nodes, I start to get the
errors below from the subnet manager.
The problem also seems to be intermittent, because each time the errors
appear on a different path.
Any ideas?
Aug 24 15:36:19 713836 [48D7D940] 0x02 -> osm_report_notice: Reporting
Generic Notice type:3 num:64 (GID in service) from LID:1
GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713838 [48D7D940] 0x02 ->
__osm_state_mgr_report_new_ports: Discovered new port with
GUID:0x50800200008d9381 LID range [78,78] of node:b03n06 HCA-1
Aug 24 15:36:19 713840 [48D7D940] 0x02 -> osm_report_notice: Reporting
Generic Notice type:3 num:64 (GID in service) from LID:1
GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713842 [48D7D940] 0x02 ->
__osm_state_mgr_report_new_ports: Discovered new port with
GUID:0x50800200008d4689 LID range [76,76] of node:b03n04 HCA-1
Aug 24 15:36:19 713845 [48D7D940] 0x02 -> osm_report_notice: Reporting
Generic Notice type:3 num:64 (GID in service) from LID:1
GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713847 [48D7D940] 0x02 ->
__osm_state_mgr_report_new_ports: Discovered new port with
GUID:0x50800200008e5191 LID range [82,82] of node:b03n11 HCA-1
Aug 24 15:36:19 713849 [48D7D940] 0x02 -> osm_report_notice: Reporting
Generic Notice type:3 num:64 (GID in service) from LID:1
GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713866 [48D7D940] 0x02 ->
__osm_state_mgr_report_new_ports: Discovered new port with
GUID:0x50800200008d94c9 LID range [80,80] of node:b03n08 HCA-1
Aug 24 15:36:19 713869 [48D7D940] 0x02 -> osm_report_notice: Reporting
Generic Notice type:3 num:64 (GID in service) from LID:1
GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713871 [48D7D940] 0x02 ->
__osm_state_mgr_report_new_ports: Discovered new port with
GUID:0x50800200008daedd LID range [83,83] of node:b03n12 HCA-1
Aug 24 15:36:19 714782 [48D7D940] 0x02 -> SUBNET UP
Aug 24 15:36:19 714805 [48D7D940] 0x01 ->
__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for
node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A)
port 19. Adding to light sweep sampling list
Aug 24 15:36:19 714812 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
Path = 0,1,15,15,15
Aug 24 15:36:19 714822 [48D7D940] 0x01 ->
__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for
node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A)
port 21. Adding to light sweep sampling list
Aug 24 15:36:19 714827 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
Path = 0,1,15,15,15
Aug 24 15:36:19 714831 [48D7D940] 0x01 ->
__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for
node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A)
port 25. Adding to light sweep sampling list
Aug 24 15:36:19 714835 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
Path = 0,1,15,15,15
Aug 24 15:36:20 514302 [4977E940] 0x01 -> umad_receiver: ERR 5409: send
completed with error (method=0x1 attr=0x15 trans_id=0x4700036595) --
dropping
Aug 24 15:36:20 514321 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR
SMP Hop Ptr: 0x0
Aug 24 15:36:20 514328 [4977E940] 0x01 -> Received SMP on a 5 hop path:
Initial path = 0,0,0,0,0,0
Return path = 0,0,0,0,0,0
Aug 24 15:36:20 514333 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb:
ERR 3113: MAD completed in error (IB_TIMEOUT)
Aug 24 15:36:20 514352 [4977E940] 0x01 -> SMP dump:
base_ver................0x1
mgmt_class..............0x81
class_ver...............0x1
method..................0x1 (SubnGet)
D bit...................0x0
status..................0x0
hop_ptr.................0x0
hop_count...............0x5
trans_id................0x36595
attr_id.................0x15 (PortInfo)
resv....................0x0
attr_mod................0x0
m_key...................0x0000000000000000
dr_slid.................65535
dr_dlid.................65535
Initial path: 0,1,15,15,15,19
Return path: 0,0,0,0,0,0
Reserved: [0][0][0][0][0][0][0]
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Aug 24 15:36:20 514364 [4977E940] 0x01 -> umad_receiver: ERR 5409: send
completed with error (method=0x1 attr=0x15 trans_id=0x4700036596) --
dropping
Aug 24 15:36:20 514367 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR
SMP Hop Ptr: 0x0
Aug 24 15:36:20 514372 [4977E940] 0x01 -> Received SMP on a 5 hop path:
Initial path = 0,0,0,0,0,0
Return path = 0,0,0,0,0,0
Aug 24 15:36:20 514375 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb:
ERR 3113: MAD completed in error (IB_TIMEOUT)
Aug 24 15:36:20 514391 [4977E940] 0x01 -> SMP dump:
base_ver................0x1
mgmt_class..............0x81
class_ver...............0x1
method..................0x1 (SubnGet)
D bit...................0x0
status..................0x0
hop_ptr.................0x0
hop_count...............0x5
trans_id................0x36596
attr_id.................0x15 (PortInfo)
resv....................0x0
....
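
One way to check whether the failures really move between paths from one boot to the next is to tally the log by error code and by the directed route in each SMP dump. The following is only a minimal sketch in Python, assuming the OpenSM log is available as plain text; `summarize`, `ERR_RE` and `PATH_RE` are hypothetical names for illustration and are not part of OpenSM or OFED.

```python
import re
from collections import Counter

# Error codes appear in the log as "ERR 3315:", "ERR 3113:", and so on.
ERR_RE = re.compile(r"ERR (\d{4}):")
# Only the SMP dump uses "Initial path:" with a colon; the "Received SMP"
# lines use "Initial path =" and show an all-zero route, so they are skipped.
PATH_RE = re.compile(r"Initial path:\s*([\d,]+)")


def summarize(log_text):
    """Count error codes and failing directed routes in an OpenSM log."""
    errors = Counter(ERR_RE.findall(log_text))
    paths = Counter(PATH_RE.findall(log_text))
    return errors, paths


# Small excerpt taken from the log above, used here as sample input.
sample = """\
Aug 24 15:36:19 714805 [48D7D940] 0x01 ->
__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for
node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A)
port 19. Adding to light sweep sampling list
Aug 24 15:36:20 514333 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb:
ERR 3113: MAD completed in error (IB_TIMEOUT)
Initial path: 0,1,15,15,15,19
"""

errors, paths = summarize(sample)
for code, count in sorted(errors.items()):
    print(f"ERR {code}: {count}")
for route, count in sorted(paths.items()):
    print(f"route {route}: {count}")
```

Running this over the logs from several boots and comparing the route counts would show whether the timeouts cluster on particular switch ports or are spread across the fabric.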