[ofa-general] Need help with Infiniband problem

jeffrey Lang jrlang at uwyo.edu
Tue Mar 17 07:58:28 PDT 2009


First, let me say that I hope this is the right list for this email; if 
not, please forgive me.

I have a small 16-node compute cluster.  The university where I work 
recently opened a new datacenter, and my cluster was moved there from 
the old one.  Before the move InfiniBand was working properly; after 
the move IPoIB stopped working.

The cluster runs CentOS 4 with all the latest updates and the 
CentOS-distributed OFED code.  My plan was to update the OFED code once 
things had restabilized.

For the move, I shut down the cluster and removed the InfiniBand cables, 
and the cluster was moved.  I then reinstalled the InfiniBand cables 
(not in the same order as before the move) and brought everything back up.

When I brought the cluster back up, IPoIB would not work.  The only 
message in the log file is "Mar 15 04:04:32 h2o01 kernel: ib0: multicast 
join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22".
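
For reference, the group address in that log line can be split into its 
fields.  This is only a minimal sketch: it assumes the IPoIB multicast 
GID layout from RFC 4391 (0xff prefix, flags, scope, signature, P_Key, 
group ID) and assumes the kernel status is a negated errno value.  
Neither assumption comes from the log itself, and decode_ipoib_mgid is a 
made-up helper name.

```python
import errno
import os


def decode_ipoib_mgid(mgid: str) -> dict:
    """Split a colon-separated 128-bit MGID into IPoIB fields.

    Assumes the RFC 4391 layout; this is an interpretation of the
    address, not something reported by the kernel.
    """
    groups = [int(g, 16) for g in mgid.split(":")]
    if len(groups) != 8 or any(g > 0xFFFF for g in groups):
        raise ValueError("expected eight 16-bit hex groups")

    # Fold the last five 16-bit groups into the 80-bit group ID.
    group_id = 0
    for g in groups[3:]:
        group_id = (group_id << 16) | g

    return {
        "prefix": groups[0] >> 8,          # 0xff for multicast
        "flags": (groups[0] >> 4) & 0xF,
        "scope": groups[0] & 0xF,          # 0x2 is link-local scope
        "signature": groups[1],            # 0x401b marks IPoIB over IPv4
        "pkey": groups[2],                 # partition key of the group
        "group_id": group_id,
    }


fields = decode_ipoib_mgid("ff12:401b:ffff:0000:0000:0000:ffff:ffff")
for name, value in fields.items():
    print(f"{name:10s} {value:#x}")

# Assumption: "status -22" is a negated errno code.
status = -22
print(errno.errorcode[-status], "-", os.strerror(-status))
```

Read this way, the address would be the IPv4 broadcast group for P_Key 
0xffff, and -22 would map to EINVAL.  That narrows where to look but 
does not by itself say why the join was rejected.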

The master node can see all the systems:

[root@h2o01 log]# ibnodes
Ca    : 0x00066a0098007e99 ports 1 "h2o17 HCA-1"
Ca    : 0x00066a0098007e9b ports 1 "h2o18 HCA-1"
Ca    : 0x00066a0098007e97 ports 1 "h2o16 HCA-1"
Ca    : 0x00066a0098007e8c ports 1 "h2o15 HCA-1"
Ca    : 0x00066a0098007e94 ports 1 "h2o14 HCA-1"
Ca    : 0x00066a0098007e93 ports 1 "h2o13 HCA-1"
Ca    : 0x00066a0098007e8e ports 1 "h2o12 HCA-1"
Ca    : 0x00066a0098007e90 ports 1 "h2o11 HCA-1"
Ca    : 0x00066a0098007e98 ports 1 "h2o10 HCA-1"
Ca    : 0x00066a0098007e95 ports 1 "h2o09 HCA-1"
Ca    : 0x00066a0098007e8f ports 1 "h2o08 HCA-1"
Ca    : 0x00066a0098007e92 ports 1 "h2o07 HCA-1"
Ca    : 0x00066a0098007e8d ports 1 "h2o06 HCA-1"
Ca    : 0x00066a0098007e91 ports 1 "h2o05 HCA-1"
Ca    : 0x00066a0098007e96 ports 1 "h2ocfs HCA-1"
Ca    : 0x00066a0098007e9c ports 1 "h2o01 HCA-1"
Switch    : 0x00066a00d8000593 ports 24 "SilverStorm 9024 GUID=0x00066a00d8000593" enhanced port 0 lid 1 lmc 0
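
A listing like the one above can also be checked against the list of 
hosts that should be on the fabric, rather than by eye.  The following is 
a rough sketch only: the regular expression is written against the "Ca" 
lines exactly as they appear here, the sample is a three-line excerpt, 
and the expected-host set and parse_ca_lines are hypothetical and would 
need to be filled in for the real cluster.

```python
import re

# Matches lines such as:  Ca    : 0x00066a0098007e99 ports 1 "h2o17 HCA-1"
CA_LINE = re.compile(r'^Ca\s*:\s*(0x[0-9a-fA-F]{16})\s+ports\s+(\d+)\s+"([^"]+)"')


def parse_ca_lines(text: str) -> dict:
    """Return {node description: GUID} for every 'Ca' line in the text."""
    nodes = {}
    for line in text.splitlines():
        match = CA_LINE.match(line.strip())
        if match:
            guid, _ports, description = match.groups()
            nodes[description] = guid
    return nodes


# Excerpt of the ibnodes output shown above.
sample = """\
Ca    : 0x00066a0098007e99 ports 1 "h2o17 HCA-1"
Ca    : 0x00066a0098007e96 ports 1 "h2ocfs HCA-1"
Ca    : 0x00066a0098007e9c ports 1 "h2o01 HCA-1"
"""

nodes = parse_ca_lines(sample)

# Hypothetical list of hosts expected on the fabric.
expected = {"h2o01", "h2o17", "h2ocfs"}
seen = {description.split()[0] for description in nodes}
missing = sorted(expected - seen)

print("channel adapters found:", len(nodes))
print("missing hosts:", missing)
```

Seeing every channel adapter only shows that the links and subnet 
discovery are working; it says nothing about whether the multicast group 
that IPoIB needs can be created or joined.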

I've reset the subnet manager (SM) on the switch, but that did not help.

Any ideas on where to look for what's causing the problem?

jeff