[ofa-general] Need help with Infiniband problem

jeffrey Lang jrlang at uwyo.edu
Tue Mar 17 15:04:11 PDT 2009


Here's the output of smpquery portinfo -D 0, as requested below:

[root@h2o01 ~]# smpquery portinfo -D 0
# Port info: DR path 0 port 0
Mkey:............................0x0000000000000000
GidPrefix:.......................0xfe80000000000000
Lid:.............................0x0003
SMLid:...........................0x0001
CapMask:.........................0x2510a68
                IsTrapSupported
                IsAutomaticMigrationSupported
                IsSLMappingSupported
                IsLedInfoSupported
                IsSystemImageGUIDsupported
                IsCommunicatonManagementSupported
                IsVendorClassSupported
                IsCapabilityMaskNoticeSupported
                IsClientRegistrationSupported
DiagCode:........................0x0000
MkeyLeasePeriod:.................0
LocalPort:.......................1
LinkWidthEnabled:................1X or 4X
LinkWidthSupported:..............1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkDownDefState:................Polling
ProtectBits:.....................0
LMC:.............................0
LinkSpeedActive:.................2.5 Gbps
LinkSpeedEnabled:................2.5 Gbps
NeighborMTU:.....................2048
SMSL:............................0
VLCap:...........................VL0-3
InitType:........................0x00
VLHighLimit:.....................0
VLArbHighCap:....................8
VLArbLowCap:.....................8
InitReply:.......................0x00
MtuCap:..........................2048
VLStallCount:....................7
HoqLife:.........................0
OperVLs:.........................VL0-3
PartEnforceInb:..................0
PartEnforceOutb:.................0
FilterRawInb:....................0
FilterRawOutb:...................0
MkeyViolations:..................0
PkeyViolations:..................0
QkeyViolations:..................0
GuidCap:.........................32
ClientReregister:................0
SubnetTimeout:...................17
RespTimeVal:.....................16
LocalPhysErr:....................15
OverrunErr:......................15
MaxCreditHint:...................0
RoundTrip:.......................0
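
For reference, the rate Hal asks about below can be worked out from the two
active-link fields in this output. This is a minimal Python sketch, assuming
the link rate is simply lane count times per-lane speed. parse_portinfo and
active_rate_gbps are hypothetical helper names, not part of the OFED tools:

```python
import re


def parse_portinfo(text):
    """Collect 'Name:....value' lines from smpquery portinfo output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Field lines look like "LinkWidthActive:.................4X";
        # capability names and the "# Port info" header do not match.
        m = re.match(r"^(\w+):\.+(.*)$", line.strip())
        if m:
            fields[m.group(1)] = m.group(2).strip()
    return fields


def active_rate_gbps(fields):
    """Assumed model: rate = active lane count * active per-lane speed."""
    lanes = int(fields["LinkWidthActive"].rstrip("Xx"))   # "4X" -> 4
    speed = float(fields["LinkSpeedActive"].split()[0])   # "2.5 Gbps" -> 2.5
    return lanes * speed


# The two relevant lines, copied from the output above.
SAMPLE = """\
LinkWidthActive:.................4X
LinkSpeedActive:.................2.5 Gbps
"""

fields = parse_portinfo(SAMPLE)
rate = active_rate_gbps(fields)
print(rate)
```

With the values above this gives 10.0, the same 4x SDR figure as the default
broadcast group rate Hal mentions, so by this rough calculation the port's
width and speed do not look like a mismatch on their own.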


I did some checking, and it's not just this node having problems; all
nodes seem to be having the same problem.

jeff


Hal Rosenstock wrote:
> 2009/3/17 jeffrey Lang <jrlang at uwyo.edu>:
>   
>> First, I hope this is the right list for this email; if not,
>> please forgive me.
>>
>> I have a small 16-node compute cluster.  The university where I work
>> recently opened a new datacenter, and my cluster was moved there from the
>> old one.  Before the move, InfiniBand was working properly; after
>> the move, IPoIB stopped working.
>>
>> The cluster runs CentOS 4 with all the latest updates and the OFED code
>> distributed with CentOS.  My plan was to update the OFED code once things had
>> restabilized.
>>
>> For the move, I shut down the cluster and removed the InfiniBand cables, and
>> the cluster was moved.  I then reinstalled the InfiniBand cables (not in the
>> same order as before the move) and brought everything back up.
>>
>> When I brought the cluster back up, IPoIB would not work.  The only
>> message in the log file is "Mar 15 04:04:32 h2o01 kernel: ib0: multicast
>> join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22".
>>     
>
> I think there may be a rate issue with this node relative
> to the IPoIB broadcast group, which by default is 10 Gbps (4x SDR).
> What does this node's portinfo show (smpquery portinfo -D 0) in terms
> of link width and speed?
>
> -- Hal
>
>   
>> The master node can see all the systems:
>>
>> [root@h2o01 log]# ibnodes
>> Ca    : 0x00066a0098007e99 ports 1 "h2o17 HCA-1"
>> Ca    : 0x00066a0098007e9b ports 1 "h2o18 HCA-1"
>> Ca    : 0x00066a0098007e97 ports 1 "h2o16 HCA-1"
>> Ca    : 0x00066a0098007e8c ports 1 "h2o15 HCA-1"
>> Ca    : 0x00066a0098007e94 ports 1 "h2o14 HCA-1"
>> Ca    : 0x00066a0098007e93 ports 1 "h2o13 HCA-1"
>> Ca    : 0x00066a0098007e8e ports 1 "h2o12 HCA-1"
>> Ca    : 0x00066a0098007e90 ports 1 "h2o11 HCA-1"
>> Ca    : 0x00066a0098007e98 ports 1 "h2o10 HCA-1"
>> Ca    : 0x00066a0098007e95 ports 1 "h2o09 HCA-1"
>> Ca    : 0x00066a0098007e8f ports 1 "h2o08 HCA-1"
>> Ca    : 0x00066a0098007e92 ports 1 "h2o07 HCA-1"
>> Ca    : 0x00066a0098007e8d ports 1 "h2o06 HCA-1"
>> Ca    : 0x00066a0098007e91 ports 1 "h2o05 HCA-1"
>> Ca    : 0x00066a0098007e96 ports 1 "h2ocfs HCA-1"
>> Ca    : 0x00066a0098007e9c ports 1 "h2o01 HCA-1"
>> Switch    : 0x00066a00d8000593 ports 24 "SilverStorm 9024 GUID=0x00066a00d8000593" enhanced port 0 lid 1 lmc 0
>>
>> I've reset the SM on the switch, but nothing seems to work.
>>
>> Any ideas on where to look for what's causing the problem?
>>
>> jeff