[ofa-general] Need help with Infiniband problem

jeffrey Lang jrlang at uwyo.edu
Wed Mar 18 14:15:28 PDT 2009


I'm currently using the version released with CentOS 4.x, which 
shows as "infiniband-diags-1.3.6-1.el4".  I found this in the syslog:

Mar 18 13:47:43 h2o01 saquery[21966]: Unable to Open IBT Device[/dev/SysIbt]

Now I just need to figure out why the device entry doesn't exist.
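
A quick way to confirm what saquery is complaining about is to check whether the device node is actually present. This is a minimal Python sketch, assuming the path from the syslog line above; the helper name describe_device_node is made up for illustration, and the check only reports on the node itself, not on which driver or package is supposed to create it.

```python
import os
import stat


def describe_device_node(path):
    """Return 'missing', 'char-device', or 'other' for the given path."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    # Device entries under /dev are normally character devices.
    return "char-device" if stat.S_ISCHR(mode) else "other"


if __name__ == "__main__":
    # Path copied from the saquery syslog message.
    print(describe_device_node("/dev/SysIbt"))
```

If this prints "missing", the next thing to look at is whatever is expected to create that entry at boot.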

With the firmware update, the errors below have disappeared and IPoIB 
is now working.

jeff

Hal Rosenstock wrote:
> On Wed, Mar 18, 2009 at 10:37 AM, jeffrey Lang <jrlang at uwyo.edu> wrote:
>   
>> Here's the output for ibchecknet:
>>
>> [root at h2o01 ~]# ibchecknet
>> perfquery: iberror: failed: smp query nodeinfo: Node type not CA
>>     
>
> What diags version is being used ?
>
>   
>> Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port all:
>> FAILED
>> #warn: counter SymbolErrors = 43259     (threshold 10) lid 1 port 17
>> Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port 17:
>> FAILED
>> #warn: counter LinkRecovers = 207       (threshold 10) lid 1 port 2
>> #warn: counter RcvErrors = 112  (threshold 10) lid 1 port 2
>> Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port 2:
>> FAILED
>> #warn: counter LinkDowned = 10  (threshold 10) lid 1 port 1
>> #warn: counter RcvErrors = 95   (threshold 10) lid 1 port 1
>> Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port 1:
>> FAILED
>>     
>
> Are the counts for these ports (1,2,17) changing ? You can look with
> perfquery 1 <port #>.
> This is a separate issue from the lack of an IPoIB broadcast group
> assuming these numbers are incrementing.
>
> -- Hal
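
The "are the counts changing" check amounts to taking two readings of the same port and comparing them. A minimal Python sketch of that comparison follows; it assumes the counter values have already been collected into dictionaries by hand or by some other means (it does not parse perfquery output), and the function name changed_counters is hypothetical.

```python
def changed_counters(before, after):
    """Return {name: delta} for counters whose value differs between snapshots."""
    return {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] != before[name]
    }


if __name__ == "__main__":
    # First snapshot: the port 2 values from the ibchecknet output above.
    first = {"LinkRecovers": 207, "RcvErrors": 112}
    # Second snapshot: invented for illustration only, NOT real readings.
    second = {"LinkRecovers": 207, "RcvErrors": 120}
    print(changed_counters(first, second))
```

An empty result would suggest the errors are historical (for example, left over from recabling) rather than still accumulating.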
>
>   
>> # Checking Ca: nodeguid 0x00066a0098007e99
>>
>> # Checking Ca: nodeguid 0x00066a0098007e9b
>>
>> # Checking Ca: nodeguid 0x00066a0098007e97
>>
>> # Checking Ca: nodeguid 0x00066a0098007e8c
>>
>> # Checking Ca: nodeguid 0x00066a0098007e94
>>
>> # Checking Ca: nodeguid 0x00066a0098007e93
>>
>> # Checking Ca: nodeguid 0x00066a0098007e8e
>>
>> # Checking Ca: nodeguid 0x00066a0098007e90
>>
>> # Checking Ca: nodeguid 0x00066a0098007e98
>>
>> # Checking Ca: nodeguid 0x00066a0098007e95
>>
>> # Checking Ca: nodeguid 0x00066a0098007e8f
>>
>> # Checking Ca: nodeguid 0x00066a0098007e92
>>
>> # Checking Ca: nodeguid 0x00066a0098007e8d
>>
>> # Checking Ca: nodeguid 0x00066a0098007e91
>>
>> # Checking Ca: nodeguid 0x00066a0098007e96
>>
>> # Checking Ca: nodeguid 0x00066a0098007e9c
>>
>> ## Summary: 17 nodes checked, 0 bad nodes found
>> ##          32 ports checked, 0 bad ports found
>> ##          3 ports have errors beyond threshold
>>
>>
>>
>>
>> I see these messages in the switch log now:
>>
>> E|2009/03/18 07:34:28.635S: Thread "esm_sar" (0x83394a90)
>>         ESM: Embedded SM Error: sa_McMemberRecord_Set: Component mask of
>> 0x0000000000010083 does not have bits required to create a group
>> (0x00000000000130C6) for new MGID of 0xFF12401BFFFF0000:00000000FFFFFFFF for
>> request from h2o12 HCA-1, Port 0x00066A00A0007E8E, LID 0x000C, returning
>> status 0x0600 : 0
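
The two masks in that log line can be compared bit by bit to see which fields the SM wanted but did not get. A minimal Python sketch follows. The field names assume the MCMemberRecord component mask bit order as defined in the Linux kernel's ib_sa.h header; treat that mapping as an assumption and check it against the InfiniBand specification before relying on it.

```python
# Assumed MCMemberRecord component mask layout, bit 0 first.
# This ordering is an assumption, not taken from the switch log.
FIELDS = [
    "MGID", "PortGID", "Q_Key", "MLID", "MTUSelector", "MTU",
    "TClass", "P_Key", "RateSelector", "Rate",
    "PacketLifeTimeSelector", "PacketLifeTime", "SL", "FlowLabel",
    "HopLimit", "Scope", "JoinState", "ProxyJoin",
]


def missing_bits(provided, required):
    """Bits set in `required` that are absent from `provided`."""
    return required & ~provided


def field_names(mask):
    """Names of the fields whose bits are set in `mask`."""
    return [name for bit, name in enumerate(FIELDS) if mask & (1 << bit)]


if __name__ == "__main__":
    provided = 0x0000000000010083  # mask sent by the node, from the log
    required = 0x00000000000130C6  # mask needed to create a group, from the log
    missing = missing_bits(provided, required)
    print(hex(missing), field_names(missing))
    print(field_names(provided))
```

Under that assumed layout, the request carries only MGID, PortGID, P_Key and JoinState, which would be consistent with a node trying to join a group it expects to exist already rather than supplying the parameters needed to create one.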
>>
>>
>> I would have to assume that this is my problem, but how do I fix it?
>>
>> jeff
>>
>>
>>
>>
>>
>>
>> Hal Rosenstock wrote:
>>
>> On Tue, Mar 17, 2009 at 6:04 PM, jeffrey Lang <jrlang at uwyo.edu> wrote:
>>
>>
>> Here's the output smpquery portinfo -D 0 as requested below:
>> [root at h2o01 ~]# smpquery portinfo -D 0
>> # Port info: DR path 0 port 0
>> Mkey:............................0x0000000000000000
>> GidPrefix:.......................0xfe80000000000000
>> Lid:.............................0x0003
>> SMLid:...........................0x0001
>> CapMask:.........................0x2510a68
>>                 IsTrapSupported
>>                 IsAutomaticMigrationSupported
>>                 IsSLMappingSupported
>>                 IsLedInfoSupported
>>                 IsSystemImageGUIDsupported
>>                 IsCommunicatonManagementSupported
>>                 IsVendorClassSupported
>>                 IsCapabilityMaskNoticeSupported
>>                 IsClientRegistrationSupported
>> DiagCode:........................0x0000
>> MkeyLeasePeriod:.................0
>> LocalPort:.......................1
>> LinkWidthEnabled:................1X or 4X
>> LinkWidthSupported:..............1X or 4X
>> LinkWidthActive:.................4X
>> LinkSpeedSupported:..............2.5 Gbps
>> LinkState:.......................Active
>> PhysLinkState:...................LinkUp
>> LinkDownDefState:................Polling
>> ProtectBits:.....................0
>> LMC:.............................0
>> LinkSpeedActive:.................2.5 Gbps
>> LinkSpeedEnabled:................2.5 Gbps
>> NeighborMTU:.....................2048
>> SMSL:............................0
>> VLCap:...........................VL0-3
>> InitType:........................0x00
>> VLHighLimit:.....................0
>> VLArbHighCap:....................8
>> VLArbLowCap:.....................8
>> InitReply:.......................0x00
>> MtuCap:..........................2048
>> VLStallCount:....................7
>> HoqLife:.........................0
>> OperVLs:.........................VL0-3
>> PartEnforceInb:..................0
>> PartEnforceOutb:.................0
>> FilterRawInb:....................0
>> FilterRawOutb:...................0
>> MkeyViolations:..................0
>> PkeyViolations:..................0
>> QkeyViolations:..................0
>> GuidCap:.........................32
>> ClientReregister:................0
>> SubnetTimeout:...................17
>> RespTimeVal:.....................16
>> LocalPhysErr:....................15
>> OverrunErr:......................15
>> MaxCreditHint:...................0
>> RoundTrip:.......................0
>>
>>
>> Looks fine.
>>
>>
>> I did some checking, and it's not just this node having problems; all nodes
>> seem to be having this same problem.
>>
>>
>> Would you also run ibchecknet ?
>> What error messages are on the SM side ?
>> -- Hal
>>
>>
>> jeff
>> Hal Rosenstock wrote:
>> 2009/3/17 jeffrey Lang <jrlang at uwyo.edu>:
>> First, let me say I hope this is the right list for this email; if not,
>> please forgive me.
>> I have a small 16-node compute cluster.  The university where I work
>> recently opened a new datacenter, and my cluster was moved there from the
>> old one.  Before the move the InfiniBand was working properly; after
>> the move IPoIB stopped working.
>> The cluster runs CentOS 4 with all the latest updates and the OFED code
>> distributed with CentOS.  My plan was to update the OFED code once things had
>> restabilized.
>> For the move, I shut down the cluster and removed the InfiniBand cables, and
>> the cluster was moved.  I then reinstalled the InfiniBand cables (not in the
>> same order as before the move) and brought everything back up.
>> When I brought the cluster back up, IPoIB would not work.  The only
>> message in the log file is "Mar 15 04:04:32 h2o01 kernel: ib0: multicast
>> join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22".
>> I think that there may be a rate issue in terms of this node relative
>> to the IPoIB broadcast group which by default is 10 Gbps (4x SDR).
>> What does this node's portinfo show (smpquery portinfo -D 0) in terms
>> of link width and speed ?
>> -- Hal
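
The rate comparison described here is simple arithmetic: the link's signalling rate is the lane count times the per-lane speed, and that is compared with the broadcast group rate. A minimal Python sketch follows, assuming (as the comment above suggests) that a port needs to be at least as fast as the group rate to join; the function name link_rate_gbps is hypothetical.

```python
def link_rate_gbps(width, lane_speed_gbps):
    """Signalling rate of a link: lane count times per-lane speed."""
    lanes = int(width.upper().rstrip("X"))
    return lanes * lane_speed_gbps


# Default IPoIB broadcast group rate quoted above (4x SDR).
GROUP_RATE_GBPS = 10.0

if __name__ == "__main__":
    # LinkWidthActive and LinkSpeedActive as reported by smpquery portinfo
    # elsewhere in this thread.
    rate = link_rate_gbps("4X", 2.5)
    print(rate, rate >= GROUP_RATE_GBPS)
```

A port that had trained down to 1X would come out at 2.5 Gbps and fail the comparison, which is the situation the portinfo request was meant to rule out.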
>> The master node can see all the systems:
>> [root at h2o01 log]# ibnodes
>> Ca    : 0x00066a0098007e99 ports 1 "h2o17 HCA-1"
>> Ca    : 0x00066a0098007e9b ports 1 "h2o18 HCA-1"
>> Ca    : 0x00066a0098007e97 ports 1 "h2o16 HCA-1"
>> Ca    : 0x00066a0098007e8c ports 1 "h2o15 HCA-1"
>> Ca    : 0x00066a0098007e94 ports 1 "h2o14 HCA-1"
>> Ca    : 0x00066a0098007e93 ports 1 "h2o13 HCA-1"
>> Ca    : 0x00066a0098007e8e ports 1 "h2o12 HCA-1"
>> Ca    : 0x00066a0098007e90 ports 1 "h2o11 HCA-1"
>> Ca    : 0x00066a0098007e98 ports 1 "h2o10 HCA-1"
>> Ca    : 0x00066a0098007e95 ports 1 "h2o09 HCA-1"
>> Ca    : 0x00066a0098007e8f ports 1 "h2o08 HCA-1"
>> Ca    : 0x00066a0098007e92 ports 1 "h2o07 HCA-1"
>> Ca    : 0x00066a0098007e8d ports 1 "h2o06 HCA-1"
>> Ca    : 0x00066a0098007e91 ports 1 "h2o05 HCA-1"
>> Ca    : 0x00066a0098007e96 ports 1 "h2ocfs HCA-1"
>> Ca    : 0x00066a0098007e9c ports 1 "h2o01 HCA-1"
>> Switch    : 0x00066a00d8000593 ports 24 "SilverStorm 9024
>> GUID=0x00066a00d8000593" enhanced port 0 lid 1 lmc 0
>> I've reset the SM on the switch, but nothing seems to work.
>> Any ideas on where to look for what's causing the problem?
>> jeff