Re: [ofa-general] Need help with Infiniband problem
Hal Rosenstock
hal.rosenstock at gmail.com
Wed Mar 18 14:34:16 PDT 2009
On Wed, Mar 18, 2009 at 5:15 PM, jeffrey Lang <jrlang at uwyo.edu> wrote:
> I'm currently using the version released with CentOS 4.x, which shows up
> as "infiniband-diags-1.3.6-1.el4". I found this in the syslog:
>
> Mar 18 13:47:43 h2o01 saquery[21966]: Unable to Open IBT Device[/dev/SysIbt]
I'm unfamiliar with the CentOS port and have never seen a message like
that, so I haven't a clue.
-- Hal
> Now I just need to figure out why the device entry doesn't exist.
>
> With the firmware update, the errors below have disappeared and IPoIB is
> now working.
>
> jeff
>
> Hal Rosenstock wrote:
>
> On Wed, Mar 18, 2009 at 10:37 AM, jeffrey Lang <jrlang at uwyo.edu> wrote:
>
>
> Here's the output for ibchecknet:
> [root at h2o01 ~]# ibchecknet
> perfquery: iberror: failed: smp query nodeinfo: Node type not CA
>
>
> What diags version is being used ?
>
>
> Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port all:
> FAILED
> #warn: counter SymbolErrors = 43259 (threshold 10) lid 1 port 17
> Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port 17:
> FAILED
> #warn: counter LinkRecovers = 207 (threshold 10) lid 1 port 2
> #warn: counter RcvErrors = 112 (threshold 10) lid 1 port 2
> Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port 2:
> FAILED
> #warn: counter LinkDowned = 10 (threshold 10) lid 1 port 1
> #warn: counter RcvErrors = 95 (threshold 10) lid 1 port 1
> Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port 1:
> FAILED
>
>
> Are the counts for these ports (1,2,17) changing ? You can look with
> perfquery 1 <port #>.
> This is a separate issue from the lack of an IPoIB broadcast group
> assuming these numbers are incrementing.
> -- Hal
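One way to answer the "are the counts changing?" question is to run ibchecknet twice and compare the `#warn` lines. The following is only a minimal sketch: it assumes the `#warn` line format shown in this thread, and the function names `parse_warnings` and `changed_counters` are made up for illustration.

```python
import re

# Matches the "#warn" lines printed by ibchecknet in this thread, e.g.
#   #warn: counter SymbolErrors = 43259 (threshold 10) lid 1 port 17
# Assumption: other diags versions may format these lines differently.
WARN_RE = re.compile(
    r"#warn: counter (\w+) = (\d+) \(threshold \d+\) lid (\d+) port (\d+)"
)


def parse_warnings(text):
    """Return {(lid, port, counter): value} for every #warn line in text."""
    counters = {}
    for name, value, lid, port in WARN_RE.findall(text):
        counters[(int(lid), int(port), name)] = int(value)
    return counters


def changed_counters(before, after):
    """Return {(lid, port, counter): delta} for counters that grew between runs."""
    old = parse_warnings(before)
    new = parse_warnings(after)
    return {
        key: value - old.get(key, 0)
        for key, value in new.items()
        if value > old.get(key, 0)
    }
```

Feeding it the saved output of two runs taken a few minutes apart returns only the counters that are still incrementing; an empty result suggests the errors are historical (for example, left over from recabling) rather than ongoing.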
>
>
> # Checking Ca: nodeguid 0x00066a0098007e99
> # Checking Ca: nodeguid 0x00066a0098007e9b
> # Checking Ca: nodeguid 0x00066a0098007e97
> # Checking Ca: nodeguid 0x00066a0098007e8c
> # Checking Ca: nodeguid 0x00066a0098007e94
> # Checking Ca: nodeguid 0x00066a0098007e93
> # Checking Ca: nodeguid 0x00066a0098007e8e
> # Checking Ca: nodeguid 0x00066a0098007e90
> # Checking Ca: nodeguid 0x00066a0098007e98
> # Checking Ca: nodeguid 0x00066a0098007e95
> # Checking Ca: nodeguid 0x00066a0098007e8f
> # Checking Ca: nodeguid 0x00066a0098007e92
> # Checking Ca: nodeguid 0x00066a0098007e8d
> # Checking Ca: nodeguid 0x00066a0098007e91
> # Checking Ca: nodeguid 0x00066a0098007e96
> # Checking Ca: nodeguid 0x00066a0098007e9c
> ## Summary: 17 nodes checked, 0 bad nodes found
> ## 32 ports checked, 0 bad ports found
> ## 3 ports have errors beyond threshold
> I see these messages in the switch log now:
> E|2009/03/18 07:34:28.635S: Thread "esm_sar" (0x83394a90)
> ESM: Embedded SM Error: sa_McMemberRecord_Set: Component mask of
> 0x0000000000010083 does not have bits required to create a group
> (0x00000000000130C6) for new MGID of 0xFF12401BFFFF0000:00000000FFFFFFFF for
> request from h2o12 HCA-1, Port 0x00066A00A0007E8E, LID 0x000C, returning
> status 0x0600 : 0
> I would have to assume that this is my problem, but how do I fix it?
> jeff
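The two component masks in that switch log entry can be compared bit by bit to see which fields the SM wanted but did not get. This is a rough sketch, not anything taken from the switch firmware: the field names are an assumption based on the field order of MCMemberRecord in the InfiniBand Architecture Specification, and `mask_fields` and `missing_fields` are hypothetical helper names.

```python
# Bit i of the component mask selects field i of MCMemberRecord.
# Assumption: names and order follow the IBA spec's MCMemberRecord layout.
MCMEMBER_FIELDS = [
    "MGID", "PortGID", "Q_Key", "MLID", "MTUSelector", "MTU",
    "TClass", "P_Key", "RateSelector", "Rate",
    "PacketLifeTimeSelector", "PacketLifeTime", "SL", "FlowLabel",
    "HopLimit", "Scope", "JoinState", "ProxyJoin",
]

# Values copied from the switch log message in this thread.
SUPPLIED = 0x0000000000010083
REQUIRED = 0x00000000000130C6


def mask_fields(mask):
    """Return the names of the fields whose bits are set in mask."""
    return [name for bit, name in enumerate(MCMEMBER_FIELDS) if mask & (1 << bit)]


def missing_fields(supplied, required):
    """Return the required fields that the request did not supply."""
    return mask_fields(required & ~supplied)


if __name__ == "__main__":
    print("supplied:", mask_fields(SUPPLIED))
    print("missing: ", missing_fields(SUPPLIED, REQUIRED))
```

Under those assumptions the request carries only MGID, PortGID, P_Key and JoinState, which looks like a plain join to a group the node expects to exist already, while Q_Key, TClass, SL and FlowLabel would be needed to create it. That would be consistent with the broadcast group being absent on the SM rather than with a malformed request from the host.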
> Hal Rosenstock wrote:
> On Tue, Mar 17, 2009 at 6:04 PM, jeffrey Lang <jrlang at uwyo.edu> wrote:
> Here's the output smpquery portinfo -D 0 as requested below:
> [root at h2o01 ~]# smpquery portinfo -D 0
> # Port info: DR path 0 port 0
> Mkey:............................0x0000000000000000
> GidPrefix:.......................0xfe80000000000000
> Lid:.............................0x0003
> SMLid:...........................0x0001
> CapMask:.........................0x2510a68
> IsTrapSupported
> IsAutomaticMigrationSupported
> IsSLMappingSupported
> IsLedInfoSupported
> IsSystemImageGUIDsupported
> IsCommunicatonManagementSupported
> IsVendorClassSupported
> IsCapabilityMaskNoticeSupported
> IsClientRegistrationSupported
> DiagCode:........................0x0000
> MkeyLeasePeriod:.................0
> LocalPort:.......................1
> LinkWidthEnabled:................1X or 4X
> LinkWidthSupported:..............1X or 4X
> LinkWidthActive:.................4X
> LinkSpeedSupported:..............2.5 Gbps
> LinkState:.......................Active
> PhysLinkState:...................LinkUp
> LinkDownDefState:................Polling
> ProtectBits:.....................0
> LMC:.............................0
> LinkSpeedActive:.................2.5 Gbps
> LinkSpeedEnabled:................2.5 Gbps
> NeighborMTU:.....................2048
> SMSL:............................0
> VLCap:...........................VL0-3
> InitType:........................0x00
> VLHighLimit:.....................0
> VLArbHighCap:....................8
> VLArbLowCap:.....................8
> InitReply:.......................0x00
> MtuCap:..........................2048
> VLStallCount:....................7
> HoqLife:.........................0
> OperVLs:.........................VL0-3
> PartEnforceInb:..................0
> PartEnforceOutb:.................0
> FilterRawInb:....................0
> FilterRawOutb:...................0
> MkeyViolations:..................0
> PkeyViolations:..................0
> QkeyViolations:..................0
> GuidCap:.........................32
> ClientReregister:................0
> SubnetTimeout:...................17
> RespTimeVal:.....................16
> LocalPhysErr:....................15
> OverrunErr:......................15
> MaxCreditHint:...................0
> RoundTrip:.......................0
> Looks fine.
> I did some checking, and it's not just this node having problems; all nodes
> seem to be having the same problem.
> Would you also run ibchecknet ?
> What error messages are on the SM side ?
> -- Hal
> jeff
> Hal Rosenstock wrote:
> 2009/3/17 jeffrey Lang <jrlang at uwyo.edu>:
> First let me say, I hope this is the right list for this email; if not,
> please forgive me.
> I have a small 16-node compute cluster. The university where I work
> recently opened a new datacenter, and my cluster was moved there from the
> old one. Before the move InfiniBand was working properly; after
> the move IPoIB stopped working.
> The cluster runs CentOS 4 with all the latest updates and the
> CentOS-distributed OFED code. My plan was to update the OFED code once
> things had restabilized.
> For the move, I shut down the cluster and removed the InfiniBand cables, and
> the cluster was moved. I then reinstalled the InfiniBand cables (not in the
> same order as before the move) and brought everything back up.
> When I brought the cluster back up, IPoIB would not work. The only
> message in the log file is "Mar 15 04:04:32 h2o01 kernel: ib0: multicast
> join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22".
> I think that there may be a rate issue in terms of this node relative
> to the IPoIB broadcast group which by default is 10 Gbps (4x SDR).
> What does this node's portinfo show (smpquery portinfo -D 0) in terms
> of link width and speed ?
> -- Hal
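The rate check suggested here is just link width times per-lane speed, compared against the 10 Gbps (4x SDR) broadcast group default mentioned above. A small sketch follows, assuming the `smpquery portinfo` field layout quoted elsewhere in this thread; `link_rate_gbps` and `meets_broadcast_rate` are illustrative names, not part of any tool.

```python
import re

# Default IPoIB broadcast group rate quoted in this thread (4x SDR).
BROADCAST_GROUP_RATE_GBPS = 10.0


def link_rate_gbps(portinfo_text):
    """Compute active link rate from LinkWidthActive and LinkSpeedActive.

    Assumes lines shaped like the smpquery output in this thread, e.g.
      LinkWidthActive:.................4X
      LinkSpeedActive:.................2.5 Gbps
    """
    width = re.search(r"LinkWidthActive:\.+(\d+)X", portinfo_text)
    speed = re.search(r"LinkSpeedActive:\.+([\d.]+) Gbps", portinfo_text)
    if width is None or speed is None:
        raise ValueError("LinkWidthActive or LinkSpeedActive not found")
    return int(width.group(1)) * float(speed.group(1))


def meets_broadcast_rate(portinfo_text, group_rate=BROADCAST_GROUP_RATE_GBPS):
    """True if the port's active rate is at least the broadcast group rate."""
    return link_rate_gbps(portinfo_text) >= group_rate
```

For the portinfo output posted for h2o01 (4X width at 2.5 Gbps per lane) this gives 10 Gbps, which is why the port "looks fine" from a rate standpoint. A port that had trained down to 1X would come out at 2.5 Gbps and fail the check.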
> The master node can see all the systems:
> [root at h2o01 log]# ibnodes
> Ca : 0x00066a0098007e99 ports 1 "h2o17 HCA-1"
> Ca : 0x00066a0098007e9b ports 1 "h2o18 HCA-1"
> Ca : 0x00066a0098007e97 ports 1 "h2o16 HCA-1"
> Ca : 0x00066a0098007e8c ports 1 "h2o15 HCA-1"
> Ca : 0x00066a0098007e94 ports 1 "h2o14 HCA-1"
> Ca : 0x00066a0098007e93 ports 1 "h2o13 HCA-1"
> Ca : 0x00066a0098007e8e ports 1 "h2o12 HCA-1"
> Ca : 0x00066a0098007e90 ports 1 "h2o11 HCA-1"
> Ca : 0x00066a0098007e98 ports 1 "h2o10 HCA-1"
> Ca : 0x00066a0098007e95 ports 1 "h2o09 HCA-1"
> Ca : 0x00066a0098007e8f ports 1 "h2o08 HCA-1"
> Ca : 0x00066a0098007e92 ports 1 "h2o07 HCA-1"
> Ca : 0x00066a0098007e8d ports 1 "h2o06 HCA-1"
> Ca : 0x00066a0098007e91 ports 1 "h2o05 HCA-1"
> Ca : 0x00066a0098007e96 ports 1 "h2ocfs HCA-1"
> Ca : 0x00066a0098007e9c ports 1 "h2o01 HCA-1"
> Switch : 0x00066a00d8000593 ports 24 "SilverStorm 9024 GUID=0x00066a00d8000593" enhanced port 0 lid 1 lmc 0
> I've reset the SM on the switch, but nothing seems to work.
> Any ideas of where to look for what's causing the problem?
> jeff