<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=UTF-8" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
I'm currently using the version shipped with CentOS 4.X, which
shows as "infiniband-diags-1.3.6-1.el4". I also found this in the syslog:<br>
<br>
Mar 18 13:47:43 h2o01 saquery[21966]: Unable to Open IBT
Device[/dev/SysIbt]<br>
<br>
Now I just need to figure out why the device entry doesn't exist.<br>
<br>
With the firmware update, the errors below have disappeared and
IPoIB is now working.<br>
<br>
jeff<br>
<br>
Hal Rosenstock wrote:
<blockquote
cite="mid:f0e08f230903181402l79e5e76bm96b59c1f7e49cf2b@mail.gmail.com"
type="cite">
<pre wrap="">On Wed, Mar 18, 2009 at 10:37 AM, jeffrey Lang <a class="moz-txt-link-rfc2396E" href="mailto:jrlang@uwyo.edu">&lt;jrlang@uwyo.edu&gt;</a> wrote:
</pre>
<blockquote type="cite">
<pre wrap="">Here's the output for ibchecknet:
[root@h2o01 ~]# ibchecknet
perfquery: iberror: failed: smp query nodeinfo: Node type not CA
</pre>
</blockquote>
<pre wrap=""><!---->
What diags version is being used ?
</pre>
<blockquote type="cite">
<pre wrap="">Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port all:
FAILED
#warn: counter SymbolErrors = 43259 (threshold 10) lid 1 port 17
Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port 17:
FAILED
#warn: counter LinkRecovers = 207 (threshold 10) lid 1 port 2
#warn: counter RcvErrors = 112 (threshold 10) lid 1 port 2
Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port 2:
FAILED
#warn: counter LinkDowned = 10 (threshold 10) lid 1 port 1
#warn: counter RcvErrors = 95 (threshold 10) lid 1 port 1
Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port 1:
FAILED
</pre>
</blockquote>
<pre wrap=""><!---->
Are the counts for these ports (1,2,17) changing? You can look with
perfquery 1 &lt;port #&gt;.
This is a separate issue from the lack of an IPoIB broadcast group
assuming these numbers are incrementing.
-- Hal
</pre>
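One way to check whether the counts are changing is to run ibchecknet twice and compare the "#warn" lines from the two runs. The following is only a rough sketch in Python, assuming the warning format matches the ibchecknet output quoted in this thread. The function names are made up for illustration, and the second-run values in the demo are invented.<br>
<pre wrap="">
```python
import re

# Matches the "#warn" lines as they appear in the ibchecknet output
# quoted in this thread (assumed format, not taken from the tool's docs).
WARN_RE = re.compile(
    r"#warn: counter (\w+) = (\d+) \(threshold (\d+)\) lid (\d+) port (\d+)"
)


def parse_warnings(text):
    """Return {(lid, port, counter): value} for every #warn line in text."""
    counters = {}
    for match in WARN_RE.finditer(text):
        name, value, _threshold, lid, port = match.groups()
        counters[(int(lid), int(port), name)] = int(value)
    return counters


def growing_counters(before, after):
    """Return {key: increase} for counters that rose between two runs.

    Counters that appear in only one run are skipped, because their
    other value was not printed and the increase cannot be computed.
    """
    return {
        key: after[key] - before[key]
        for key in after
        if key in before and after[key] > before[key]
    }


if __name__ == "__main__":
    # First run: two lines copied from the output quoted in this thread.
    first_run = """
#warn: counter SymbolErrors = 43259 (threshold 10) lid 1 port 17
#warn: counter LinkRecovers = 207 (threshold 10) lid 1 port 2
"""
    # Hypothetical second run: these values are invented for the example.
    second_run = """
#warn: counter SymbolErrors = 43300 (threshold 10) lid 1 port 17
#warn: counter LinkRecovers = 207 (threshold 10) lid 1 port 2
"""
    changes = growing_counters(
        parse_warnings(first_run), parse_warnings(second_run)
    )
    for (lid, port, name), delta in sorted(changes.items()):
        print(f"lid {lid} port {port}: {name} rose by {delta}")
```
</pre>
A counter that stays flat between runs probably reflects old errors, for example from recabling. One that keeps rising points at a link that is still misbehaving.<br>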
<blockquote type="cite">
<pre wrap=""># Checking Ca: nodeguid 0x00066a0098007e99
# Checking Ca: nodeguid 0x00066a0098007e9b
# Checking Ca: nodeguid 0x00066a0098007e97
# Checking Ca: nodeguid 0x00066a0098007e8c
# Checking Ca: nodeguid 0x00066a0098007e94
# Checking Ca: nodeguid 0x00066a0098007e93
# Checking Ca: nodeguid 0x00066a0098007e8e
# Checking Ca: nodeguid 0x00066a0098007e90
# Checking Ca: nodeguid 0x00066a0098007e98
# Checking Ca: nodeguid 0x00066a0098007e95
# Checking Ca: nodeguid 0x00066a0098007e8f
# Checking Ca: nodeguid 0x00066a0098007e92
# Checking Ca: nodeguid 0x00066a0098007e8d
# Checking Ca: nodeguid 0x00066a0098007e91
# Checking Ca: nodeguid 0x00066a0098007e96
# Checking Ca: nodeguid 0x00066a0098007e9c
## Summary: 17 nodes checked, 0 bad nodes found
## 32 ports checked, 0 bad ports found
## 3 ports have errors beyond threshold
I see these messages in the switch log now:
E|2009/03/18 07:34:28.635S: Thread "esm_sar" (0x83394a90)
ESM: Embedded SM Error: sa_McMemberRecord_Set: Component mask of
0x0000000000010083 does not have bits required to create a group
(0x00000000000130C6) for new MGID of 0xFF12401BFFFF0000:00000000FFFFFFFF for
request from h2o12 HCA-1, Port 0x00066A00A0007E8E, LID 0x000C, returning
status 0x0600 : 0
I would have to assume that this is my problem, but how do I fix it?
jeff
Hal Rosenstock wrote:
On Tue, Mar 17, 2009 at 6:04 PM, jeffrey Lang <a class="moz-txt-link-rfc2396E" href="mailto:jrlang@uwyo.edu">&lt;jrlang@uwyo.edu&gt;</a> wrote:
Here's the output smpquery portinfo -D 0 as requested below:
[root@h2o01 ~]# smpquery portinfo -D 0
# Port info: DR path 0 port 0
Mkey:............................0x0000000000000000
GidPrefix:.......................0xfe80000000000000
Lid:.............................0x0003
SMLid:...........................0x0001
CapMask:.........................0x2510a68
IsTrapSupported
IsAutomaticMigrationSupported
IsSLMappingSupported
IsLedInfoSupported
IsSystemImageGUIDsupported
IsCommunicatonManagementSupported
IsVendorClassSupported
IsCapabilityMaskNoticeSupported
IsClientRegistrationSupported
DiagCode:........................0x0000
MkeyLeasePeriod:.................0
LocalPort:.......................1
LinkWidthEnabled:................1X or 4X
LinkWidthSupported:..............1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkDownDefState:................Polling
ProtectBits:.....................0
LMC:.............................0
LinkSpeedActive:.................2.5 Gbps
LinkSpeedEnabled:................2.5 Gbps
NeighborMTU:.....................2048
SMSL:............................0
VLCap:...........................VL0-3
InitType:........................0x00
VLHighLimit:.....................0
VLArbHighCap:....................8
VLArbLowCap:.....................8
InitReply:.......................0x00
MtuCap:..........................2048
VLStallCount:....................7
HoqLife:.........................0
OperVLs:.........................VL0-3
PartEnforceInb:..................0
PartEnforceOutb:.................0
FilterRawInb:....................0
FilterRawOutb:...................0
MkeyViolations:..................0
PkeyViolations:..................0
QkeyViolations:..................0
GuidCap:.........................32
ClientReregister:................0
SubnetTimeout:...................17
RespTimeVal:.....................16
LocalPhysErr:....................15
OverrunErr:......................15
MaxCreditHint:...................0
RoundTrip:.......................0
Looks fine.
I did some checking, and it's not just this node having problems; all nodes
seem to be having this same problem.
Would you also run ibchecknet ?
What error messages are on the SM side ?
-- Hal
jeff
Hal Rosenstock wrote:
2009/3/17 jeffrey Lang <a class="moz-txt-link-rfc2396E" href="mailto:jrlang@uwyo.edu">&lt;jrlang@uwyo.edu&gt;</a>:
First, let me say I hope this is the right list for this email; if not,
please forgive me.
I have a small 16 node compute cluster. The university where I work
recently opened a new datacenter, and my cluster was moved there from the old
one. Before the move the InfiniBand was working properly; after
the move IPoIB stopped working.
The cluster runs CentOS 4 with all the latest updates and the CentOS
distributed OFED code. My plan was to update the OFED code once things had
restabilized.
For the move, I shut down the cluster, removed the InfiniBand cables, and the
cluster was moved. I then reinstalled the InfiniBand cables (not in the
same order as before the move) and brought everything back up.
When I brought the cluster back up, IPoIB would not work. The only
message in the log file is "Mar 15 04:04:32 h2o01 kernel: ib0: multicast
join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22".
I think that there may be a rate issue in terms of this node relative
to the IPoIB broadcast group which by default is 10 Gbps (4x SDR).
What does this node's portinfo show (smpquery portinfo -D 0) in terms
of link width and speed ?
-- Hal
The master node can see all the systems:
[root@h2o01 log]# ibnodes
Ca : 0x00066a0098007e99 ports 1 "h2o17 HCA-1"
Ca : 0x00066a0098007e9b ports 1 "h2o18 HCA-1"
Ca : 0x00066a0098007e97 ports 1 "h2o16 HCA-1"
Ca : 0x00066a0098007e8c ports 1 "h2o15 HCA-1"
Ca : 0x00066a0098007e94 ports 1 "h2o14 HCA-1"
Ca : 0x00066a0098007e93 ports 1 "h2o13 HCA-1"
Ca : 0x00066a0098007e8e ports 1 "h2o12 HCA-1"
Ca : 0x00066a0098007e90 ports 1 "h2o11 HCA-1"
Ca : 0x00066a0098007e98 ports 1 "h2o10 HCA-1"
Ca : 0x00066a0098007e95 ports 1 "h2o09 HCA-1"
Ca : 0x00066a0098007e8f ports 1 "h2o08 HCA-1"
Ca : 0x00066a0098007e92 ports 1 "h2o07 HCA-1"
Ca : 0x00066a0098007e8d ports 1 "h2o06 HCA-1"
Ca : 0x00066a0098007e91 ports 1 "h2o05 HCA-1"
Ca : 0x00066a0098007e96 ports 1 "h2ocfs HCA-1"
Ca : 0x00066a0098007e9c ports 1 "h2o01 HCA-1"
Switch : 0x00066a00d8000593 ports 24 "SilverStorm 9024
GUID=0x00066a00d8000593" enhanced port 0 lid 1 lmc 0
I've reset the SM on the switch, but nothing seems to work.
Any ideas of where to look for what's causing the problem?
jeff
</pre>
</blockquote>
</blockquote>
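<br>
For reference, the two component masks in the switch log quoted above can be compared bit by bit to see which fields the join request left out. This is only a sketch in Python. The bit-to-field table is written from memory of the InfiniBand SA MCMemberRecord definition and should be checked against the spec or the OpenSM headers, and the function names are made up for illustration.<br>
<pre wrap="">
```python
# Bit positions of the MCMemberRecord component mask, lowest bit first.
# ASSUMPTION: written from memory of the InfiniBand SA definitions;
# verify against the spec before relying on it.
MCMEMBER_FIELDS = [
    "MGID", "PortGID", "Q_Key", "MLID", "MTUSelector", "MTU", "TClass",
    "P_Key", "RateSelector", "Rate", "PacketLifeTimeSelector",
    "PacketLifeTime", "SL", "FlowLabel", "HopLimit", "Scope",
    "JoinState", "ProxyJoin",
]


def fields_in(mask):
    """Return the set of field names whose bit is set in mask."""
    return {
        name
        for bit, name in enumerate(MCMEMBER_FIELDS)
        if (mask >> bit) % 2 == 1
    }


def missing_fields(supplied, required):
    """Return required field names absent from supplied, in bit order."""
    absent = fields_in(required) - fields_in(supplied)
    return [name for name in MCMEMBER_FIELDS if name in absent]


if __name__ == "__main__":
    # Both masks are copied from the switch log quoted in this thread.
    supplied = 0x0000000000010083
    required = 0x00000000000130C6
    print("supplied:", sorted(fields_in(supplied)))
    print("missing: ", missing_fields(supplied, required))
```
</pre>
If the table is right, the request carried MGID, PortGID, P_Key and JoinState but not Q_Key, TClass, SL or FlowLabel. That would mean the node was asking to join an existing broadcast group, not create one, and the embedded SM had no such group to join.<br>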
</body>
</html>