<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=UTF-8" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

I'm currently using the version released with CentOS 4.x, which
shows as "infiniband-diags-1.3.6-1.el4". I found this in the syslog:<br>

<br>

Mar 18 13:47:43 h2o01 saquery[21966]: Unable to Open IBT

Device[/dev/SysIbt]<br>

<br>

Now I just need to figure out why the device entry doesn't exist.<br>

<br>

With the firmware update, the errors below have disappeared and
IPoIB is now working.<br>

<br>

jeff<br>

<br>

Hal Rosenstock wrote:

<blockquote

 cite="mid:f0e08f230903181402l79e5e76bm96b59c1f7e49cf2b@mail.gmail.com"

 type="cite">

  <pre wrap="">On Wed, Mar 18, 2009 at 10:37 AM, jeffrey Lang <a class="moz-txt-link-rfc2396E" href="mailto:jrlang@uwyo.edu">&lt;jrlang@uwyo.edu&gt;</a> wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">Here's the output for ibchecknet:

[root@h2o01 ~]# ibchecknet

perfquery: iberror: failed: smp query nodeinfo: Node type not CA

    </pre>

  </blockquote>

  <pre wrap=""><!---->

What diags version is being used ?

  </pre>

  <blockquote type="cite">

    <pre wrap="">Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port all:

FAILED

#warn: counter SymbolErrors = 43259     (threshold 10) lid 1 port 17

Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port 17:

FAILED

#warn: counter LinkRecovers = 207       (threshold 10) lid 1 port 2

#warn: counter RcvErrors = 112  (threshold 10) lid 1 port 2

Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port 2:

FAILED

#warn: counter LinkDowned = 10  (threshold 10) lid 1 port 1

#warn: counter RcvErrors = 95   (threshold 10) lid 1 port 1

Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port 1:

FAILED

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Are the counts for these ports (1, 2, 17) changing? You can look with
perfquery 1 &lt;port #&gt;.

This is a separate issue from the lack of an IPoIB broadcast group,
assuming these numbers are incrementing.

-- Hal
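
One way to answer that question is to capture the ibchecknet warnings
twice, a few minutes apart, and compare the counters. The Python sketch
below is only an illustration: it assumes the "#warn: counter ..." line
format shown in the quoted output, and the function names and the second
snapshot's values are made up for the example.

```python
import re

# Matches warning lines in the format shown in the quoted ibchecknet output.
WARN_RE = re.compile(
    r"#warn: counter (\w+) = (\d+)\s+\(threshold \d+\) lid (\d+) port (\d+)"
)


def parse_warnings(text):
    """Return {(lid, port, counter_name): value} for every #warn line."""
    counters = {}
    for match in WARN_RE.finditer(text):
        name, value, lid, port = match.groups()
        counters[(int(lid), int(port), name)] = int(value)
    return counters


def incrementing(before, after):
    """Return {key: delta} for counters whose value grew between snapshots."""
    deltas = {}
    for key, new_value in after.items():
        delta = new_value - before.get(key, 0)
        if delta > 0:
            deltas[key] = delta
    return deltas


# First snapshot: values copied from the quoted output.
first = """
#warn: counter SymbolErrors = 43259     (threshold 10) lid 1 port 17
#warn: counter LinkRecovers = 207       (threshold 10) lid 1 port 2
#warn: counter RcvErrors = 112  (threshold 10) lid 1 port 2
"""
# Second snapshot: hypothetical values, invented for this example.
second = """
#warn: counter SymbolErrors = 43300     (threshold 10) lid 1 port 17
#warn: counter LinkRecovers = 207       (threshold 10) lid 1 port 2
#warn: counter RcvErrors = 112  (threshold 10) lid 1 port 2
"""

changed = incrementing(parse_warnings(first), parse_warnings(second))
for (lid, port, name), delta in sorted(changed.items()):
    print(f"lid {lid} port {port}: {name} +{delta}")
```

A counter that stays flat between snapshots is likely left over from
an earlier event rather than a sign of a link that is still failing.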

  </pre>

  <blockquote type="cite">

    <pre wrap=""># Checking Ca: nodeguid 0x00066a0098007e99

# Checking Ca: nodeguid 0x00066a0098007e9b

# Checking Ca: nodeguid 0x00066a0098007e97

# Checking Ca: nodeguid 0x00066a0098007e8c

# Checking Ca: nodeguid 0x00066a0098007e94

# Checking Ca: nodeguid 0x00066a0098007e93

# Checking Ca: nodeguid 0x00066a0098007e8e

# Checking Ca: nodeguid 0x00066a0098007e90

# Checking Ca: nodeguid 0x00066a0098007e98

# Checking Ca: nodeguid 0x00066a0098007e95

# Checking Ca: nodeguid 0x00066a0098007e8f

# Checking Ca: nodeguid 0x00066a0098007e92

# Checking Ca: nodeguid 0x00066a0098007e8d

# Checking Ca: nodeguid 0x00066a0098007e91

# Checking Ca: nodeguid 0x00066a0098007e96

# Checking Ca: nodeguid 0x00066a0098007e9c

## Summary: 17 nodes checked, 0 bad nodes found

##          32 ports checked, 0 bad ports found

##          3 ports have errors beyond threshold

I see these messages in the switch log now:

E|2009/03/18 07:34:28.635S: Thread "esm_sar" (0x83394a90)

        ESM: Embedded SM Error: sa_McMemberRecord_Set: Component mask of

0x0000000000010083 does not have bits required to create a group

(0x00000000000130C6) for new MGID of 0xFF12401BFFFF0000:00000000FFFFFFFF for

request from h2o12 HCA-1, Port 0x00066A00A0007E8E, LID 0x000C, returning

status 0x0600 : 0

I assume this is my problem, but how do I fix it?

jeff
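
The two masks in the switch log entry above can be compared bit by bit
to see which fields the embedded SM wanted for group creation. The
sketch below is illustrative only: the bit-to-field table is an
assumption taken from the MCMemberRecord component mask ordering used by
the IB_SA_MCMEMBER_REC_* definitions in the Linux kernel's rdma/ib_sa.h,
and the helper names are made up.

```python
# Assumed MCMemberRecord component-mask bit order (bit 0 first), following
# the IB_SA_MCMEMBER_REC_* definitions in the Linux kernel's rdma/ib_sa.h.
FIELDS = [
    "MGID", "PortGID", "Q_Key", "MLID", "MTUSelector", "MTU",
    "TClass", "P_Key", "RateSelector", "Rate", "PacketLifeTimeSelector",
    "PacketLifeTime", "SL", "FlowLabel", "HopLimit", "Scope",
    "JoinState", "ProxyJoin",
]


def decode(mask):
    """Return the names of the fields whose bit is set in mask."""
    # (mask >> bit) % 2 is 1 exactly when that bit is set.
    return [name for bit, name in enumerate(FIELDS) if (mask >> bit) % 2 == 1]


def missing(required, provided):
    """Return fields set in required but not in provided."""
    have = set(decode(provided))
    return [name for name in decode(required) if name not in have]


provided = 0x0000000000010083  # mask sent in the request (from the log)
required = 0x00000000000130C6  # mask needed to create a group (from the log)

print("request carries:", decode(provided))
print("missing for create:", missing(required, provided))
```

Under that assumed table, the request carries only MGID, PortGID, P_Key
and JoinState, which would make it a plain join of a group the SM
expects to exist already rather than a request to create one.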

Hal Rosenstock wrote:

On Tue, Mar 17, 2009 at 6:04 PM, jeffrey Lang <a class="moz-txt-link-rfc2396E" href="mailto:jrlang@uwyo.edu">&lt;jrlang@uwyo.edu&gt;</a> wrote:

Here's the output of smpquery portinfo -D 0, as requested below:

[root@h2o01 ~]# smpquery portinfo -D 0

# Port info: DR path 0 port 0

Mkey:............................0x0000000000000000

GidPrefix:.......................0xfe80000000000000

Lid:.............................0x0003

SMLid:...........................0x0001

CapMask:.........................0x2510a68

                IsTrapSupported

                IsAutomaticMigrationSupported

                IsSLMappingSupported

                IsLedInfoSupported

                IsSystemImageGUIDsupported

                IsCommunicatonManagementSupported

                IsVendorClassSupported

                IsCapabilityMaskNoticeSupported

                IsClientRegistrationSupported

DiagCode:........................0x0000

MkeyLeasePeriod:.................0

LocalPort:.......................1

LinkWidthEnabled:................1X or 4X

LinkWidthSupported:..............1X or 4X

LinkWidthActive:.................4X

LinkSpeedSupported:..............2.5 Gbps

LinkState:.......................Active

PhysLinkState:...................LinkUp

LinkDownDefState:................Polling

ProtectBits:.....................0

LMC:.............................0

LinkSpeedActive:.................2.5 Gbps

LinkSpeedEnabled:................2.5 Gbps

NeighborMTU:.....................2048

SMSL:............................0

VLCap:...........................VL0-3

InitType:........................0x00

VLHighLimit:.....................0

VLArbHighCap:....................8

VLArbLowCap:.....................8

InitReply:.......................0x00

MtuCap:..........................2048

VLStallCount:....................7

HoqLife:.........................0

OperVLs:.........................VL0-3

PartEnforceInb:..................0

PartEnforceOutb:.................0

FilterRawInb:....................0

FilterRawOutb:...................0

MkeyViolations:..................0

PkeyViolations:..................0

QkeyViolations:..................0

GuidCap:.........................32

ClientReregister:................0

SubnetTimeout:...................17

RespTimeVal:.....................16

LocalPhysErr:....................15

OverrunErr:......................15

MaxCreditHint:...................0

RoundTrip:.......................0

Looks fine.
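
As a rough cross-check of the rate question, the active width and speed
from the portinfo output above can be multiplied and compared with the
10 Gbps (4x SDR) default for the IPoIB broadcast group mentioned in this
thread. This is a minimal sketch that assumes the "Name:....value"
layout shown above; the function names are made up for illustration.

```python
import re

# Two fields copied from the smpquery portinfo output above.
PORTINFO = """
LinkWidthActive:.................4X
LinkSpeedActive:.................2.5 Gbps
"""

BROADCAST_GROUP_GBPS = 10.0  # default named in the thread (4x SDR)


def parse_portinfo(text):
    """Parse 'Name:....value' lines into a {name: value} dict."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+):\.+(.*)$", line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def active_rate_gbps(fields):
    """Signalling rate = lane count x per-lane speed."""
    lanes = int(fields["LinkWidthActive"].rstrip("Xx"))
    per_lane = float(fields["LinkSpeedActive"].split()[0])
    return lanes * per_lane


rate = active_rate_gbps(parse_portinfo(PORTINFO))
print(f"active rate: {rate} Gbps")
print("meets broadcast group rate:", rate >= BROADCAST_GROUP_GBPS)
```

With 4X active width at 2.5 Gbps per lane, the port matches the group
rate, which is consistent with the "Looks fine" verdict.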

I did some checking, and it's not just this node having problems; all nodes
seem to be having this same problem.

Would you also run ibchecknet ?

What error messages are on the SM side ?

-- Hal

jeff

Hal Rosenstock wrote:

2009/3/17 jeffrey Lang <a class="moz-txt-link-rfc2396E" href="mailto:jrlang@uwyo.edu">&lt;jrlang@uwyo.edu&gt;</a>:

First let me say, I hope this is the right list for this email; if not,
please forgive me.

I have a small 16-node compute cluster. The university where I work
recently opened a new datacenter, and my cluster was moved from the old
one. Before the move, InfiniBand was working properly; after
the move, IPoIB stopped working.

The cluster runs CentOS 4 with all the latest updates and the
CentOS-distributed OFED code. My plan was to update the OFED code once
things had restabilized.

For the move, I shut down the cluster and removed the InfiniBand cables, and
the cluster was moved. I then reinstalled the InfiniBand cables (not in the
same order as before the move) and brought everything back up.

When I brought the cluster back up, IPoIB would not work. The only
message in the log file is "Mar 15 04:04:32 h2o01 kernel: ib0: multicast
join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22".

I think that there may be a rate issue in terms of this node relative

to the IPoIB broadcast group which by default is 10 Gbps (4x SDR).

What does this node's portinfo show (smpquery portinfo -D 0) in terms

of link width and speed ?

-- Hal

The master node can see all the systems:

[root@h2o01 log]# ibnodes

Ca    : 0x00066a0098007e99 ports 1 "h2o17 HCA-1"

Ca    : 0x00066a0098007e9b ports 1 "h2o18 HCA-1"

Ca    : 0x00066a0098007e97 ports 1 "h2o16 HCA-1"

Ca    : 0x00066a0098007e8c ports 1 "h2o15 HCA-1"

Ca    : 0x00066a0098007e94 ports 1 "h2o14 HCA-1"

Ca    : 0x00066a0098007e93 ports 1 "h2o13 HCA-1"

Ca    : 0x00066a0098007e8e ports 1 "h2o12 HCA-1"

Ca    : 0x00066a0098007e90 ports 1 "h2o11 HCA-1"

Ca    : 0x00066a0098007e98 ports 1 "h2o10 HCA-1"

Ca    : 0x00066a0098007e95 ports 1 "h2o09 HCA-1"

Ca    : 0x00066a0098007e8f ports 1 "h2o08 HCA-1"

Ca    : 0x00066a0098007e92 ports 1 "h2o07 HCA-1"

Ca    : 0x00066a0098007e8d ports 1 "h2o06 HCA-1"

Ca    : 0x00066a0098007e91 ports 1 "h2o05 HCA-1"

Ca    : 0x00066a0098007e96 ports 1 "h2ocfs HCA-1"

Ca    : 0x00066a0098007e9c ports 1 "h2o01 HCA-1"

Switch    : 0x00066a00d8000593 ports 24 "SilverStorm 9024

GUID=0x00066a00d8000593" enhanced port 0 lid 1 lmc 0

I've reset the SM on the switch, but nothing seems to work.

Any ideas on where to look for what's causing the problem?

jeff


    </pre>

  </blockquote>

</blockquote>

</body>

</html>