[openib-general] cable test/error count utilities?

Hal Rosenstock halr at voltaire.com
Fri Jun 3 03:14:34 PDT 2005


On Thu, 2005-06-02 at 20:25, Troy Benjegerdes wrote:
> Some of my problems seem to be from intermittent cables.. 
> 
> Is there anything for OpenIB that can read error counters?

Aside from reading these from the driver via
/sys/class/infiniband/mthca0/ports/1/counters/, there is also perfquery,
which displays the port counters (these include the error counters):

Usage: perfquery [-d(ebug) -G(uid_addr) -a(ll_ports) -r(reset_after_read) -C ca_name -P hca_port -R(eset_only) -t timeout_ms -V(ersion) -h(elp)] [<lid|guid> [[port] [reset_mask]]]
        Examples:
                perfquery               # read local port's performance counters
                perfquery 32 1          # read performance counters from lid 32, port 1
                perfquery -a 32         # read performance counters from lid 32, all ports
                perfquery -r 32 1       # read performance counters and reset
                perfquery -R 32 1       # reset performance counters of port 1 only
                perfquery -R -a 32      # reset performance counters of all ports
                perfquery -R 32 2 0xf000        # reset only non-error counters of port 2

perfquery 2 1
# Port counters: Lid 0x2 port 1
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrors:....................1506
LinkRecovers:....................255
LinkDowned:......................1
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................0
XmtDiscards:.....................0
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtBytes:........................2612
RcvBytes:........................2160
XmtPkts:.........................36
RcvPkts:.........................30
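
For scripting against this output, the "Name:....value" lines can be
turned into a dictionary. The following is only a minimal sketch in
Python; the function name parse_port_counters is made up for
illustration, and it assumes the output format stays as shown above
(a name, a colon, a run of dots, then a decimal or 0x-prefixed value).

```python
import re

# Matches lines such as "SymbolErrors:....................1506".
# Header lines starting with "#" do not match and are skipped.
_COUNTER_LINE = re.compile(r"^(\w+):\.*(0x[0-9a-fA-F]+|\d+)\s*$")


def parse_port_counters(text):
    """Return {counter_name: int} for each 'Name:....value' line in text.

    Hypothetical helper: assumes the perfquery output layout shown above.
    """
    counters = {}
    for line in text.splitlines():
        match = _COUNTER_LINE.match(line.strip())
        if not match:
            continue
        name, value = match.groups()
        # Values such as CounterSelect are printed in hex with a 0x prefix.
        if value.lower().startswith("0x"):
            counters[name] = int(value, 16)
        else:
            counters[name] = int(value)
    return counters
```

The text could come from running "perfquery 2 1" through
subprocess.run(..., capture_output=True, text=True) and passing its
stdout to the function.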

> What I'd really like to see is something that I can integrate with
> nagios ( http://www.nagios.org/about ) 

Nagios says it runs external plugins, so it should be possible to write
one for this. By polling the counters at some interval, the plugin could
cause contact notifications to be issued based on some algorithm for
deciding when that is appropriate (e.g. the error counters on a link are
increasing, so that link is suspect and its cable might be intermittent).
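
As a rough sketch of what the decision part of such a plugin might look
like, assuming two counter samples are already available as
dictionaries (one from the previous poll, one from the current poll):
the exit codes 0/1/2/3 are the usual Nagios plugin states, but the
counters watched, the thresholds, and the function name evaluate are
arbitrary choices for illustration, not a tested policy.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical selection of counters to watch; the names follow the
# perfquery output earlier in this message.
WATCHED = ("SymbolErrors", "LinkRecovers", "LinkDowned")


def evaluate(previous, current, warn=1, crit=100):
    """Compare two counter samples and return (exit_code, message).

    warn and crit are assumed thresholds on the largest increase seen
    in any watched counter between the two polls.
    """
    deltas = {}
    for name in WATCHED:
        if name not in previous or name not in current:
            return UNKNOWN, "UNKNOWN - counter %s missing" % name
        delta = current[name] - previous[name]
        if delta < 0:
            # The counter was reset between polls; count from zero.
            delta = current[name]
        deltas[name] = delta

    summary = ", ".join("%s +%d" % (n, deltas[n]) for n in WATCHED)
    worst = max(deltas.values())
    if worst >= crit:
        return CRITICAL, "CRITICAL - " + summary
    if worst >= warn:
        return WARNING, "WARNING - " + summary
    return OK, "OK - " + summary
```

A real plugin would print the message and call sys.exit() with the
returned code. It would also need to store the previous sample between
runs, and probably reset the counters now and then (perfquery -r),
since a counter that has reached its maximum value no longer shows
increases.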

-- Hal



