[ofa-general] Diagnostics output messages

Ramiro Alba Queipo raq at cttc.upc.edu
Tue Sep 30 03:51:32 PDT 2008


Hello everybody:

We have just started running a 22-node InfiniBand cluster (44 nodes in a
couple of months) under Ubuntu 8.04. After carefully reading about and
testing the OFED 1.3.1 diagnostics packages (ibutils and
infiniband-diags), I got some messages I cannot understand:

* ibdiagnet -o . -t file.topo -s jff -pm


-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:20Gbps > group-rate:10Gbps


What does this warning mean?

* ibchecknet 

#warn: counter RcvSwRelayErrors = 259   (threshold 100) lid 4 port 255
Error check on lid 4 (MT47396 Infiniscale-III Mellanox Technologies) port all:  FAILED


I can see that the command 'perfquery -a 255' shows these counters, but:

    - What is this counter for?
    - ibqueryerrors.pl -a says
      "RcvSwRelayErrors: This counter can increase due to a valid network
      event".
      Should I worry about switch ports increasing this counter little by
      little?

I am using IPoIB
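
In case it helps, this is roughly how I pull the values out of the
ibchecknet warnings so I can watch how the counter grows. It is only a
sketch in Python: the regular expression is my own guess based on the
single '#warn' line above, and parse_warn is a name I made up, not part
of infiniband-diags.

```python
import re

# Pattern guessed from the one '#warn' line shown above; other
# ibchecknet versions may format the line differently.
WARN_RE = re.compile(
    r"#warn: counter (?P<name>\w+) = (?P<value>\d+)\s+"
    r"\(threshold (?P<threshold>\d+)\) lid (?P<lid>\d+) port (?P<port>\d+)"
)

def parse_warn(line):
    """Return the fields of a '#warn: counter ...' line as a dict, or None."""
    m = WARN_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Everything except the counter name is numeric.
    for key in ("value", "threshold", "lid", "port"):
        fields[key] = int(fields[key])
    return fields

line = "#warn: counter RcvSwRelayErrors = 259   (threshold 100) lid 4 port 255"
info = parse_warn(line)
# Print the counter name and how far it is above the threshold.
print(info["name"], info["value"] - info["threshold"])
```

Running this at intervals and comparing the values would at least tell
me how fast the counter is growing.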

* ibdiagpath -o . -t file.topo -s jff -n jff201

-I---------------------------------------------------
-I- QoS on Path Check
-I---------------------------------------------------
-W- VLArbTableLow Entries:6 7 VL > 5 at node:"jff/U1" lid=0x0001
    guid=0x0002c90200279295 dev=25204 port:1
-W- VLArbTableHigh Entries:6 7 VL > 5 at node:"jff/U1" lid=0x0001
    guid=0x0002c90200279295 dev=25204 port:1
-W- VLArbTableLow Entries:6 7 VL > 5 at node:"switch-1/U1" lid=0x0004
    guid=0x000b8cffff0052cf dev=47396 port:1
-W- VLArbTableHigh Entries:6 7 VL > 5 at node:"switch-1/U1" lid=0x0004
    guid=0x000b8cffff0052cf dev=47396 port:1
-W- SLs:6 7 14 15 mapped to VL > 5 at node:"switch-1/U1" lid=0x0004
    guid=0x000b8cffff0052cf dev=47396 in-port:23 out-port:1
-I- The following SLs can be used:0 1 2 3 4 5 8 9 10 11 12 13

What is the meaning of these messages?
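
My only guess about the last line is that it is simply the 16 possible
SLs (0 to 15) minus the four that the warning reports as mapped to
VL > 5. A tiny Python sketch of that reading, which is an assumption on
my part and not something I found in the ibdiagpath documentation:

```python
# SLs flagged by the "-W- SLs:6 7 14 15 mapped to VL > 5" line above.
flagged = {6, 7, 14, 15}

# InfiniBand defines 16 service levels, numbered 0..15.
all_sls = range(16)

# Assumed rule: an SL is usable if it was not flagged.
usable = [sl for sl in all_sls if sl not in flagged]
print(" ".join(str(sl) for sl in usable))
```

This gives the same list as the "-I- The following SLs can be used"
line, but it does not tell me why VLs above 5 are a problem.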


Finally, and unrelated to the diagnostics messages, I had to change the
permissions of

crw-rw---- 1 root rdma 231, 192 2008-09-30 09:19 /dev/infiniband/uverbs0

to be 'rw' for everybody.

Should I add the users to the 'rdma' group instead?


---
Thanks in advance

Regards

