Re: [ofa-general] Diagnostics output messages

Hal Rosenstock hal.rosenstock at gmail.com
Tue Sep 30 05:35:20 PDT 2008


On Tue, Sep 30, 2008 at 6:51 AM, Ramiro Alba Queipo <raq at cttc.upc.edu> wrote:
> Hello everybody:
>
> We have just started to run a 22-node InfiniBand cluster (44 in a couple
> of months) under Ubuntu 8.04, and after carefully reading about and testing
> the OFED 1.3.1 diagnostics packages (ibutils and infiniband-diags), I have
> got some messages I cannot understand:
>
> * ibdiagnet -o . -t file.topo -s jff -pm
>
>
> -I---------------------------------------------------
> -I- IPoIB Subnets Check
> -I---------------------------------------------------
> -I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
> -W- Suboptimal rate for group. Lowest member rate:20Gbps > group-rate:10Gbps
>
>
> What does it mean?

This means your subnet is pure DDR (every member can run at 20Gbps), so
the IPoIB broadcast group can run at a higher rate than the 10Gbps
default. The group rate is set in the OpenSM configuration, and the
exact syntax differs slightly depending on which OpenSM version you are
using.
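
As a rough sketch of what that configuration change could look like: OpenSM takes the group rate as an InfiniBand rate code rather than as a Gbps figure, so the helper below (ib_rate_code is a made-up name for illustration) maps a link speed to its code and prints a candidate partition line for the 0x7fff partition reported above. This assumes an OpenSM version that reads a partitions configuration file and accepts the rate= flag there; the file location and the exact flags supported vary between releases, so check the partition configuration notes shipped with your version.

```shell
#!/bin/sh
# Map a link rate in Gbps to the InfiniBand rate encoding.
# ib_rate_code is a hypothetical helper name, not part of OFED.
ib_rate_code() {
  case "$1" in
    2.5) echo 2 ;;
    10)  echo 3 ;;
    30)  echo 4 ;;
    5)   echo 5 ;;
    20)  echo 6 ;;
    40)  echo 7 ;;
    60)  echo 8 ;;
    80)  echo 9 ;;
    120) echo 10 ;;
    *)   echo "unknown rate: $1" >&2; return 1 ;;
  esac
}

# Print a candidate line for the OpenSM partitions file (assumed syntax):
# default partition 0x7fff, IPoIB enabled, group rate raised to 20Gbps.
echo "Default=0x7fff, ipoib, rate=$(ib_rate_code 20) : ALL=full;"
```

OpenSM would need to be restarted after such a change so the broadcast group is recreated with the new rate.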

> * ibchecknet
>
> #warn: counter RcvSwRelayErrors = 259   (threshold 100) lid 4 port 255
> Error check on lid 4 (MT47396 Infiniscale-III Mellanox Technologies) port all:  FAILED
>
>
> I could see that the command 'perfquery -a 255' shows its counters, but:
>
>    - What is it for?
>    - ibqueryerrors.pl -a says
>      RcvSwRelayErrors: This counter can increase due to a valid network event
>      Should I worry about switch ports slowly increasing this counter?
>
> I am using IPoIB

Unfortunately, when running IPoIB, RcvSwRelayErrors has to be ignored
because multicasts are counted as looping.

> * ibdiagpath -o . -t file.topo -s jff -n jff201
>
> -I---------------------------------------------------
> -I- QoS on Path Check
> -I---------------------------------------------------
> -W- VLArbTableLow Entries:6 7 VL > 5 at node:"jff/U1" lid=0x0001
>    guid=0x0002c90200279295 dev=25204 port:1
> -W- VLArbTableHigh Entries:6 7 VL > 5 at node:"jff/U1" lid=0x0001
>    guid=0x0002c90200279295 dev=25204 port:1
> -W- VLArbTableLow Entries:6 7 VL > 5 at node:"switch-1/U1" lid=0x0004
>    guid=0x000b8cffff0052cf dev=47396 port:1
> -W- VLArbTableHigh Entries:6 7 VL > 5 at node:"switch-1/U1" lid=0x0004
>    guid=0x000b8cffff0052cf dev=47396 port:1
> -W- SLs:6 7 14 15 mapped to VL > 5 at node:"switch-1/U1" lid=0x0004
>    guid=0x000b8cffff0052cf dev=47396 in-port:23 out-port:1
> -I- The following SLs can be used:0 1 2 3 4 5 8 9 10 11 12 13
>
> What is the meaning of these messages?

I'm not sure, but it looks like it's complaining about an invalid VL.
Can you run:
smpquery portinfo <lid> 1
smpquery sl2vl <lid> 1
smpquery vlarb <lid> 1
for both of these LIDs?
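
A small sketch of running those three queries for both nodes in one go, assuming the LIDs from the warnings above (0x0001 for jff/U1 and 0x0004 for switch-1/U1) and port 1 on each. print_vl_queries is a made-up helper name; it only prints the command lines so they can be reviewed first.

```shell
#!/bin/sh
# Print the three smpquery commands for port 1 of each LID given.
# print_vl_queries is a hypothetical helper name, not part of infiniband-diags.
print_vl_queries() {
  for lid in "$@"; do
    for op in portinfo sl2vl vlarb; do
      echo "smpquery $op $lid 1"
    done
  done
}

# LIDs 1 and 4 are the two nodes named in the ibdiagpath warnings.
print_vl_queries 1 4
```

Piping the output to sh on a node with infiniband-diags installed runs the queries themselves.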

-- Hal

> Finally, and not related to diagnostics messages, I have to change
> permissions at
>
> crw-rw---- 1 root rdma 231, 192 2008-09-30 09:19 /dev/infiniband/uverbs0
>
> to be 'rw' for everybody.
>
> Should I add users to 'rdma' group instead?
>
>
> ---
> Thanks in advance
>
> Regards
>
>