Fwd: [ofa-general] Diagnostics output messages

Hal Rosenstock hal.rosenstock at gmail.com
Tue Sep 30 08:41:22 PDT 2008


---------- Forwarded message ----------
From: Hal Rosenstock <hal.rosenstock at gmail.com>
Date: Tue, Sep 30, 2008 at 11:39 AM
Subject: Re: [ofa-general] Diagnostics output messages
To: Ramiro Alba Queipo <raq at cttc.upc.edu>


On Tue, Sep 30, 2008 at 10:09 AM, Ramiro Alba Queipo <raq at cttc.upc.edu> wrote:
> On Tue, 2008-09-30 at 08:35 -0400, Hal Rosenstock wrote:
>> On Tue, Sep 30, 2008 at 6:51 AM, Ramiro Alba Queipo <raq at cttc.upc.edu> wrote:
>> > Hello everybody:
>> >
>> > We have just started to run a 22-node InfiniBand cluster (44 in a
>> > couple of months) under Ubuntu 8.04, and after carefully reading and
>> > testing the OFED 1.3.1 diagnostics packages (ibutils and
>> > infiniband-diags), I have got some messages I cannot understand:
>> >
>> > * ibdiagnet -o . -t file.topo -s jff -pm
>> >
>> >
>> > -I---------------------------------------------------
>> > -I- IPoIB Subnets Check
>> > -I---------------------------------------------------
>> > -I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps
>> > SL:0x00
>> > -W- Suboptimal rate for group. Lowest member rate:20Gbps >
>> > group-rate:10Gbps
>> >
>> >
>> > What does it mean?
>>
>> This means your subnet is pure DDR and the IPoIB broadcast group can
>> run at a higher rate than the default. This is done via OpenSM
>> configuration which is slightly different depending on which version
>> you are using.
>>
>
> OpenSM 3.1.11

See the man page on partition configuration for how to fix this. The
partition config file should contain:

Default=0x7fff,ipoib,rate=6:ALL=full;

since a rate of 6 is 20 Gbps
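
For reference, the rate= value follows the InfiniBand rate encoding. Below is a minimal Python sketch of that lookup, assuming the standard codes 2 through 10. The helper names rate_code_for and partition_line are made up for illustration and are not part of OpenSM.

```python
# InfiniBand rate encoding, as used by rate=<n> in the OpenSM partition file.
IB_RATE_GBPS = {
    2: 2.5,
    3: 10,
    4: 30,
    5: 5,
    6: 20,
    7: 40,
    8: 60,
    9: 80,
    10: 120,
}


def rate_code_for(gbps):
    """Return the rate code for a speed in Gbps (hypothetical helper)."""
    for code, speed in IB_RATE_GBPS.items():
        if speed == gbps:
            return code
    raise ValueError(f"no IB rate code for {gbps} Gbps")


def partition_line(pkey, gbps):
    """Build a Default partition line like the one above (hypothetical helper)."""
    return f"Default={pkey:#06x},ipoib,rate={rate_code_for(gbps)}:ALL=full;"


print(partition_line(0x7FFF, 20))  # Default=0x7fff,ipoib,rate=6:ALL=full;
```

The default group rate of 10 Gbps in the ibdiagnet output corresponds to code 3 in the same table.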

>> > * ibchecknet
>> >
>> > #warn: counter RcvSwRelayErrors = 259   (threshold 100) lid 4 port 255
>> > Error check on lid 4 (MT47396 Infiniscale-III Mellanox Technologies)
>> > port all:  FAILED
>> >
>> >
>> > I could see that the command 'perfquery -a 255' shows its counters, but:
>> >
>> >    - What is it for?
>> >    - ibqueryerrors.pl -a says
>> >      RcvSwRelayErrors: This counter can increase due to a valid network
>> >      event
>> >      Should I worry about switch ports slowly increasing this
>> >      counter?
>> >
>> > I am using IPoIB
>>
>> Unfortunately when running IPoIB, RcvSwRelayErrors needs to be ignored
>> as multicasts are counted as looping.
>>
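
If the ibchecknet output is being post-processed, one way to act on this is to drop RcvSwRelayErrors warnings before looking at the rest. A minimal Python sketch follows, assuming warning lines have exactly the "#warn: counter ..." shape quoted above. The second sample line is invented for illustration, and relevant_warnings is a hypothetical helper, not part of infiniband-diags.

```python
import re

# Matches lines such as:
#   #warn: counter RcvSwRelayErrors = 259   (threshold 100) lid 4 port 255
WARN_RE = re.compile(
    r"#warn: counter (?P<counter>\w+) = (?P<value>\d+)\s+"
    r"\(threshold (?P<threshold>\d+)\) lid (?P<lid>\d+) port (?P<port>\d+)"
)

# Counters to skip on an IPoIB fabric (assumption: only this one).
IGNORED = {"RcvSwRelayErrors"}


def relevant_warnings(lines, ignored=IGNORED):
    """Return (counter, value, lid, port) for warnings not in the ignore set."""
    found = []
    for line in lines:
        match = WARN_RE.search(line)
        if match and match.group("counter") not in ignored:
            found.append(
                (
                    match.group("counter"),
                    int(match.group("value")),
                    int(match.group("lid")),
                    int(match.group("port")),
                )
            )
    return found


sample = [
    "#warn: counter RcvSwRelayErrors = 259   (threshold 100) lid 4 port 255",
    # Invented line, only to show a warning that is kept:
    "#warn: counter SymbolErrors = 12   (threshold 10) lid 4 port 3",
]
print(relevant_warnings(sample))  # [('SymbolErrors', 12, 4, 3)]
```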
>> > * ibdiagpath -o . -t file.topo -s jff -n jff201
>> >
>> > -I---------------------------------------------------
>> > -I- QoS on Path Check
>> > -I---------------------------------------------------
>> > -W- VLArbTableLow Entries:6 7 VL > 5 at node:"jff/U1" lid=0x0001
>> >    guid=0x0002c90200279295 dev=25204 port:1
>> > -W- VLArbTableHigh Entries:6 7 VL > 5 at node:"jff/U1" lid=0x0001
>> >    guid=0x0002c90200279295 dev=25204 port:1
>> > -W- VLArbTableLow Entries:6 7 VL > 5 at node:"switch-1/U1" lid=0x0004
>> >    guid=0x000b8cffff0052cf dev=47396 port:1
>> > -W- VLArbTableHigh Entries:6 7 VL > 5 at node:"switch-1/U1" lid=0x0004
>> >    guid=0x000b8cffff0052cf dev=47396 port:1
>> > -W- SLs:6 7 14 15 mapped to VL > 5 at node:"switch-1/U1" lid=0x0004
>> >    guid=0x000b8cffff0052cf dev=47396 in-port:23 out-port:1
>> > -I- The following SLs can be used:0 1 2 3 4 5 8 9 10 11 12 13
>> >
>> > What is the meaning of these messages?
>>
>> I'm not sure but it looks like it's complaining about an invalid VL.
>> Can you run:
>> smpquery portinfo <lid> 1
>> smpquery sl2vl <lid> 1
>> smpquery vlarb <lid> 1
>> for both of these LIDs?
>>
>
> # Port info: Lid 1 port 1
> Mkey:............................0x0000000000000000
> GidPrefix:.......................0xfe80000000000000
> Lid:.............................0x0001
> SMLid:...........................0x0001
> CapMask:.........................0x2510a6a
>                                IsSM
>                                IsTrapSupported
>                                IsAutomaticMigrationSupported
>                                IsSLMappingSupported
>                                IsLedInfoSupported
>                                IsSystemImageGUIDsupported
>                                IsCommunicatonManagementSupported
>                                IsVendorClassSupported
>                                IsCapabilityMaskNoticeSupported
>                                IsClientRegistrationSupported
> DiagCode:........................0x0000
> MkeyLeasePeriod:.................0
> LocalPort:.......................1
> LinkWidthEnabled:................1X or 4X
> LinkWidthSupported:..............1X or 4X
> LinkWidthActive:.................4X
> LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
> LinkState:.......................Active
> PhysLinkState:...................LinkUp
> LinkDownDefState:................Polling
> ProtectBits:.....................0
> LMC:.............................0
> LinkSpeedActive:.................5.0 Gbps
> LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
> NeighborMTU:.....................2048
> SMSL:............................0
> VLCap:...........................VL0-3
> InitType:........................0x00
> VLHighLimit:.....................0
> VLArbHighCap:....................8
> VLArbLowCap:.....................8
> InitReply:.......................0x00
> MtuCap:..........................2048
> VLStallCount:....................7
> HoqLife:.........................31
> OperVLs:.........................VL0-3
> PartEnforceInb:..................0
> PartEnforceOutb:.................0
> FilterRawInb:....................0
> FilterRawOutb:...................0
> MkeyViolations:..................0
> PkeyViolations:..................0
> QkeyViolations:..................0
> GuidCap:.........................32
> ClientReregister:................0
> SubnetTimeout:...................18
> RespTimeVal:.....................16
> LocalPhysErr:....................8
> OverrunErr:......................8
> MaxCreditHint:...................0
> RoundTrip:.......................0
>
> # SL2VL table: Lid 1
> #                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
> ports: in  0, out  0: | 3| 2| 1| 0| 3| 2| 1| 0| 3| 2| 1| 0| 3| 2| 1| 0|
>
> # VLArbitration tables: Lid 1 port 1 LowCap 8 HighCap 8
> # Low priority VL Arbitration Table:
> VL    : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
> WEIGHT: |0x1 |0x1 |0x1 |0x1 |0x1 |0x1 |0x1 |0x1 |
> # High priority VL Arbitration Table:
> VL    : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
> WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
>
>
> # Port info: Lid 4 port 1
> Mkey:............................0x0000000000000000
> GidPrefix:.......................0x0000000000000000
> Lid:.............................0x0000
> SMLid:...........................0x0000
> CapMask:.........................0x0
> DiagCode:........................0x0000
> MkeyLeasePeriod:.................0
> LocalPort:.......................23
> LinkWidthEnabled:................1X or 4X
> LinkWidthSupported:..............1X or 4X
> LinkWidthActive:.................4X
> LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
> LinkState:.......................Active
> PhysLinkState:...................LinkUp
> LinkDownDefState:................Polling
> ProtectBits:.....................0
> LMC:.............................0
> LinkSpeedActive:.................5.0 Gbps
> LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
> NeighborMTU:.....................2048
> SMSL:............................0
> VLCap:...........................VL0-7
> InitType:........................0x00
> VLHighLimit:.....................0
> VLArbHighCap:....................8
> VLArbLowCap:.....................8
> InitReply:.......................0x00
> MtuCap:..........................2048
> VLStallCount:....................7
> HoqLife:.........................16
> OperVLs:.........................VL0-3
> PartEnforceInb:..................1
> PartEnforceOutb:.................1
> FilterRawInb:....................0
> FilterRawOutb:...................0
> MkeyViolations:..................0
> PkeyViolations:..................0
> QkeyViolations:..................0
> GuidCap:.........................0
> ClientReregister:................0
> SubnetTimeout:...................0
> RespTimeVal:.....................0
> LocalPhysErr:....................8
> OverrunErr:......................8
> MaxCreditHint:...................0
> RoundTrip:.......................0
>
> # SL2VL table: Lid 4
> #                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
> ports: in  0, out  1: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
> ports: in  1, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in  2, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in  3, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in  4, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in  5, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in  6, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in  7, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in  8, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in  9, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in 10, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in 11, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in 12, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in 13, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in 14, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in 15, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in 16, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in 17, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in 18, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in 19, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in 20, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in 21, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in 22, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in 23, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ports: in 24, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
>
> # VLArbitration tables: Lid 4 port 1 LowCap 8 HighCap 8
> # Low priority VL Arbitration Table:
> VL    : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
> WEIGHT: |0x1 |0x1 |0x1 |0x1 |0x1 |0x1 |0x1 |0x1 |
> # High priority VL Arbitration Table:
> VL    : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
> WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |

I see what it's complaining about:

ibdiag/src/ibdebug_if.tcl has the following snippet of code:
     "-W-ibdiagpath:qos.vlaOverOpVLs" {
         foreach {name port entries opVLs HL} $args {break}
         set lastVL [expr $opVLs - 1]
         if {$lastVL == 15} {set lastVL 14}
         append msgText "VLArbTable$HL Entries:$entries VL > $lastVL at node:$name port:$port"
     }

There's a similar snippet for the low arb table.

If I'm reading this right, those code snippets look wrong to me since
it is valid to have the same VL entry in there more than once. The
limit which can't be exceeded is the VLArbHigh/LowCap.
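
To make the point about the cap concrete, here is a minimal Python sketch. It is not the ibdiag logic, and check_vlarb is a hypothetical helper. It only flags a table whose entry count exceeds the cap and reports repeated VLs separately, without treating them as an error.

```python
from collections import Counter


def check_vlarb(entries, cap):
    """Check a VL arbitration table against its cap.

    entries: list of (vl, weight) pairs; cap: VLArbHighCap or VLArbLowCap.
    Returns (problems, repeated_vls). Repeated VLs are reported, not flagged.
    """
    problems = []
    if len(entries) > cap:
        problems.append(f"{len(entries)} entries exceed cap {cap}")
    counts = Counter(vl for vl, _weight in entries)
    repeated = sorted(vl for vl, n in counts.items() if n > 1)
    return problems, repeated


# Low priority table of Lid 4 port 1 from the smpquery output above (LowCap 8).
low_table = [(vl, 1) for vl in range(8)]
print(check_vlarb(low_table, 8))  # ([], [])

# A table that lists VL 0 twice is still within the cap.
print(check_vlarb([(0, 1), (1, 1), (0, 1)], 8))  # ([], [0])
```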

In terms of the SL mapping,
# SL2VL table: Lid 4
#                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in  1, out  1: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|

I think it's complaining about SLs 5-7 being mapped to non-operational
VLs. That is also valid, but it means those SLs would be dropped, and I'm
not sure whether that is what is intended.
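
A quick way to see which SLs land above a given VL is to walk the SL2VL row directly. This is a minimal Python sketch using the "in 1, out 1" row quoted above; sls_above is a hypothetical helper. The first call uses the "VL > 5" threshold from the ibdiagpath warning, and the second uses the OperVLs value of VL0-3 reported by smpquery for this port.

```python
# SL2VL row for Lid 4, in port 1, out port 1, indexed by SL (0..15).
SL2VL_IN1_OUT1 = [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7]


def sls_above(sl2vl, highest_vl):
    """Return the SLs whose mapped VL is greater than highest_vl."""
    return [sl for sl, vl in enumerate(sl2vl) if vl > highest_vl]


print(sls_above(SL2VL_IN1_OUT1, 5))  # [6, 7, 14, 15]
print(sls_above(SL2VL_IN1_OUT1, 3))  # [4, 5, 6, 7, 12, 13, 14, 15]
```

The first result matches the SL list in the warning, and its complement matches the "SLs can be used" line.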

-- Hal

>
>> -- Hal
>>
>> > Finally, and not related to the diagnostics messages, I have to change
>> > the permissions on
>> >
>> > crw-rw---- 1 root rdma 231, 192 2008-09-30 09:19 /dev/infiniband/uverbs0
>> >
>> > to be 'rw' for everybody.
>> >
>> > Should I add users to 'rdma' group instead?
>> >
>> >
>> > ---
>> > Thanks in advance
>> >
>> > Regards
>> >
>> >
>> > --
>> > This message has been scanned by MailScanner
>> > for viruses and other dangerous content,
>> > and is considered clean.
>> > For all your IT requirements visit: http://www.transtec.co.uk
>> >
>> >
>>
>
>


