[ewg] PortRcvErrors in back-to-back IB connections

Hal Rosenstock hal at dev.mellanox.co.il
Fri Aug 16 07:06:54 PDT 2013


On 8/14/2013 5:30 AM, pavan.tc wrote:
> Hi,
> 
> In some of our back-to-back IB setups, we experience high latencies (and
> lowered bandwidth) in random test cases. When this happens, we also see
> PortRcvErrors (seen via perfquery).
> We are using Mellanox OFED 1.5.3. The hardware/firmware details of the
> IB cards used are as below:
> 
> [root@vsanqa7 ~]# ibstat
> CA 'mlx4_0'
>         CA type: MT4099
>         Number of ports: 1
>         Firmware version: 2.10.700
>         Hardware version: 0
>         Node GUID: 0x002590ffff481618
>         System image GUID: 0x002590ffff48161b
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 40
>                 Base lid: 2
>                 LMC: 0
>                 SM lid: 1
>                 Capability mask: 0x0251486a
>                 Port GUID: 0x002590ffff481619
>                 Link layer: InfiniBand
> 
> [root@vsanqa8 ~]# ibstat
> CA 'mlx4_0'
>         CA type: MT4099
>         Number of ports: 1
>         Firmware version: 2.10.700
>         Hardware version: 0
>         Node GUID: 0x002590ffff481614
>         System image GUID: 0x002590ffff481617
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 40
>                 Base lid: 1
>                 LMC: 0
>                 SM lid: 1
>                 Capability mask: 0x0251486a
>                 Port GUID: 0x002590ffff481615
>                 Link layer: InfiniBand
> 
> 
> perfquery output before ib_send_bw test:
> 
> # Port counters: Lid 2 port 1
> PortSelect:......................1
> CounterSelect:...................0x1400
> SymbolErrorCounter:..............15814
> LinkErrorRecoveryCounter:........255
> LinkDownedCounter:...............0
> PortRcvErrors:...................5403
> PortRcvRemotePhysicalErrors:.....0
> PortRcvSwitchRelayErrors:........0
> PortXmitDiscards:................0
> PortXmitConstraintErrors:........0
> PortRcvConstraintErrors:.........0
> CounterSelect2:..................0x00
> LocalLinkIntegrityErrors:........0
> ExcessiveBufferOverrunErrors:....0
> VL15Dropped:.....................0
> PortXmitData:....................2925583200
> PortRcvData:.....................145715607
> PortXmitPkts:....................10975597
> PortRcvPkts:.....................8191613
> PortXmitWait:....................7570
> 
> 
> Run ib_send_bw test:
> [root@vsanqa7 ~]# ib_send_bw 
> ------------------------------------------------------------------
>                     Send BW Test
>  Number of qps   : 1
>  Connection type : RC
>  RX depth        : 600
>  CQ Moderation   : 50
>  Mtu             : 2048B
>  Link type       : IB
>  Max inline data : 0B
>  rdma_cm QPs	 : OFF
>  Data ex. method : Ethernet
> ------------------------------------------------------------------
>  local address: LID 0x02 QPN 0xde1b PSN 000000
>  remote address: LID 0x01 QPN 0x64004a PSN 000000
> ------------------------------------------------------------------
>  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
>  65536      1000           -nan               42.71  
> 
> Which is too low.
> 
> Perfquery after ib_send_bw test:
> 
> # Port counters: Lid 2 port 1
> PortSelect:......................1
> CounterSelect:...................0x1400
> SymbolErrorCounter:..............20750

Are symbol errors increasing?

> LinkErrorRecoveryCounter:........255

Could your link be going through error recovery, as this counter being
maxed out (255) suggests?

Can you clear this counter and see if it increments?
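
A minimal sketch of that check, assuming you save two perfquery dumps as
text and compare them offline. The helper names are hypothetical (they are
not part of infiniband-diags), and the counter widths are the PortCounters
field sizes as I understand them, so verify them against the spec:

```python
import re

# Assumed PortCounters field widths in bits. A counter sitting at its
# maximum value has saturated and stays there until it is cleared.
COUNTER_BITS = {
    "SymbolErrorCounter": 16,
    "LinkErrorRecoveryCounter": 8,
    "LinkDownedCounter": 8,
    "PortRcvErrors": 16,
}

# Matches lines such as "PortRcvErrors:...................5403".
# Hex fields like CounterSelect (0x1400) are deliberately skipped.
LINE_RE = re.compile(r"^\s*(\w+):\.*(\d+)\s*$")


def parse_perfquery(text):
    """Return {counter_name: value} for the decimal counters in a dump."""
    counters = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters


def report(before, after):
    """Return (counters that changed with their delta, saturated counters)."""
    deltas = {name: after[name] - before[name]
              for name in after
              if name in before and after[name] != before[name]}
    saturated = [name for name, bits in COUNTER_BITS.items()
                 if after.get(name) == (1 << bits) - 1]
    return deltas, saturated


# Values taken from the two dumps posted in this thread.
before = parse_perfquery("""SymbolErrorCounter:..............15814
LinkErrorRecoveryCounter:........255
PortRcvErrors:...................5403
""")
after = parse_perfquery("""SymbolErrorCounter:..............20750
LinkErrorRecoveryCounter:........255
PortRcvErrors:...................5473
""")

deltas, saturated = report(before, after)
print("changed:", deltas)
print("saturated:", saturated)
```

A saturated counter tells you nothing about what happened during the test,
which is why it has to be cleared first. I believe perfquery can reset the
counters itself (-R to reset only, -r to reset after reading), but check
the man page for your version.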

> LinkDownedCounter:...............0
> PortRcvErrors:...................5473 ====> the diff is about 70 errors

Port receive errors indicate one of the following:
• Local physical errors (ICRC, VCRC, LPCRC, and all physical
errors that cause entry into the BAD PACKET or BAD PACKET
DISCARD states of the packet receiver state machine)
• Malformed data packet errors (LVer, length, VL)
• Malformed link packet errors (operand, length, VL)
• Packets discarded due to buffer overrun
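
To put the 70 errors in perspective, here is a rough sketch of the receive
error ratio over the test, using only the numbers from the two dumps in
this thread. The function name is hypothetical, and the result is only
approximate because I am assuming PortRcvPkts is a fair denominator for
PortRcvErrors:

```python
def rcv_error_ratio(errors_before, errors_after, pkts_before, pkts_after):
    """Fraction of received packets counted in PortRcvErrors over an interval."""
    pkts = pkts_after - pkts_before
    if pkts <= 0:
        raise ValueError("no packets received in the interval")
    return (errors_after - errors_before) / pkts


# PortRcvErrors and PortRcvPkts before and after the ib_send_bw run.
ratio = rcv_error_ratio(5403, 5473, 8191613, 8234571)
print(f"receive error ratio: {ratio:.4%}")
```

That works out to roughly one errored packet in every six hundred received,
which is far more than a healthy link should show.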


> PortRcvRemotePhysicalErrors:.....0
> PortRcvSwitchRelayErrors:........0
> PortXmitDiscards:................0
> PortXmitConstraintErrors:........0
> PortRcvConstraintErrors:.........0
> CounterSelect2:..................0x00
> LocalLinkIntegrityErrors:........0
> ExcessiveBufferOverrunErrors:....0
> VL15Dropped:.....................0
> PortXmitData:....................2925617151
> PortRcvData:.....................167814727
> PortXmitPkts:....................10977290
> PortRcvPkts:.....................8234571
> PortXmitWait:....................7570
> 
> Once we hit this issue, any subsequent transfers on the IB links suffer
> high latency.
> Reloading the drivers resolves this problem (service openibd restart)
> Another data point is that we have not seen this in switched setups.
> Also, on the setup that sees this problem, we do not hit it every time.
> 
> Has anyone seen this before?

I suspect the link is retraining, either because minor errors exceeded
the threshold or because of major errors.

Can you try another known-good cable?

-- Hal

> 
> Thanks much,
> Pavan