[ewg] PortRcvErrors in back-to-back IB connections
Hal Rosenstock
hal at dev.mellanox.co.il
Fri Aug 16 07:06:54 PDT 2013
On 8/14/2013 5:30 AM, pavan.tc wrote:
> Hi,
>
> In some of our back-to-back IB setups, we experience high latencies (and
> lowered bandwidth) in random test cases. When this happens, we also see
> PortRcvErrors (seen via perfquery).
> We are using Mellanox OFED 1.5.3. The hardware/firmware details of the
> IB cards used are as below:
>
> [root@vsanqa7 ~]# ibstat
> CA 'mlx4_0'
> CA type: MT4099
> Number of ports: 1
> Firmware version: 2.10.700
> Hardware version: 0
> Node GUID: 0x002590ffff481618
> System image GUID: 0x002590ffff48161b
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 40
> Base lid: 2
> LMC: 0
> SM lid: 1
> Capability mask: 0x0251486a
> Port GUID: 0x002590ffff481619
> Link layer: InfiniBand
>
> [root@vsanqa8 ~]# ibstat
> CA 'mlx4_0'
> CA type: MT4099
> Number of ports: 1
> Firmware version: 2.10.700
> Hardware version: 0
> Node GUID: 0x002590ffff481614
> System image GUID: 0x002590ffff481617
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 40
> Base lid: 1
> LMC: 0
> SM lid: 1
> Capability mask: 0x0251486a
> Port GUID: 0x002590ffff481615
> Link layer: InfiniBand
>
>
> perfquery output before ib_send_bw test:
>
> # Port counters: Lid 2 port 1
> PortSelect:......................1
> CounterSelect:...................0x1400
> SymbolErrorCounter:..............15814
> LinkErrorRecoveryCounter:........255
> LinkDownedCounter:...............0
> PortRcvErrors:...................5403
> PortRcvRemotePhysicalErrors:.....0
> PortRcvSwitchRelayErrors:........0
> PortXmitDiscards:................0
> PortXmitConstraintErrors:........0
> PortRcvConstraintErrors:.........0
> CounterSelect2:..................0x00
> LocalLinkIntegrityErrors:........0
> ExcessiveBufferOverrunErrors:....0
> VL15Dropped:.....................0
> PortXmitData:....................2925583200
> PortRcvData:.....................145715607
> PortXmitPkts:....................10975597
> PortRcvPkts:.....................8191613
> PortXmitWait:....................7570
>
>
> Run ib_send_bw test:
> [root@vsanqa7 ~]# ib_send_bw
> ------------------------------------------------------------------
> Send BW Test
> Number of qps : 1
> Connection type : RC
> RX depth : 600
> CQ Moderation : 50
> Mtu : 2048B
> Link type : IB
> Max inline data : 0B
> rdma_cm QPs : OFF
> Data ex. method : Ethernet
> ------------------------------------------------------------------
> local address: LID 0x02 QPN 0xde1b PSN 000000
> remote address: LID 0x01 QPN 0x64004a PSN 000000
> ------------------------------------------------------------------
> #bytes #iterations BW peak[MB/sec] BW average[MB/sec]
> 65536 1000 -nan 42.71
>
> This is too low.
>
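To put "too low" in numbers, here is a rough back-of-the-envelope sketch. It assumes the `Rate: 40` reported by ibstat is a 40 Gb/s signaling rate, ignores encoding and protocol overhead, and assumes 1 MB = 10^6 bytes (ib_send_bw may count 2^20 bytes instead, which changes the result only slightly). The function name and the `mb_bytes` parameter are made up for illustration.

```python
def fraction_of_line_rate(measured_mb_per_s, link_gbit_per_s, mb_bytes=1_000_000):
    """Return measured throughput as a fraction of the raw link rate.

    Assumption: the raw rate is used as the reference, so encoding and
    protocol overhead are ignored and the true usable ceiling is lower.
    """
    measured_bytes_per_s = measured_mb_per_s * mb_bytes
    line_bytes_per_s = link_gbit_per_s * 1e9 / 8
    return measured_bytes_per_s / line_bytes_per_s


# Values taken from the ibstat and ib_send_bw output quoted above.
frac = fraction_of_line_rate(42.71, 40)
print(f"{frac:.2%} of raw line rate")
```

Under these assumptions the measured average is below 1% of the raw link rate, so the shortfall is far larger than any overhead could explain.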
> perfquery output after ib_send_bw test:
>
> # Port counters: Lid 2 port 1
> PortSelect:......................1
> CounterSelect:...................0x1400
> SymbolErrorCounter:..............20750
Are symbol errors increasing?
> LinkErrorRecoveryCounter:........255
Could it be that your link is going through error recovery, as indicated
by this counter being maxed out at 255?
Can you clear this counter and see whether it increments?
> LinkDownedCounter:...............0
> PortRcvErrors:...................5473 ====> the diff is 70 errors
Port receive errors indicate one of the following:
• Local physical errors (ICRC, VCRC, LPCRC, and all physical
errors that cause entry into the BAD PACKET or BAD PACKET
DISCARD states of the packet receiver state machine)
• Malformed data packet errors (LVer, length, VL)
• Malformed link packet errors (operand, length, VL)
• Packets discarded due to buffer overrun
> PortRcvRemotePhysicalErrors:.....0
> PortRcvSwitchRelayErrors:........0
> PortXmitDiscards:................0
> PortXmitConstraintErrors:........0
> PortRcvConstraintErrors:.........0
> CounterSelect2:..................0x00
> LocalLinkIntegrityErrors:........0
> ExcessiveBufferOverrunErrors:....0
> VL15Dropped:.....................0
> PortXmitData:....................2925617151
> PortRcvData:.....................167814727
> PortXmitPkts:....................10977290
> PortRcvPkts:.....................8234571
> PortXmitWait:....................7570
>
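The questions above (are the errors increasing, and is a counter pegged at its maximum) can be checked mechanically by diffing two perfquery dumps. The following is a minimal sketch, not part of any OFED tool: the helper names are hypothetical, and the counter widths in `COUNTER_MAX` (8-bit and 16-bit saturating fields) are an assumption that should be checked against the PortCounters definition for your stack.

```python
import re

# Assumed saturation values: 8-bit and 16-bit counters that stick at their
# maximum instead of wrapping. Verify these widths before relying on them.
COUNTER_MAX = {
    "SymbolErrorCounter": 0xFFFF,
    "LinkErrorRecoveryCounter": 0xFF,
    "LinkDownedCounter": 0xFF,
    "PortRcvErrors": 0xFFFF,
}

# Matches lines such as "PortRcvErrors:...................5473",
# with or without a leading "> " quote marker.
LINE_RE = re.compile(r"^\s*>?\s*(\w+):\.+(\S+)")


def parse_perfquery(text):
    """Parse 'Name:....value' lines from a perfquery dump into a dict."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        name, value = match.groups()
        try:
            counters[name] = int(value, 0)  # accepts decimal and 0x-prefixed hex
        except ValueError:
            continue
    return counters


def diff_counters(before, after):
    """Return {name: (delta, saturated)} for the error counters of interest."""
    report = {}
    for name, max_value in COUNTER_MAX.items():
        if name in before and name in after:
            delta = after[name] - before[name]
            report[name] = (delta, after[name] >= max_value)
    return report


# Values copied from the two dumps quoted in this thread.
BEFORE = """
SymbolErrorCounter:..............15814
LinkErrorRecoveryCounter:........255
LinkDownedCounter:...............0
PortRcvErrors:...................5403
"""
AFTER = """
SymbolErrorCounter:..............20750
LinkErrorRecoveryCounter:........255
LinkDownedCounter:...............0
PortRcvErrors:...................5473
"""

report = diff_counters(parse_perfquery(BEFORE), parse_perfquery(AFTER))
for name, (delta, saturated) in report.items():
    note = " (saturated, clear it to see new events)" if saturated else ""
    print(f"{name}: +{delta}{note}")
```

A saturated counter reports a delta of zero even if events keep occurring, which is why it has to be cleared before its behavior can be judged.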
> Once we hit this issue, any subsequent transfers on the IB links suffer
> high latency.
> Reloading the drivers resolves this problem (service openibd restart)
> Another data point is that we have not seen this in switched setups.
> Also, on the setup that sees this problem, we do not hit it every time.
>
> Has anyone seen this before?
I suspect the link is retraining, due either to minor errors exceeding
the threshold or to major errors.
Can you try another known-good cable?
-- Hal
>
> Thanks much,
> Pavan
>
>
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg