[ewg] PortRcvErrors in back-to-back IB connections

pavan.tc pavan.tc at gmail.com
Wed Aug 14 02:30:01 PDT 2013


Hi,

In some of our back-to-back IB setups, we see high latency (and reduced
bandwidth) in random test cases. When this happens, we also see
PortRcvErrors increase (as reported by perfquery).
We are using Mellanox OFED 1.5.3. The hardware/firmware details of the IB
cards are as follows:

[root@vsanqa7 ~]# ibstat
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 1
        Firmware version: 2.10.700
        Hardware version: 0
        Node GUID: 0x002590ffff481618
        System image GUID: 0x002590ffff48161b
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 2
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251486a
                Port GUID: 0x002590ffff481619
                Link layer: InfiniBand

[root@vsanqa8 ~]# ibstat
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 1
        Firmware version: 2.10.700
        Hardware version: 0
        Node GUID: 0x002590ffff481614
        System image GUID: 0x002590ffff481617
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251486a
                Port GUID: 0x002590ffff481615
                Link layer: InfiniBand


perfquery output before ib_send_bw test:

# Port counters: Lid 2 port 1
PortSelect:......................1
CounterSelect:...................0x1400
SymbolErrorCounter:..............15814
LinkErrorRecoveryCounter:........255
LinkDownedCounter:...............0
PortRcvErrors:...................5403
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
VL15Dropped:.....................0
PortXmitData:....................2925583200
PortRcvData:.....................145715607
PortXmitPkts:....................10975597
PortRcvPkts:.....................8191613
PortXmitWait:....................7570


Run the ib_send_bw test:
[root@vsanqa7 ~]# ib_send_bw
------------------------------------------------------------------
                    Send BW Test
 Number of qps   : 1
 Connection type : RC
 RX depth        : 600
 CQ Moderation   : 50
 Mtu             : 2048B
 Link type       : IB
 Max inline data : 0B
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
------------------------------------------------------------------
 local address: LID 0x02 QPN 0xde1b PSN 000000
 remote address: LID 0x01 QPN 0x64004a PSN 000000
------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 65536      1000           -nan               42.71

An average of 42.71 MB/sec is far too low for a link with a rate of 40.

perfquery output after ib_send_bw test:

# Port counters: Lid 2 port 1
PortSelect:......................1
CounterSelect:...................0x1400
SymbolErrorCounter:..............20750
LinkErrorRecoveryCounter:........255
LinkDownedCounter:...............0
PortRcvErrors:...................5473 ====> 70 more errors than before the test
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
VL15Dropped:.....................0
PortXmitData:....................2925617151
PortRcvData:.....................167814727
PortXmitPkts:....................10977290
PortRcvPkts:.....................8234571
PortXmitWait:....................7570
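
To compare the two perfquery snapshots above without reading them side by side, here is a minimal Python sketch that parses the "Name:....value" lines and prints only the counters that changed. The function names (parse_perfquery, diff_counters) are made up for illustration and are not part of any OFED tool, and the sample text holds only a subset of the counters pasted above.

```python
import re

def parse_perfquery(text):
    """Parse 'Name:.....value' lines of perfquery output into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        # Counter name, a colon, a run of filler dots, then the value.
        m = re.match(r"^(\w+):\.*(\S+)", line.strip())
        if not m:
            continue  # header lines such as '# Port counters: ...'
        name, value = m.groups()
        try:
            counters[name] = int(value, 0)  # base 0 accepts decimal and 0x hex
        except ValueError:
            pass  # skip anything that is not a plain number
    return counters

def diff_counters(before, after):
    """Return {name: after - before} for every counter whose value changed."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }

# Subset of the counters from the two snapshots in this mail.
BEFORE = """\
# Port counters: Lid 2 port 1
SymbolErrorCounter:..............15814
LinkErrorRecoveryCounter:........255
PortRcvErrors:...................5403
PortXmitData:....................2925583200
PortRcvData:.....................145715607
PortXmitPkts:....................10975597
PortRcvPkts:.....................8191613
PortXmitWait:....................7570
"""

AFTER = """\
# Port counters: Lid 2 port 1
SymbolErrorCounter:..............20750
LinkErrorRecoveryCounter:........255
PortRcvErrors:...................5473
PortXmitData:....................2925617151
PortRcvData:.....................167814727
PortXmitPkts:....................10977290
PortRcvPkts:.....................8234571
PortXmitWait:....................7570
"""

deltas = diff_counters(parse_perfquery(BEFORE), parse_perfquery(AFTER))
for name, delta in sorted(deltas.items()):
    print(f"{name}: +{delta}")
```

With the values above, this shows that SymbolErrorCounter also grew during the test, not only PortRcvErrors. LinkErrorRecoveryCounter does not appear in the diff because it reads 255 in both snapshots. As far as I know that is the maximum of an 8-bit counter, so it has probably saturated and would not show further increases until the counters are reset.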

Once we hit this issue, all subsequent transfers on the IB link suffer
high latency.
Reloading the drivers (service openibd restart) resolves the problem.
Another data point is that we have not seen this in switched setups.
Also, on the setups that do see this problem, we do not hit it every time.

Has anyone seen this before?

Thanks much,
Pavan