<div dir="ltr"><div><div><div>Hi,<br><br></div>In some of our back-to-back IB setups, we see high latency (and reduced bandwidth) in random test cases. When this happens, PortRcvErrors also increases (as reported by perfquery).<br>
We are using Mellanox OFED 1.5.3. The hardware and firmware details of the IB cards are as follows:<br><br>[root@vsanqa7 ~]# ibstat<br>CA 'mlx4_0'<br>        CA type: MT4099<br>        Number of ports: 1<br>        Firmware version: 2.10.700<br>
        Hardware version: 0<br>        Node GUID: 0x002590ffff481618<br>        System image GUID: 0x002590ffff48161b<br>        Port 1:<br>                State: Active<br>                Physical state: LinkUp<br>                Rate: 40<br>
                Base lid: 2<br>                LMC: 0<br>                SM lid: 1<br>                Capability mask: 0x0251486a<br>                Port GUID: 0x002590ffff481619<br>                Link layer: InfiniBand<br>
<br>[root@vsanqa8 ~]# ibstat <br>CA 'mlx4_0'<br>        CA type: MT4099<br>        Number of ports: 1<br>        Firmware version: 2.10.700<br>        Hardware version: 0<br>        Node GUID: 0x002590ffff481614<br>
        System image GUID: 0x002590ffff481617<br>        Port 1:<br>                State: Active<br>                Physical state: LinkUp<br>                Rate: 40<br>                Base lid: 1<br>                LMC: 0<br>
                SM lid: 1<br>                Capability mask: 0x0251486a<br>                Port GUID: 0x002590ffff481615<br>                Link layer: InfiniBand<br><br></div><div><br></div><div>perfquery output before ib_send_bw test:<br>
</div><div><pre class="" id="comment_text_3"># Port counters: Lid 2 port 1
PortSelect:......................1
CounterSelect:...................0x1400
SymbolErrorCounter:..............15814
LinkErrorRecoveryCounter:........255
LinkDownedCounter:...............0
PortRcvErrors:...................5403
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
VL15Dropped:.....................0
PortXmitData:....................2925583200
PortRcvData:.....................145715607
PortXmitPkts:....................10975597
PortRcvPkts:.....................8191613
PortXmitWait:....................7570


Running the ib_send_bw test:
[root@vsanqa7 ~]# ib_send_bw 
------------------------------------------------------------------
                    Send BW Test
 Number of qps   : 1
 Connection type : RC
 RX depth        : 600
 CQ Moderation   : 50
 Mtu             : 2048B
 Link type       : IB
 Max inline data : 0B
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
------------------------------------------------------------------
 local address: LID 0x02 QPN 0xde1b PSN 000000
 remote address: LID 0x01 QPN 0x64004a PSN 000000
------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 65536      1000           -nan               42.71  

This is far too low for a 40 Gb/s link.
<br>perfquery output after the ib_send_bw test:<br><br># Port counters: Lid 2 port 1
PortSelect:......................1
CounterSelect:...................0x1400
SymbolErrorCounter:..............20750
LinkErrorRecoveryCounter:........255
LinkDownedCounter:...............0
PortRcvErrors:...................5473 ====> an increase of 70 errors during the test
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
VL15Dropped:.....................0
PortXmitData:....................2925617151
PortRcvData:.....................167814727
PortXmitPkts:....................10977290
PortRcvPkts:.....................8234571
PortXmitWait:....................7570</pre>Once we hit this issue, all subsequent transfers on the IB link suffer high latency.<br></div>Reloading the drivers (service openibd restart) resolves the problem.<br></div><div>
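For anyone who wants to compare the two perfquery snapshots above without reading them line by line, here is a minimal Python sketch that computes per-counter deltas. It assumes the "Name:....value" layout shown in the output above; the helper names parse_counters and counter_deltas are made up for illustration, and hex fields such as CounterSelect are simply skipped. Only a subset of the counters is pasted in, to keep it short.<br>

```python
import re

# Matches perfquery lines such as "PortRcvErrors:...................5403".
# Hex fields (e.g. CounterSelect: 0x1400) do not match and are skipped.
COUNTER_LINE = re.compile(r"^(\w+):\.+(\d+)(?:\s|$)")


def parse_counters(text):
    """Return {counter name: int value} for decimal counters in perfquery output."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line.strip())
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def counter_deltas(before, after):
    """Return only the counters whose value changed between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] != before[name]
    }


# Subset of the snapshot taken before the ib_send_bw run (values from this post).
before = parse_counters("""\
CounterSelect:...................0x1400
SymbolErrorCounter:..............15814
LinkErrorRecoveryCounter:........255
PortRcvErrors:...................5403
PortXmitPkts:....................10975597
PortRcvPkts:.....................8191613
""")

# Subset of the snapshot taken after the ib_send_bw run (values from this post).
after = parse_counters("""\
CounterSelect:...................0x1400
SymbolErrorCounter:..............20750
LinkErrorRecoveryCounter:........255
PortRcvErrors:...................5473
PortXmitPkts:....................10977290
PortRcvPkts:.....................8234571
""")

deltas = counter_deltas(before, after)
for name, delta in deltas.items():
    print(f"{name}: +{delta}")
```

Counters that did not change between the two snapshots (LinkErrorRecoveryCounter here) are left out of the result, so only the fields that moved during the test are listed.<br>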
Another data point: we have not seen this in switched setups.<br></div><div>Also, even on the setup that shows this problem, we do not hit it every time.<br></div><div><br></div><div>Has anyone seen this before?<br><br></div>
<div>Thanks much,<br>Pavan<br></div></div>