<div dir="ltr"><div><div><div>Hi,<br><br></div>In some of our back-to-back IB setups, we experience high latency (and reduced bandwidth) in random test cases. When this happens, we also see PortRcvErrors increase (as reported by perfquery).<br>
We are using Mellanox OFED 1.5.3. The hardware/firmware details of the IB cards used are as below:<br><br>[root@vsanqa7 ~]# ibstat<br>CA 'mlx4_0'<br> CA type: MT4099<br> Number of ports: 1<br> Firmware version: 2.10.700<br>
Hardware version: 0<br> Node GUID: 0x002590ffff481618<br> System image GUID: 0x002590ffff48161b<br> Port 1:<br> State: Active<br> Physical state: LinkUp<br> Rate: 40<br>
Base lid: 2<br> LMC: 0<br> SM lid: 1<br> Capability mask: 0x0251486a<br> Port GUID: 0x002590ffff481619<br> Link layer: InfiniBand<br>
<br>[root@vsanqa8 ~]# ibstat <br>CA 'mlx4_0'<br> CA type: MT4099<br> Number of ports: 1<br> Firmware version: 2.10.700<br> Hardware version: 0<br> Node GUID: 0x002590ffff481614<br>
System image GUID: 0x002590ffff481617<br> Port 1:<br> State: Active<br> Physical state: LinkUp<br> Rate: 40<br> Base lid: 1<br> LMC: 0<br>
SM lid: 1<br> Capability mask: 0x0251486a<br> Port GUID: 0x002590ffff481615<br> Link layer: InfiniBand<br><br></div><div><br></div><div>perfquery output before ib_send_bw test:<br>
</div><div><pre class="" id="comment_text_3"># Port counters: Lid 2 port 1
PortSelect:......................1
CounterSelect:...................0x1400
SymbolErrorCounter:..............15814
LinkErrorRecoveryCounter:........255
LinkDownedCounter:...............0
PortRcvErrors:...................5403
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
VL15Dropped:.....................0
PortXmitData:....................2925583200
PortRcvData:.....................145715607
PortXmitPkts:....................10975597
PortRcvPkts:.....................8191613
PortXmitWait:....................7570
Run ib_send_bw test:
[root@vsanqa7 ~]# ib_send_bw
------------------------------------------------------------------
Send BW Test
Number of qps : 1
Connection type : RC
RX depth : 600
CQ Moderation : 50
Mtu : 2048B
Link type : IB
Max inline data : 0B
rdma_cm QPs : OFF
Data ex. method : Ethernet
------------------------------------------------------------------
local address: LID 0x02 QPN 0xde1b PSN 000000
remote address: LID 0x01 QPN 0x64004a PSN 000000
------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec]
65536 1000 -nan 42.71
This is far too low for a 40 Gb/s link.
<br>perfquery output after ib_send_bw test:<br><br># Port counters: Lid 2 port 1
PortSelect:......................1
CounterSelect:...................0x1400
SymbolErrorCounter:..............20750
LinkErrorRecoveryCounter:........255
LinkDownedCounter:...............0
PortRcvErrors:...................5473 ====> an increase of 70 errors (5473 - 5403)
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
VL15Dropped:.....................0
PortXmitData:....................2925617151
PortRcvData:.....................167814727
PortXmitPkts:....................10977290
PortRcvPkts:.....................8234571
PortXmitWait:....................7570</pre>Once we hit this issue, all subsequent transfers on the IB link suffer high latency.<br></div>Reloading the drivers (service openibd restart) resolves the problem.<br></div><div>
Another data point is that we have not seen this in switched setups.<br></div><div>Also, on the setups that do see this problem, we do not hit it every time.<br></div><div><br></div><div>Has anyone seen this before?<br><br></div>
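<div>For anyone comparing the two perfquery snapshots above, the per-counter deltas can be computed with a short script. This is only a minimal sketch: it assumes the "Name:....value" line format shown in the output above, and the helper names (parse_counters, counter_deltas) are hypothetical, not part of any IB tool. The sample values are copied from the two snapshots.<br></div>

```python
import re

# Matches perfquery lines such as "PortRcvErrors:...................5403".
# The hex alternative is listed first so "0x1400" is not read as decimal 0.
# Anything after the value (e.g. a trailing annotation) is ignored.
LINE_RE = re.compile(r"^(\w+):\.+(0x[0-9a-fA-F]+|\d+)")


def parse_counters(text):
    """Return {name: int} for every decimal-valued line in perfquery output."""
    counters = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        # Hex fields (CounterSelect, CounterSelect2) are masks, not counters.
        if m and not m.group(2).startswith("0x"):
            counters[m.group(1)] = int(m.group(2))
    return counters


def counter_deltas(before_text, after_text):
    """Return {name: after - before} for counters whose value changed."""
    before = parse_counters(before_text)
    after = parse_counters(after_text)
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


# Values taken from the two snapshots in this post (abbreviated).
before = """\
SymbolErrorCounter:..............15814
PortRcvErrors:...................5403
PortXmitWait:....................7570
"""
after = """\
SymbolErrorCounter:..............20750
PortRcvErrors:...................5473
PortXmitWait:....................7570
"""

print(counter_deltas(before, after))
```

<div>On the abbreviated sample this reports a SymbolErrorCounter increase of 4936 alongside the PortRcvErrors increase of 70; PortXmitWait is unchanged and is therefore omitted.<br></div>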
<div>Thanks much,<br>Pavan<br></div></div>