[ewg] PortRcvErrors in back-to-back IB connections
pavan.tc
pavan.tc at gmail.com
Wed Aug 14 02:30:01 PDT 2013
Hi,
In some of our back-to-back IB setups, we experience high latencies (and
reduced bandwidth) in random test cases. When this happens, we also see
PortRcvErrors incrementing (via perfquery).
We are using Mellanox OFED 1.5.3. The hardware/firmware details of the IB
cards used are as follows:
[root at vsanqa7 ~]# ibstat
CA 'mlx4_0'
CA type: MT4099
Number of ports: 1
Firmware version: 2.10.700
Hardware version: 0
Node GUID: 0x002590ffff481618
System image GUID: 0x002590ffff48161b
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 2
LMC: 0
SM lid: 1
Capability mask: 0x0251486a
Port GUID: 0x002590ffff481619
Link layer: InfiniBand
[root at vsanqa8 ~]# ibstat
CA 'mlx4_0'
CA type: MT4099
Number of ports: 1
Firmware version: 2.10.700
Hardware version: 0
Node GUID: 0x002590ffff481614
System image GUID: 0x002590ffff481617
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x0251486a
Port GUID: 0x002590ffff481615
Link layer: InfiniBand
perfquery output before the ib_send_bw test:
# Port counters: Lid 2 port 1
PortSelect:......................1
CounterSelect:...................0x1400
SymbolErrorCounter:..............15814
LinkErrorRecoveryCounter:........255
LinkDownedCounter:...............0
PortRcvErrors:...................5403
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
VL15Dropped:.....................0
PortXmitData:....................2925583200
PortRcvData:.....................145715607
PortXmitPkts:....................10975597
PortRcvPkts:.....................8191613
PortXmitWait:....................7570
Running the ib_send_bw test:
[root at vsanqa7 ~]# ib_send_bw
------------------------------------------------------------------
Send BW Test
Number of qps : 1
Connection type : RC
RX depth : 600
CQ Moderation : 50
Mtu : 2048B
Link type : IB
Max inline data : 0B
rdma_cm QPs : OFF
Data ex. method : Ethernet
------------------------------------------------------------------
local address: LID 0x02 QPN 0xde1b PSN 000000
remote address: LID 0x01 QPN 0x64004a PSN 000000
------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec]
65536 1000 -nan 42.71
The 42.71 MB/sec average is far too low for this link.
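For scale, here is a back-of-the-envelope ceiling for this link (a rough sketch, assuming the "Rate: 40" reported by ibstat is a QDR 4x link using 8b/10b encoding; real ib_send_bw numbers at 64 KB messages are typically a large fraction of this, so 42.71 MB/sec is roughly two orders of magnitude off):

```shell
# Assumed: QDR 4x link, 40 Gb/s signaling rate, 8b/10b line encoding.
raw_gbps=40
# 8b/10b encoding carries 8 data bits per 10 line bits.
data_gbps=$(( raw_gbps * 8 / 10 ))       # 32 Gb/s of payload capacity
data_MBps=$(( data_gbps * 1000 / 8 ))    # convert to MB/s
echo "theoretical ceiling: approx ${data_MBps} MB/s"
```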
perfquery output after the ib_send_bw test:
# Port counters: Lid 2 port 1
PortSelect:......................1
CounterSelect:...................0x1400
SymbolErrorCounter:..............20750
LinkErrorRecoveryCounter:........255
LinkDownedCounter:...............0
PortRcvErrors:...................5473 ====> the diff is about 70 errors
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
VL15Dropped:.....................0
PortXmitData:....................2925617151
PortRcvData:.....................167814727
PortXmitPkts:....................10977290
PortRcvPkts:.....................8234571
PortXmitWait:....................7570
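To make these before/after dumps easier to compare, the changed counters can be diffed mechanically. A minimal sketch, assuming the two perfquery dumps are saved to before.txt and after.txt in the "Counter:....value" format shown above (small samples are embedded here so the script is self-contained):

```shell
# Embed sample counter lines from the dumps above (in practice, redirect
# the real perfquery output into these files instead).
cat > before.txt <<'EOF'
SymbolErrorCounter:..............15814
PortRcvErrors:...................5403
PortRcvPkts:.....................8191613
EOF
cat > after.txt <<'EOF'
SymbolErrorCounter:..............20750
PortRcvErrors:...................5473
PortRcvPkts:.....................8234571
EOF
# Split each line on the colon-plus-dots separator, remember the "before"
# values, then print any counter whose value changed.
awk -F':\.*' '
    NR == FNR { before[$1] = $2; next }
    $1 in before && $2 != before[$1] {
        printf "%-28s %10d -> %10d (+%d)\n", $1, before[$1], $2, $2 - before[$1]
    }
' before.txt after.txt
```

On the dumps above this highlights the SymbolErrorCounter jump (+4936) alongside the PortRcvErrors delta (+70), which points at physical-layer trouble on the link.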
Once we hit this issue, any subsequent transfers on the IB links suffer
high latency.
Reloading the drivers (service openibd restart) resolves the problem.
Another data point is that we have not seen this in switched setups.
Also, on the setup that sees this problem, we do not hit it every time.
Has anyone seen this before?
Thanks much,
Pavan