[Users] High PortXmitWait on a lot of ports - degraded performance

German Anders ganders at despegar.com
Wed Nov 25 04:29:51 PST 2015


Hi all,

I'm having some issues with my IB network. Basically, I have the setup shown
in the attached PDF. I ran a fio test from an HP blade with QDR (ports bonded
in active/backup mode) to a storage cluster with FDR (no bonding at all), and
the best result I can get is 1.7 GB/s, which is pretty slow; I was hoping for
something between 2.5 and 3.5 GB/s on a QDR InfiniBand network. I then tried
to tweak some parameters, for example setting the scaling_governor to
'performance' and enabling 'connected' mode on the IB ports, and changed the
following variables:

sysctl -w net.core.netdev_max_backlog=250000
sysctl -w net.core.rmem_max=4194304
sysctl -w net.core.wmem_max=4194304
sysctl -w net.core.rmem_default=4194304
sysctl -w net.core.wmem_default=4194304
sysctl -w net.core.optmem_max=4194304
sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
sysctl -w net.ipv4.tcp_low_latency=1
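
For reference, this is roughly how I applied the governor and connected-mode
changes (a quick sketch; the sysfs paths are the ones I used on these kernels
and may differ on other setups):

# set the CPU frequency governor to 'performance' on every core
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > $g
done

# switch the IPoIB interfaces from datagram to connected mode
echo connected > /sys/class/net/ib0/mode
echo connected > /sys/class/net/ib1/mode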

The bond configuration is the following:

# cat /etc/modprobe.d/bonding.conf

alias bond-ib bonding
options bonding mode=1 miimon=100 downdelay=100 updelay=100 max_bonds=2


# cat /etc/network/interfaces

(...)

## INFINIBAND CONF
auto ib0
iface ib0 inet manual
        bond-master bond-ib

auto ib1
iface ib1 inet manual
        bond-master bond-ib

auto bond-ib
iface bond-ib inet static
    address 172.23.18.1
    netmask 255.255.240.0
    slaves ib0 ib1
    bond_miimon 100
    bond_mode active-backup
    pre-up echo connected > /sys/class/net/ib0/mode
    pre-up echo connected > /sys/class/net/ib1/mode
    pre-up /sbin/ifconfig ib0 mtu 65520
    pre-up /sbin/ifconfig ib1 mtu 65520
    pre-up modprobe bond-ib
    pre-up /sbin/ifconfig bond-ib mtu 65520
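
To double-check that the bond really came up in active-backup mode (which, as
far as I understand, gives failover only and does not aggregate the two QDR
ports), I look at the bonding state and the MTUs that were actually applied;
a quick sketch:

# bonding mode, MII status and currently active slave
cat /proc/net/bonding/bond-ib

# confirm the MTU on the bond and on both slaves
ip link show bond-ib
ip link show ib0
ip link show ib1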


The OS is Ubuntu 14.04.3 LTS with kernel 3.13.0-63-generic on the HP blades,
and Ubuntu 14.04.3 LTS with kernel 3.19.0-25-generic on the storage cluster.

The IB mezzanine cards on the HP blades are "InfiniBand: Mellanox
Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev
b0)", and the adapters on the storage cluster are "Network controller:
Mellanox Technologies MT27500 Family [ConnectX-3]".

Then I ran the 'ibqueryerrors' command on one of the cluster nodes and found
the following:

$ ibqueryerrors
Errors for "e60-host01 HCA-1"  ---> blade1 the one with the bonding
configuration using internally HP-IB-SW port 17 and 25
   GUID 0xf452140300dd3296 port 2: [PortXmitWait == 15]
Errors for 0x2c902004b0918 "Infiniscale-IV Mellanox Technologies"
   GUID 0x2c902004b0918 port ALL: [PortXmitWait == 325727936]
   GUID 0x2c902004b0918 port 25: [PortXmitWait == 325727936]
Errors for 0xe41d2d030031e9c1 "MF0;GWIB01:SX6036G/U1"
   GUID 0xe41d2d030031e9c1 port ALL: [PortXmitWait == 326981305]
   GUID 0xe41d2d030031e9c1 port 11: [PortXmitWait == 326976642]
   GUID 0xe41d2d030031e9c1 port 36: [PortXmitWait == 4663]
Errors for 0xf45214030073f500 "MF0;SWIB02:SX6018/U1"
   GUID 0xf45214030073f500 port ALL: [PortXmitWait == 13979524]
   GUID 0xf45214030073f500 port 8: [PortXmitWait == 3749467]
   GUID 0xf45214030073f500 port 9: [PortXmitWait == 3434343]
   GUID 0xf45214030073f500 port 10: [PortXmitWait == 3389114]
   GUID 0xf45214030073f500 port 11: [PortXmitWait == 3406600]
Errors for 0xe41d2d030031eb41 "MF0;GWIB02:SX6036G/U1"
   GUID 0xe41d2d030031eb41 port ALL: [PortXmitWait == 1352]
   GUID 0xe41d2d030031eb41 port 34: [PortXmitWait == 1352]
Errors for "cibn08 HCA-1"
   GUID 0xe41d2d03007b77c1 port 1: [PortXmitWait == 813152781]
   GUID 0xe41d2d03007b77c2 port 2: [PortXmitWait == 3256286]
Errors for "cibn07 HCA-1"
   GUID 0xe41d2d03007b67c1 port 1: [PortXmitWait == 841850209]
   GUID 0xe41d2d03007b67c2 port 2: [PortXmitWait == 3211488]
Errors for "cibn05 HCA-1"
   GUID 0xe41d2d0300d95191 port 1: [PortXmitWait == 840576923]
   GUID 0xe41d2d0300d95192 port 2: [PortXmitWait == 2635901]
Errors for "cibn06 HCA-1"
   GUID 0xe41d2d03007b77b1 port 1: [PortXmitWait == 843231930]
   GUID 0xe41d2d03007b77b2 port 2: [PortXmitWait == 2869022]
Errors for 0xe41d2d0300097630 "MF0;SWIB01:SX6018/U1"
   GUID 0xe41d2d0300097630 port ALL: [PortXmitWait == 470746689]
   GUID 0xe41d2d0300097630 port 0: [PortXmitWait == 7]
   GUID 0xe41d2d0300097630 port 2: [PortXmitWait == 8046]
   GUID 0xe41d2d0300097630 port 3: [PortXmitWait == 7631]
   GUID 0xe41d2d0300097630 port 8: [PortXmitWait == 219608]
   GUID 0xe41d2d0300097630 port 9: [PortXmitWait == 216118]
   GUID 0xe41d2d0300097630 port 10: [PortXmitWait == 198693]
   GUID 0xe41d2d0300097630 port 11: [PortXmitWait == 206192]
   GUID 0xe41d2d0300097630 port 18: [PortXmitWait == 469890394]
Errors for "cibm01 HCA-1"
   GUID 0xe41d2d0300163651 port 1: [PortXmitWait == 6002]

## Summary: 22 nodes checked, 11 bad nodes found
##          208 ports checked, 26 ports have errors beyond threshold
##
## Suppressed:
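
As far as I understand, PortXmitWait is not really an error counter: it
counts ticks in which a port had data queued but could not send it for lack
of flow-control credits, so it points at congestion/back-pressure rather than
bad links. To look at the remaining counters without it, I believe
ibqueryerrors can filter it out (assuming the -s/--suppress option of
infiniband-diags):

$ ibqueryerrors -s PortXmitWait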


$ ibportstate -L 29 17 query
Switch PortInfo:
# Port info: Lid 29 port 17
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................75
SMLid:...........................2328
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
Peer PortInfo:
# Port info: Lid 29 DR path slid 4; dlid 65535; 0,17 port 1
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................32
SMLid:...........................2
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0

---

$ ibportstate -L 29 25 query
Switch PortInfo:
# Port info: Lid 29 port 25
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................75
SMLid:...........................2328
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
Peer PortInfo:
# Port info: Lid 29 DR path slid 4; dlid 65535; 0,25 port 2
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................33
SMLid:...........................2
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
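
Both switch ports report 4X width at 10.0 Gbps, i.e. QDR signalling (40 Gb/s,
or roughly 32 Gb/s of payload after 8b/10b encoding, so about 4 GB/s
theoretical per link). To make sure nothing along the path negotiated down to
1X or a lower speed, I also dump all the fabric links; a rough sketch (the
grep pattern is just my way of spotting anything that is not 4X):

# list every link with its active width/speed
$ iblinkinfo

# show only links that did not come up at 4X
$ iblinkinfo | grep -v " 4X "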


At first I thought that maybe some cables were in a bad state, but... all of
them? So I really don't know whether this XmitWait is actually hurting
performance or not. Any ideas or hints? Also, I have the SM configured on
SWIB01 with high priority and a second SM configured on SWIB02 with lower
priority, both in an active state. Is this OK, or is it better to have one
and only one SM active at a time in the entire IB network?
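
To try to answer part of this myself, my plan is to zero the counters, re-run
the fio test, and see how fast PortXmitWait grows on the ports in the data
path; roughly like this (the LID/port are from my fabric, and I'm assuming
the standard ibclearcounters/perfquery tools from infiniband-diags):

# reset all port counters in the fabric
$ ibclearcounters

# ... run the fio test ...

# re-read the extended counters on the uplink that showed the biggest
# PortXmitWait, e.g. LID 29 port 17
$ perfquery -x 29 17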

Below are also some iperf tests between blades that sit in different
enclosures:

*e61-host01 (server):*

# iperf -s

*e60-host01 (client):*

# iperf -c 172.23.18.10 -P 4

------------------------------------------------------------
Client connecting to 172.23.18.10, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.23.18.1 port 52325 connected with 172.23.18.10 port 5001
[  4] local 172.23.18.1 port 52326 connected with 172.23.18.10 port 5001
[  5] local 172.23.18.1 port 52327 connected with 172.23.18.10 port 5001
[  6] local 172.23.18.1 port 52328 connected with 172.23.18.10 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  3.55 GBytes  3.05 Gbits/sec
[  6]  0.0-10.0 sec  3.02 GBytes  2.60 Gbits/sec
[  3]  0.0-10.0 sec  2.91 GBytes  2.50 Gbits/sec
[  5]  0.0-10.0 sec  2.75 GBytes  2.36 Gbits/sec
[SUM]  0.0-10.0 sec  12.2 GBytes  10.5 Gbits/sec

---

Now, between a storage cluster node and a blade:

*e60-host01 (server):*

# iperf -s

*cibn05 (client):*

# iperf -c 172.23.18.1 -P 4

------------------------------------------------------------
Client connecting to 172.23.18.1, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  6] local 172.23.17.5 port 34263 connected with 172.23.18.1 port 5001
[  4] local 172.23.17.5 port 34260 connected with 172.23.18.1 port 5001
[  5] local 172.23.17.5 port 34262 connected with 172.23.18.1 port 5001
[  3] local 172.23.17.5 port 34261 connected with 172.23.18.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 9.0 sec  3.80 GBytes  3.63 Gbits/sec
[  5]  0.0- 9.0 sec  3.78 GBytes  3.60 Gbits/sec
[  3]  0.0- 9.0 sec  3.78 GBytes  3.61 Gbits/sec
[  6]  0.0-10.0 sec  5.26 GBytes  4.52 Gbits/sec
[SUM]  0.0-10.0 sec  16.6 GBytes  14.3 Gbits/sec
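
I also plan to take IPoIB/TCP out of the picture and measure the raw verbs
bandwidth between a storage node and a blade, to see whether the limit is the
fabric itself or the IP stack on top of it; a sketch, assuming the ib_send_bw
tool from the perftest package (hosts/addresses are mine):

# on the storage node (server side)
cibn05# ib_send_bw -a

# on the blade (client side), pointing at the storage node's IPoIB address
e60-host01# ib_send_bw -a 172.23.17.5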


Thanks in advance,

Best,


*German*
Attachment: Infiniband Diagram v1(1).pdf (application/pdf, 237019 bytes)
URL: <http://lists.openfabrics.org/pipermail/users/attachments/20151125/2b6de172/attachment.pdf>

