[libfabric-users] Performance improvements with libfabric 1.6
joern.schumacher at cern.ch
Tue Aug 21 02:17:47 PDT 2018
Dear libfabric developers,
I have already sent another message regarding libfabric 1.6, but this is
a different topic, so I am opening a separate thread.
I recently updated from libfabric 1.4 to 1.6.1. I use the Verbs provider
with 100G Ethernet+RoCE and 56G InfiniBand. Queues are monitored with
file descriptors and epoll. The situation improved significantly with
1.6, so this is not a bug report; rather, I would like to get a deeper
understanding of what changed and why it is affecting us so much.
With libfabric 1.4 I had an issue where my application would
occasionally run at very poor performance: around 0.5 Gbps instead of
20+ Gbps, so quite a dramatic effect. A restart usually fixed it. This
happened more often on the 100G Ethernet than on the 56G IB, but was
present on both.
When this happened I saw high CPU utilization, with a lot of CPU time
spent in system calls. I suspect this has to do with epoll polling on
the completion and event queues. This seems to be much better in
libfabric 1.6, where I have never seen the above issue.
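For reference, the fd-plus-epoll monitoring pattern can be sketched as
follows. This is a minimal illustration in Python, not the actual
application code: a pipe stands in for the wait file descriptor of a
completion or event queue (in libfabric this would come from a queue
opened with an FI_WAIT_FD wait object), and wait_for_events is a
hypothetical helper name. It also shows a general property of
level-triggered epoll: a descriptor that is not drained is reported
again immediately on every call. Whether anything like that was behind
the system-call load seen with 1.4 is not established here.

```python
import os
import select


def wait_for_events(epoll, timeout_s):
    """Wait in epoll and return the descriptors that are readable."""
    return [fd for fd, mask in epoll.poll(timeout_s) if mask & select.EPOLLIN]


# Assumption: a pipe's read end stands in for the queue's wait descriptor.
read_fd, write_fd = os.pipe()
os.set_blocking(read_fd, False)

epoll = select.epoll()
epoll.register(read_fd, select.EPOLLIN)  # level-triggered by default

# Nothing pending: the call times out and reports no descriptors.
idle = wait_for_events(epoll, 0.01)

# Simulate a completion being signalled on the descriptor.
os.write(write_fd, b"\x01")
ready = wait_for_events(epoll, 1.0)

# Not drained yet: level-triggered epoll reports the descriptor again
# right away, so a loop that never drains it spins through system calls.
still_ready = wait_for_events(epoll, 1.0)

# Drain the descriptor; afterwards epoll goes back to waiting.
os.read(read_fd, 4096)
after_drain = wait_for_events(epoll, 0.01)

epoll.close()
os.close(read_fd)
os.close(write_fd)
```

In a real application the drain step would correspond to reading all
pending entries from the queue before waiting again.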
Second issue. This one is a bit odd as it involves the PCIe bus. Apart
from the NIC, the PC also contains a custom-designed PCIe card.
With libfabric 1.4, using the NIC seems to put a lot more pressure on
the PCIe bus than with libfabric 1.6. We see that our custom card
delivers corrupted data because it has to wait for the PCIe bus
when the NIC is under load with libfabric 1.4, but it works just fine
with libfabric 1.6.
I would like to understand these issues better because they might
indicate a deeper problem either in my code or in our custom card (and
it is giving me some headaches). So my question is: what are the crucial
changes in 1.6 compared to 1.4 regarding the verbs provider? Which
changes could cause the difference in PCIe utilization? The changelog is
not very detailed in this regard.
Any idea that could shed some light on these mysteries, or just a
clarification of the changes from 1.4 to 1.6, would be greatly
appreciated.
Thanks all and sorry for the long read!