[libfabric-users] Performance improvements with libfabric 1.6

Jörn Schumacher joern.schumacher at cern.ch
Tue Aug 21 02:17:47 PDT 2018


Dear libfabric developers,

I already sent another message regarding libfabric 1.6, but this is a
different topic, so I am opening a separate thread to keep things apart.

I recently updated from libfabric 1.4 to 1.6.1. I use the Verbs provider
with 100G Ethernet+RoCE and 56G InfiniBand. Queues are monitored with
file descriptors and epoll. The situation improved significantly with
1.6, so this is not a bug report; rather, I would like to get a deeper
understanding of what changed and why it is affecting us so much.


With libfabric 1.4 I had an issue where my application would
occasionally run at very poor performance: around 0.5 Gbps instead of
20+ Gbps, so quite a dramatic effect. A restart usually fixed it. This
happened more often on the 100G Ethernet than on the 56G IB, but was
present on both.

When this happened I saw high CPU utilization, with a lot of CPU time
spent in system calls. I suspect this has to do with epoll polling on
the completion and event queues. This seems to be much better in
libfabric 1.6, where I have never seen the above issue.
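
A minimal sketch of the fd-plus-epoll wait pattern referred to above,
for readers who want something concrete to reason about. It is not the
application's code and does not call libfabric: `mock_cq` and all
`mock_cq_*` functions are hypothetical stand-ins, with an eventfd playing
the role of the wait fd. With libfabric the fd would be obtained via
fi_control() with FI_GETWAIT, the drain would be a loop over
fi_cq_read(), and fi_trywait() would be called before blocking. The
sketch drains everything that is pending before it blocks, so the number
of epoll_wait calls does not grow with the number of completions.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical stand-in for a completion queue (not a libfabric type). */
struct mock_cq {
    int wait_fd;       /* eventfd acting as the queue's wait object */
    unsigned pending;  /* completions posted but not yet read */
};

/* Create the mock queue. Returns 0 on success, -1 on error. */
int mock_cq_open(struct mock_cq *cq)
{
    cq->pending = 0;
    cq->wait_fd = eventfd(0, EFD_NONBLOCK);
    return cq->wait_fd < 0 ? -1 : 0;
}

/* Register the wait fd with a new epoll instance. Returns the epoll fd or -1. */
int mock_cq_watch(const struct mock_cq *cq)
{
    struct epoll_event ev = { .events = EPOLLIN };
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    ev.data.fd = cq->wait_fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, cq->wait_fd, &ev) < 0) {
        close(epfd);
        return -1;
    }
    return epfd;
}

/* Producer side: post n completions and signal the wait fd once. */
int mock_cq_post(struct mock_cq *cq, unsigned n)
{
    uint64_t one = 1;

    cq->pending += n;
    return write(cq->wait_fd, &one, sizeof one) == (ssize_t)sizeof one ? 0 : -1;
}

/* Non-blocking drain: take everything that is pending, without a system call. */
unsigned mock_cq_drain(struct mock_cq *cq)
{
    unsigned n = cq->pending;

    cq->pending = 0;
    return n;
}

/* Block until the wait fd is signaled, then clear the signal.
 * Returns 1 if signaled, 0 on timeout, -1 on error. */
int mock_cq_wait(struct mock_cq *cq, int epfd, int timeout_ms)
{
    struct epoll_event ev;
    uint64_t count;
    int n = epoll_wait(epfd, &ev, 1, timeout_ms);

    if (n <= 0)
        return n;
    if (read(cq->wait_fd, &count, sizeof count) != (ssize_t)sizeof count)
        return -1;
    return 1;
}

/* Process `want` completions, draining first and blocking only when the
 * queue is empty. Returns the number of blocking waits, or -1 on
 * timeout or error. */
int process_completions(struct mock_cq *cq, int epfd, unsigned want,
                        int timeout_ms)
{
    unsigned done = 0;
    int waits = 0;

    for (;;) {
        done += mock_cq_drain(cq);
        if (done >= want)
            return waits;
        if (mock_cq_wait(cq, epfd, timeout_ms) <= 0)
            return -1;
        waits++;
    }
}
```

In this mock, posting a batch and then calling process_completions()
completes without a single blocking wait, because the drain runs first.
The signal left behind on the fd then causes one spurious wakeup on the
next wait, which is harmless as long as the loop drains before it blocks
again. A loop that instead calls epoll_wait once per completion pays at
least one system call per message, which is one possible source of the
high system-call time described above.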



Second issue. This one is a bit odd as it involves the PCIe bus. Apart
from the NIC, the PC contains another, custom-designed PCIe card [1].
With libfabric 1.4, using the NIC seems to put a lot more pressure on
the PCIe bus than with libfabric 1.6. With libfabric 1.4, our custom
card delivers corrupted data because it has to wait for the PCIe bus
when the NIC is under load, but it works just fine with libfabric 1.6.


I would like to understand these issues better because they might
indicate a deeper problem either in my code or in our custom card (and
they are giving me a headache). So my question is: what are the crucial
changes in 1.6 compared to 1.4 regarding the verbs provider? Which
changes could cause the difference in PCIe utilization? The changelog is
not very detailed in this regard.

Any idea that could shed some light on these mysteries, or just a
clarification of the changes from 1.4 to 1.6, would be greatly
appreciated.

Thanks all and sorry for the long read!

Cheers,
Jörn


[1] http://cds.cern.ch/record/2229597

