[libfabric-users] Performance improvements with libfabric 1.6

Ilango, Arun arun.ilango at intel.com
Fri Aug 24 15:11:49 PDT 2018


I'm not aware of a single optimization that fixes the issue you were seeing. A lot of changes went in between 1.4 and 1.6, and it's difficult to pinpoint a particular commit. You could try a git bisect. At a cursory glance, the following commit seemed relevant, but there may be others too.
7e6d98bd74470e0cb3c1277b66b1b17112c5d570

> Were there any changes on the completion mechanism that you are aware of?
One change in this regard is that the verbs provider now requests all posted sends to be signaled (IBV_SEND_SIGNALED) by default. Earlier, it requested signaling only occasionally.

Thanks,
Arun.

-----Original Message-----
From: Jörn Schumacher [mailto:jorn.schumacher at cern.ch] 
Sent: Thursday, August 23, 2018 2:36 AM
To: Ilango, Arun <arun.ilango at intel.com>; libfabric-users at lists.openfabrics.org
Subject: Re: [libfabric-users] Performance improvements with libfabric 1.6

Hi Arun,

I am using FI_EP_MSG with fi_send/fi_recv. I usually send relatively large buffers of around 1 MB, if that makes any difference.

Could you elaborate on the optimizations or maybe just point me to a commit that I could look at?

Were there any changes to the completion mechanism that you are aware of?


Thanks a lot for your help,

Jörn




On 08/22/2018 09:25 PM, Ilango, Arun wrote:
> Hi Jörn,
>
> What endpoint type are you using? For FI_EP_RDM type endpoints, the default path changed from internal verbs RDM support (v1.4) to the RxM utility provider (v1.6). For FI_EP_MSG, a few optimizations were added to improve latency in the send/recv path. v1.6 also has a memory registration cache but that should be turned off by default.
>
> What operations are used by your app (fi_inject/fi_send/fi_write/read, etc)?
>
> Thanks,
> Arun.
>
> -----Original Message-----
> From: Libfabric-users 
> [mailto:libfabric-users-bounces at lists.openfabrics.org] On Behalf Of 
> Jörn Schumacher
> Sent: Tuesday, August 21, 2018 2:18 AM
> To: libfabric-users at lists.openfabrics.org
> Subject: [libfabric-users] Performance improvements with libfabric 1.6
>
> Dear libfabric developers,
>
> I already wrote another message regarding libfabric 1.6, but this is a different topic, so I am opening another thread to keep things separate.
>
> I recently updated from libfabric 1.4 to 1.6.1. I use the Verbs provider with 100G Ethernet+RoCE and 56G Infiniband. Queues are monitored with file descriptors and epoll. The situation improved significantly with 1.6, so this is not a bug report but rather I would like to get a deeper understanding of what changed and why it is affecting us so much.
>
>
> With libfabric 1.4, my application would occasionally run at very poor performance. Read: around 0.5 Gbps instead of 20+ Gbps, so quite a dramatic effect. A restart usually fixed it. This happened more often on the 100G Ethernet than on the 56G IB, but was present on both.
>
> When this happens, I see high CPU utilization and a lot of CPU time spent in system calls. I suspect this has to do with epoll polling on the completion and event queues. This seems to be much better in libfabric 1.6, where I have never seen the above issue.
>
>
>
> Second issue. This one is a bit odd, as it involves the PCIe bus. Apart from the NIC, the PC contains another custom-designed PCIe card [1].
> With libfabric 1.4, using the NIC seems to put a lot more pressure on the PCIe bus than with libfabric 1.6. Our custom card delivers corrupted data because it has to wait for the PCIe bus when the NIC is under load with libfabric 1.4, but it works just fine with libfabric 1.6.
>
>
> I would like to understand these issues better because they might indicate a deeper problem either in my code or in our custom card (and they are giving me some headaches). So my question is: what are the crucial changes in 1.6 compared to 1.4 regarding the verbs provider? Which changes could cause the difference in PCIe utilization? The changelog is not very detailed in this regard.
>
> Any idea that could shed some light on these mysteries, or just a clarification of the changes from 1.4 to 1.6, would be greatly appreciated.
>
> Thanks all and sorry for the long read!
>
> Cheers,
> Jörn
>
>
> [1] http://cds.cern.ch/record/2229597


