[libfabric-users] Optimisation Tips for verbs provider
Valentino Picotti
valentino.picotti at gmail.com
Wed Sep 7 07:47:42 PDT 2016
Hi all,
I apologise in advance for the long email.
Over the past month I've integrated libfabric into a project based on
InfiniBand verbs, with the aim of making it provider independent. The
project has a transport layer that makes the application independent of the
transport implementation (which is chosen at compile time).
I worked only on the libfabric implementation of the transport layer, and
this was my first experience with RDMA APIs and hardware. I mapped the
various ibv_* and rdma_* calls to fi_* calls and got a working layer quite
easily (after studying the libfabric terminology).
Now I'm trying to achieve the same performance as raw verbs.
I'm testing the transport layer with a one-directional communication
pattern in which a client sends data to a server with the message API
(fi_send/fi_recv). The client and the server run on two different nodes
connected by one IB EDR link; I don't set processor affinity or change the
power management policy. The depth of the completion queues and the size of
the sent buffers are the same across the tests.
Running on the verbs transport layer I get a stable bandwidth of 22 Gbps,
whereas with libfabric over verbs the bandwidth fluctuates widely: from 0.4
Gbps to 19 Gbps within the same test [1]. The bandwidth is calculated from
the number of buffers sent every 5 seconds.
This is how I set up the verbs provider:
/* Hints for selecting the verbs provider: connection-oriented
 * messaging with manual progress and resource management disabled. */
m_hints->caps = FI_MSG;
m_hints->mode = FI_LOCAL_MR;            /* local buffers are registered explicitly */
m_hints->ep_attr->type = FI_EP_MSG;     /* reliable, connected endpoint */
m_hints->domain_attr->threading = FI_THREAD_COMPLETION;
m_hints->domain_attr->data_progress = FI_PROGRESS_MANUAL;
m_hints->domain_attr->resource_mgmt = FI_RM_DISABLED;
m_hints->fabric_attr->prov_name = strdup("verbs");
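These hints then go to fi_getinfo() in the usual way (a sketch; node,
service, and the error handling are placeholders from my side):

    /* Requires <rdma/fabric.h> and <stdio.h>. */
    struct fi_info *m_info;
    int ret = fi_getinfo(FI_VERSION(1, 3), node, service, 0, m_hints,
                         &m_info);
    if (ret)
            fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));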
Furthermore, I bind two completion queues to the endpoint: one with the
FI_SEND flag and the other with FI_RECV.
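For reference, the setup and the polling path look roughly like this (a
sketch with placeholder names; since data_progress is FI_PROGRESS_MANUAL,
the application has to drive progress by polling the CQs):

    /* Open one CQ for transmit and one for receive. */
    struct fi_cq_attr cq_attr = { 0 };
    cq_attr.format = FI_CQ_FORMAT_MSG;      /* completions as fi_cq_msg_entry */
    cq_attr.size = CQ_DEPTH;                /* same depth across all tests */
    fi_cq_open(domain, &cq_attr, &tx_cq, NULL);
    fi_cq_open(domain, &cq_attr, &rx_cq, NULL);
    fi_ep_bind(ep, &tx_cq->fid, FI_SEND);   /* send completions -> tx_cq */
    fi_ep_bind(ep, &rx_cq->fid, FI_RECV);   /* receive completions -> rx_cq */

    /* Server side: post a receive and busy-poll until it completes. */
    struct fi_cq_msg_entry comp;
    fi_recv(ep, buf, BUF_SIZE, fi_mr_desc(mr), 0, NULL);
    while (fi_cq_read(rx_cq, &comp, 1) == -FI_EAGAIN)
            ;   /* manual progress: nothing advances unless we poll */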
I can't figure out why I'm getting such high variance with libfabric.
Do you have any ideas? Am I missing some optimisation tips for the verbs
provider?
Thanks in advance,
Valentino
[1] Tests run with a queue depth of 1 and a buffer size of 512 KB
Example of test output with libfabric:
2016-09-07 - 15:10:56 t_server: INFO: Accepted connection
2016-09-07 - 15:10:56 t_server: INFO: Start receiving...
2016-09-07 - 15:11:01 t_server: INFO: Bandwidth: 8.3324 Gb/s
2016-09-07 - 15:11:06 t_server: INFO: Bandwidth: 15.831 Gb/s
2016-09-07 - 15:11:11 t_server: INFO: Bandwidth: 19.1713 Gb/s
2016-09-07 - 15:11:16 t_server: INFO: Bandwidth: 10.8825 Gb/s
2016-09-07 - 15:11:21 t_server: INFO: Bandwidth: 8.07991 Gb/s
2016-09-07 - 15:11:26 t_server: INFO: Bandwidth: 15.4015 Gb/s
2016-09-07 - 15:11:31 t_server: INFO: Bandwidth: 20.4263 Gb/s
2016-09-07 - 15:11:36 t_server: INFO: Bandwidth: 19.7023 Gb/s
2016-09-07 - 15:11:41 t_server: INFO: Bandwidth: 10.474 Gb/s
2016-09-07 - 15:11:46 t_server: INFO: Bandwidth: 17.4072 Gb/s
2016-09-07 - 15:11:51 t_server: INFO: Bandwidth: 0.440402 Gb/s
2016-09-07 - 15:11:56 t_server: INFO: Bandwidth: 2.73217 Gb/s
2016-09-07 - 15:12:01 t_server: INFO: Bandwidth: 0.984822 Gb/s
2016-09-07 - 15:12:06 t_server: INFO: Bandwidth: 2.93013 Gb/s
2016-09-07 - 15:12:11 t_server: INFO: Bandwidth: 0.847248 Gb/s
2016-09-07 - 15:12:16 t_server: INFO: Bandwidth: 7.72255 Gb/s
2016-09-07 - 15:12:21 t_server: INFO: Bandwidth: 14.7849 Gb/s
2016-09-07 - 15:12:26 t_server: INFO: Bandwidth: 12.9243 Gb/s
2016-09-07 - 15:12:31 t_server: INFO: Bandwidth: 0.687027 Gb/s
2016-09-07 - 15:12:36 t_server: INFO: Bandwidth: 1.44787 Gb/s
2016-09-07 - 15:12:41 t_server: INFO: Bandwidth: 2.681 Gb/s
Example of test output with raw verbs:
2016-09-07 - 16:36:00 t_server: INFO: Accepted connection
2016-09-07 - 16:36:00 t_server: INFO: Start receiving...
2016-09-07 - 16:36:05 t_server: INFO: Bandwidth: 17.9491 Gb/s
2016-09-07 - 16:36:10 t_server: INFO: Bandwidth: 23.4671 Gb/s
2016-09-07 - 16:36:15 t_server: INFO: Bandwidth: 23.0368 Gb/s
2016-09-07 - 16:36:20 t_server: INFO: Bandwidth: 22.9638 Gb/s
2016-09-07 - 16:36:25 t_server: INFO: Bandwidth: 22.8203 Gb/s
2016-09-07 - 16:36:30 t_server: INFO: Bandwidth: 20.058 Gb/s
2016-09-07 - 16:36:35 t_server: INFO: Bandwidth: 22.5033 Gb/s
2016-09-07 - 16:36:40 t_server: INFO: Bandwidth: 20.1754 Gb/s
2016-09-07 - 16:36:45 t_server: INFO: Bandwidth: 22.5578 Gb/s
2016-09-07 - 16:36:50 t_server: INFO: Bandwidth: 20.0588 Gb/s
2016-09-07 - 16:36:55 t_server: INFO: Bandwidth: 22.2718 Gb/s
2016-09-07 - 16:37:00 t_server: INFO: Bandwidth: 22.494 Gb/s
2016-09-07 - 16:37:05 t_server: INFO: Bandwidth: 23.1836 Gb/s
2016-09-07 - 16:37:10 t_server: INFO: Bandwidth: 23.0972 Gb/s
2016-09-07 - 16:37:15 t_server: INFO: Bandwidth: 21.5033 Gb/s
2016-09-07 - 16:37:20 t_server: INFO: Bandwidth: 18.5506 Gb/s
2016-09-07 - 16:37:25 t_server: INFO: Bandwidth: 20.3709 Gb/s
2016-09-07 - 16:37:30 t_server: INFO: Bandwidth: 21.3457 Gb/s
2016-09-07 - 16:37:35 t_server: INFO: Bandwidth: 20.5059 Gb/s
2016-09-07 - 16:37:40 t_server: INFO: Bandwidth: 22.4899 Gb/s
2016-09-07 - 16:37:45 t_server: INFO: Bandwidth: 22.1266 Gb/s
2016-09-07 - 16:37:50 t_server: INFO: Bandwidth: 22.4504 Gb/s