[libfabric-users] Fwd: Optimisation Tips for verbs provider

Valentino Picotti valentino.picotti at gmail.com
Mon Sep 12 06:53:03 PDT 2016


Thanks for the reply, Arun.

> You can run the fi_msg_bw test with other completion-waiting options like
> sread and fd too. I don’t get any variance when using them as well.
>
There is no difference between cq_sread and spinning on cq_read, since the
verbs cq_sread implementation spins on cq_read.
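
On verbs, both options therefore boil down to a spin loop like this (a minimal
sketch, assuming a CQ opened with FI_CQ_FORMAT_CONTEXT; error handling omitted):

  #include <rdma/fi_domain.h>   /* struct fid_cq, struct fi_cq_entry, fi_cq_read() */
  #include <rdma/fi_errno.h>    /* FI_EAGAIN */

  /* Spin until one completion is available; effectively what the verbs
   * provider's fi_cq_sread does internally. */
  static ssize_t spin_read_one(struct fid_cq *cq, struct fi_cq_entry *entry)
  {
          ssize_t ret;

          do {
                  ret = fi_cq_read(cq, entry, 1);
          } while (ret == -FI_EAGAIN);

          return ret;   /* 1 on success, a negative fi_errno value otherwise */
  }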

> The test calls fi_cq_read after posting every window-size number of
> sends/recvs. The window size is an adjustable parameter. You can view all
> available options by calling fi_msg_bw -h.
>
> Maybe you could try posting a bunch of sends/recvs and then collecting the
> completions for the bunch. Is there a need to post the messages one by one?
> If that’s the case, please try using a spin wait for getting the completion.
> But even that might not guarantee consistent numbers.
>

I ran the test and configured it in a way that is similar to our
client/server test. First of all I made sure bw_tx/rx_comp()
<https://github.com/ofiwg/fabtests/blob/master/benchmarks/benchmark_shared.c#L188>
is never called, by setting the window size equal to the number of iterations.
Then I applied the following modifications to shared.c.

Here is my diff of shared.c:

diff --git a/common/shared.c b/common/shared.c
index 1709443..f93a209 100644
--- a/common/shared.c
+++ b/common/shared.c
@@ -344,7 +344,7 @@ int ft_alloc_ep_res(struct fi_info *fi)

        if (opts.options & FT_OPT_TX_CQ) {
                ft_cq_set_wait_attr();
-               cq_attr.size = fi->tx_attr->size;
+               cq_attr.size = 1;//fi->tx_attr->size;
                ret = fi_cq_open(domain, &cq_attr, &txcq, &txcq);
                if (ret) {
                        FT_PRINTERR("fi_cq_open", ret);

@@ -363,7 +363,7 @@ int ft_alloc_ep_res(struct fi_info *fi)

        if (opts.options & FT_OPT_RX_CQ) {
                ft_cq_set_wait_attr();
-               cq_attr.size = fi->rx_attr->size;
+               cq_attr.size = 1;//fi->rx_attr->size;
                ret = fi_cq_open(domain, &cq_attr, &rxcq, &rxcq);
                if (ret) {
                        FT_PRINTERR("fi_cq_open", ret);

@@ -1224,9 +1224,12 @@ static int ft_spin_for_comp(struct fid_cq *cq, uint64_t *cur,
                } else if (timeout >= 0) {
                        clock_gettime(CLOCK_MONOTONIC, &b);
                        if ((b.tv_sec - a.tv_sec) > timeout) {
-                               fprintf(stderr, "%ds timeout expired\n", timeout);
-                               return -FI_ENODATA;
+                               //fprintf(stderr, "Total: %d %ds timeout expired\n", total, timeout);
+                               return 0;//-FI_ENODATA;
                        }
+               } else if (ret == -FI_EAGAIN && timeout == -1) {
+                       //fprintf(stdout, "Iter: %d\n", total);
+                       return 0;
                }
        }

The first two edits initialise the CQs with a depth of 1, the same as in our
client/server test. The third edit involves this function
<https://github.com/ofiwg/fabtests/blob/master/common/shared.c#L1205>.
The first else-if statement shown in the third hunk is the path taken by the
server process when fi_cq_read, called after a failed fi_recv (-FI_EAGAIN),
itself returns -FI_EAGAIN:
returning zero from this function causes an immediate retry of the post-receive
operation; in other words, we don't exit the while loop of the FT_POST macro.
The second else-if statement, whose condition is effectively that of a plain
else, is the path taken when:
- ft_spin_for_comp is called with a timeout of -1, i.e. in the client process, and
- fi_cq_read, called after a failed fi_send (-FI_EAGAIN), returns -FI_EAGAIN.

With these modifications, the msg_bw test stays in the FT_POST macro until a
post operation succeeds.
The send/receive loop of the test is reduced to the following:
- post a work request
- make a non-blocking read on the completion queue
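
In code, the client side then amounts to roughly the following (a simplified
sketch with placeholder names, not the literal FT_POST expansion):

  #include <rdma/fi_endpoint.h>   /* fi_send() */
  #include <rdma/fi_domain.h>     /* fi_cq_read(), struct fi_cq_entry */
  #include <rdma/fi_errno.h>      /* FI_EAGAIN */

  /* Keep retrying the post; on -FI_EAGAIN make one non-blocking
   * fi_cq_read() to drive progress and drain the CQ, then retry. */
  static int run_client(struct fid_ep *ep, struct fid_cq *txcq,
                        void *buf, size_t size, void *desc, int iterations)
  {
          struct fi_cq_entry entry;
          ssize_t ret;
          int i;

          for (i = 0; i < iterations; i++) {
                  for (;;) {
                          ret = fi_send(ep, buf, size, desc, 0, NULL);
                          if (!ret)
                                  break;              /* post accepted */
                          if (ret != -FI_EAGAIN)
                                  return (int) ret;   /* real error */
                          (void) fi_cq_read(txcq, &entry, 1);
                  }
          }
          return 0;
  }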

This is what I get:

./picotti/fabtests/bin/fi_msg_bw -t queue -c spin -f verbs -S $((1024*512))
-I 10000 -W 10000 -w 0 -s 10.23.4.166

bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec

512k    10k     4.8g       16.58s    316.16    1658.28       0.00

bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec

512k    10k     4.8g        6.43s    815.20     643.14       0.00

bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec

512k    10k     4.8g       16.86s    311.04    1685.61       0.00

bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec

512k    10k     4.8g       13.15s    398.61    1315.30       0.00

These results are in line with the bandwidth of our client/server test.



> Currently, there is an issue of unpredictable transfer times if the sender
> overruns the receiver.
>

So have I reproduced this issue with the msg_bw test? Is it a libfabric issue?
I don't understand how bw_tx/rx_comp avoids this problem.
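
My reading of the unmodified test is roughly the following (a paraphrase with
placeholder helpers, not the actual fabtests code): it never has more than
window_size posts outstanding before reaping their completions.

  void post_one_send(void);          /* placeholder for ft_post_tx() */
  void drain_tx_completions(void);   /* placeholder for bw_tx_comp() */

  /* Windowed sender: after every window_size posts, block until all
   * outstanding sends have completed before posting more. */
  static void windowed_sender(int iterations, int window_size)
  {
          int i;

          for (i = 0; i < iterations; i++) {
                  post_one_send();
                  if ((i + 1) % window_size == 0)
                          drain_tx_completions();
          }
  }

If that reading is right, the sender can never run more than window_size
messages ahead of the completions it has reaped, whereas with the window size
set equal to the number of iterations there is no such bound.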


-Arun.
>

Thanks Arun,

Valentino.


>
> *From:* Libfabric-users [mailto:libfabric-users-bounces at lists.openfabrics.org]
> *On Behalf Of *Valentino Picotti
> *Sent:* Thursday, September 08, 2016 5:12 AM
> *To:* libfabric-users at lists.openfabrics.org
> *Subject:* [libfabric-users] Fwd: Optimisation Tips for verbs provider
>
>
>
> I forgot to CC the list, here is my reply to Arun:
>
>
>
> ---------- Forwarded message ----------
> From: *Valentino Picotti* <valentino.picotti at gmail.com>
> Date: 8 September 2016 at 14:00
> Subject: Re: [libfabric-users] Optimisation Tips for verbs provider
> To: "Ilango, Arun" <arun.ilango at intel.com>
>
> Thanks for the reply,
>
>
>
> I ran the fi_msg_bw test with a CQ size and window size of 1 and got the
> following result:
>
> bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
>
> 512k    1m      488g      189.50s   2766.76     189.50       0.01
>
> 21.61 Gbps is an excellent result. I'm using libfabric 1.3.0 from the
> latest tarball.
>
>
>
> So the problem is in my transport layer.
>
> My fabric initialisation doesn't differ much from that of fi_msg_bw,
> so the problem might be in the main loop.
>
> At first glance, it seems that I call fi_cq_read less often than the bw
> test does.
>
> In the test the sequence is:
>
> - post work request ft_post_tx/rx
>
> - spin on a completion bw_(tx/rx)_comp
>
>
>
> In my client/server main loop:
>
> - call fi_cq_read
>
> - post work request
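>
> On the client side that is roughly (placeholder names, simplified):
>
>   while (running) {
>           /* one non-blocking CQ read; usually returns -FI_EAGAIN */
>           fi_cq_read(txcq, &entry, 1);
>           /* then one post, which may itself fail with -FI_EAGAIN */
>           fi_send(ep, buf, size, desc, 0, NULL);
>   }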
>
>
>
> I don't spin waiting for completions; could this be the reason?
>
>
>
> Thanks,
>
> Valentino
>
>
>
>
>
> On 7 September 2016 at 19:17, Ilango, Arun <arun.ilango at intel.com> wrote:
>
> Hi Valentino,
>
>
>
> Libfabric has a set of tests available at https://github.com/ofiwg/fabtests.
> Can you run the fi_msg_bw test with the same size and iterations on your
> setup and check if you notice any variance? Also what version/commit number
> of libfabric are you using?
>
>
>
> Thanks,
>
> Arun.
>
>
>
> *From:* Libfabric-users [mailto:libfabric-users-bounces at lists.openfabrics.org]
> *On Behalf Of *Valentino Picotti
> *Sent:* Wednesday, September 07, 2016 7:48 AM
> *To:* libfabric-users at lists.openfabrics.org
> *Subject:* [libfabric-users] Optimisation Tips for verbs provider
>
>
>
> Hi all,
>
>
>
> I apologise in advance for the long email.
>
>
>
> In the past month I've integrated libfabric into a project based on
> InfiniBand verbs, with the aim of being provider independent. This project
> has a transport layer that makes the application independent of the
> transport implementation (which is chosen at compile time).
>
> I worked only on the libfabric implementation of the transport layer, and
> this was my first experience with RDMA APIs and hardware. What I did was map
> the various ibv_* and rdma_* calls to fi_* calls, and I got a working layer
> quite easily (after studying the libfabric terminology).
>
> Now I'm trying to achieve the same performance as raw verbs.
>
> I'm testing the transport layer with a one-directional communication pattern
> where a client sends data to a server with the message API (fi_send/fi_recv).
> The client and the server run on two different nodes connected by one IB
> EDR link: I don't set processor affinity or change the power management
> policy. The depth of the completion queues and the size of the sent buffers
> are the same across the tests.
>
> Running on the raw verbs transport layer I get a stable bandwidth of 22 Gbps,
> whereas with libfabric over verbs I get a widely fluctuating bandwidth: from
> 0.4 Gbps to 19 Gbps within the same test [1]. The bandwidth is calculated
> from the number of buffers sent every 5 seconds.
>
>
>
> This is how I set up the verbs provider:
>
>
>   m_hints->caps = FI_MSG;
>   m_hints->mode = FI_LOCAL_MR;
>   m_hints->ep_attr->type = FI_EP_MSG;
>   m_hints->domain_attr->threading = FI_THREAD_COMPLETION;
>   m_hints->domain_attr->data_progress = FI_PROGRESS_MANUAL;  /* progress driven by the app's CQ reads */
>   m_hints->domain_attr->resource_mgmt = FI_RM_DISABLED;      /* no provider protection against queue overruns */
>   m_hints->fabric_attr->prov_name = strdup("verbs");
>
>
>
> Furthermore I bind two completion queues to the endpoints: one with the
> FI_SEND flag and the other with FI_RECV.
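>
> Concretely the binding is roughly this (m_ep, m_txcq and m_rxcq are
> placeholder names; error handling omitted):
>
>   fi_ep_bind(m_ep, &m_txcq->fid, FI_SEND);   /* send completions */
>   fi_ep_bind(m_ep, &m_rxcq->fid, FI_RECV);   /* recv completions */
>   fi_enable(m_ep);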
>
>
>
> I can't figure out why I'm getting such high variance with libfabric.
>
> Do you have any ideas? Am I missing some optimisation tips for the verbs
> provider?
>
>
>
> Thanks in advance,
>
>
>
> Valentino
>
>
>
>
>
> [1] Test run with a queue depth of 1 and a buffer size of 512 KB
>
>
>
> Example of a test output with libfabric:
>
> 2016-09-07 - 15:10:56 t_server: INFO: Accepted connection
>
> 2016-09-07 - 15:10:56 t_server: INFO: Start receiving...
>
> 2016-09-07 - 15:11:01 t_server: INFO: Bandwith: 8.3324 Gb/s
>
> 2016-09-07 - 15:11:06 t_server: INFO: Bandwith: 15.831 Gb/s
>
> 2016-09-07 - 15:11:11 t_server: INFO: Bandwith: 19.1713 Gb/s
>
> 2016-09-07 - 15:11:16 t_server: INFO: Bandwith: 10.8825 Gb/s
>
> 2016-09-07 - 15:11:21 t_server: INFO: Bandwith: 8.07991 Gb/s
>
> 2016-09-07 - 15:11:26 t_server: INFO: Bandwith: 15.4015 Gb/s
>
> 2016-09-07 - 15:11:31 t_server: INFO: Bandwith: 20.4263 Gb/s
>
> 2016-09-07 - 15:11:36 t_server: INFO: Bandwith: 19.7023 Gb/s
>
> 2016-09-07 - 15:11:41 t_server: INFO: Bandwith: 10.474 Gb/s
>
> 2016-09-07 - 15:11:46 t_server: INFO: Bandwith: 17.4072 Gb/s
>
> 2016-09-07 - 15:11:51 t_server: INFO: Bandwith: 0.440402 Gb/s
>
> 2016-09-07 - 15:11:56 t_server: INFO: Bandwith: 2.73217 Gb/s
>
> 2016-09-07 - 15:12:01 t_server: INFO: Bandwith: 0.984822 Gb/s
>
> 2016-09-07 - 15:12:06 t_server: INFO: Bandwith: 2.93013 Gb/s
>
> 2016-09-07 - 15:12:11 t_server: INFO: Bandwith: 0.847248 Gb/s
>
> 2016-09-07 - 15:12:16 t_server: INFO: Bandwith: 7.72255 Gb/s
>
> 2016-09-07 - 15:12:21 t_server: INFO: Bandwith: 14.7849 Gb/s
>
> 2016-09-07 - 15:12:26 t_server: INFO: Bandwith: 12.9243 Gb/s
>
> 2016-09-07 - 15:12:31 t_server: INFO: Bandwith: 0.687027 Gb/s
>
> 2016-09-07 - 15:12:36 t_server: INFO: Bandwith: 1.44787 Gb/s
>
> 2016-09-07 - 15:12:41 t_server: INFO: Bandwith: 2.681 Gb/s
>
>
>
> Example of a test output with raw verbs:
>
> 2016-09-07 - 16:36:00 t_server: INFO: Accepted connection
>
> 2016-09-07 - 16:36:00 t_server: INFO: Start receiving...
>
> 2016-09-07 - 16:36:05 t_server: INFO: Bandwith: 17.9491 Gb/s
>
> 2016-09-07 - 16:36:10 t_server: INFO: Bandwith: 23.4671 Gb/s
>
> 2016-09-07 - 16:36:15 t_server: INFO: Bandwith: 23.0368 Gb/s
>
> 2016-09-07 - 16:36:20 t_server: INFO: Bandwith: 22.9638 Gb/s
>
> 2016-09-07 - 16:36:25 t_server: INFO: Bandwith: 22.8203 Gb/s
>
> 2016-09-07 - 16:36:30 t_server: INFO: Bandwith: 20.058 Gb/s
>
> 2016-09-07 - 16:36:35 t_server: INFO: Bandwith: 22.5033 Gb/s
>
> 2016-09-07 - 16:36:40 t_server: INFO: Bandwith: 20.1754 Gb/s
>
> 2016-09-07 - 16:36:45 t_server: INFO: Bandwith: 22.5578 Gb/s
>
> 2016-09-07 - 16:36:50 t_server: INFO: Bandwith: 20.0588 Gb/s
>
> 2016-09-07 - 16:36:55 t_server: INFO: Bandwith: 22.2718 Gb/s
>
> 2016-09-07 - 16:37:00 t_server: INFO: Bandwith: 22.494 Gb/s
>
> 2016-09-07 - 16:37:05 t_server: INFO: Bandwith: 23.1836 Gb/s
>
> 2016-09-07 - 16:37:10 t_server: INFO: Bandwith: 23.0972 Gb/s
>
> 2016-09-07 - 16:37:15 t_server: INFO: Bandwith: 21.5033 Gb/s
>
> 2016-09-07 - 16:37:20 t_server: INFO: Bandwith: 18.5506 Gb/s
>
> 2016-09-07 - 16:37:25 t_server: INFO: Bandwith: 20.3709 Gb/s
>
> 2016-09-07 - 16:37:30 t_server: INFO: Bandwith: 21.3457 Gb/s
>
> 2016-09-07 - 16:37:35 t_server: INFO: Bandwith: 20.5059 Gb/s
>
> 2016-09-07 - 16:37:40 t_server: INFO: Bandwith: 22.4899 Gb/s
>
> 2016-09-07 - 16:37:45 t_server: INFO: Bandwith: 22.1266 Gb/s
>
> 2016-09-07 - 16:37:50 t_server: INFO: Bandwith: 22.4504 Gb/s
>
>
>
>
>
>
>
>
>