[libfabric-users] Fwd: Optimisation Tips for verbs provider

Hefty, Sean sean.hefty at intel.com
Mon Sep 12 17:51:23 PDT 2016


> I ran the test and configured it in such a way that it is similar to our
> client/server test. So first of all I made sure never to call
> bw_tx/rx_comp()

With native verbs, if you don't poll for completions, the QP will eventually overrun the CQ and halt.  I think the OFI verbs provider handles this case by returning -FI_EAGAIN if the send queue is full.

> <https://github.com/ofiwg/fabtests/blob/master/benchmarks/benchmark_shared.c#L188>,
> setting the window size equal to the number of iterations. Then I
> applied the following modifications to the shared.c file.
> 
> Here is my diff of shared.c:
> 
> 
> diff --git a/common/shared.c b/common/shared.c
> index 1709443..f93a209 100644
> --- a/common/shared.c
> +++ b/common/shared.c
> @@ -344,7 +344,7 @@ int ft_alloc_ep_res(struct fi_info *fi)
> 
> 
>         if (opts.options & FT_OPT_TX_CQ) {
>                 ft_cq_set_wait_attr();
> -               cq_attr.size = fi->tx_attr->size;
> +               cq_attr.size = 1;//fi->tx_attr->size;
>                 ret = fi_cq_open(domain, &cq_attr, &txcq, &txcq);
>                 if (ret) {
>                         FT_PRINTERR("fi_cq_open", ret);
> 
> @@ -363,7 +363,7 @@ int ft_alloc_ep_res(struct fi_info *fi)
> 
>         if (opts.options & FT_OPT_RX_CQ) {
>                 ft_cq_set_wait_attr();
> -               cq_attr.size = fi->rx_attr->size;
> +               cq_attr.size = 1;//fi->rx_attr->size;

This can result in overrunning the CQ quickly -- as early as the second send operation.  (The provider likely bumps this value up to something that aligns better with a page size.)

Given the structure of this benchmark, you will want the window size LESS THAN the size of the EP/QP and CQ.  Otherwise, the send queue will fill, but no completions will be removed from the CQ to free up space for additional operations.
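For illustration only, here is a minimal sketch of that pattern in C (all names are placeholders, not fabtests code).  It assumes ep and txcq were opened with tx_attr->size and cq_attr.size both at least as large as the window, so no post within a window fails for lack of queue space:

#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

static int send_windowed(struct fid_ep *ep, struct fid_cq *txcq,
			 void *buf, size_t len, fi_addr_t dest,
			 size_t iters, size_t window)
{
	struct fi_cq_entry entry;
	size_t i, pending = 0;
	ssize_t ret;

	for (i = 0; i < iters; i++) {
		ret = fi_send(ep, buf, len, NULL, dest, NULL);
		if (ret)
			return (int) ret;

		if (++pending < window && i + 1 < iters)
			continue;

		/* Reap every outstanding completion so the send queue
		 * and CQ slots are recycled before the next window. */
		while (pending) {
			ret = fi_cq_read(txcq, &entry, 1);
			if (ret == 1)
				pending--;
			else if (ret != -FI_EAGAIN)
				return (int) ret;
		}
	}
	return 0;
}

With that invariant, the CQ never has to hold more than one window's worth of entries at a time.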

>                 ret = fi_cq_open(domain, &cq_attr, &rxcq, &rxcq);
>                 if (ret) {
>                         FT_PRINTERR("fi_cq_open", ret);
> 
> @@ -1224,9 +1224,12 @@ static int ft_spin_for_comp(struct fid_cq *cq, uint64_t *cur,
> 
>                 } else if (timeout >= 0) {
>                         clock_gettime(CLOCK_MONOTONIC, &b);
>                         if ((b.tv_sec - a.tv_sec) > timeout) {
> -                               fprintf(stderr, "%ds timeout expired\n", timeout);
> -                               return -FI_ENODATA;
> +                               //fprintf(stderr, "Total: %d %ds timeout expired\n", total, timeout);
> +                               return 0;//-FI_ENODATA;
>                         }
> +               } else if (ret == -FI_EAGAIN && timeout == -1) {
> +                       //fprintf(stdout, "Iter: %d\n", total);
> +                       return 0;
>                 }
>         }
> 
> The first two edits initialise the CQs with a depth of 1, the same as in
> our client/server test. The third edit involves this function
> <https://github.com/ofiwg/fabtests/blob/master/common/shared.c#L1205>.
> The first else-if statement shown in the third diff is the path taken by
> the server process when a fi_cq_read, called after a failed fi_recv
> (-FI_EAGAIN), itself returns -FI_EAGAIN: returning zero from this
> function implies an immediate retry of the post-receive operation, in
> other words we don't exit the while loop of the FT_POST macro.
> The second else-if statement, whose condition is equivalent to a plain
> else, is the path taken when:
> - ft_spin_for_comp is called with timeout equal to -1, i.e. in the
> client process
> - fi_cq_read, called after a failed fi_send (-FI_EAGAIN), returns
> -FI_EAGAIN.

Once either fi_send or fi_recv returns -FI_EAGAIN, you must poll for (and retrieve!) completions to ensure forward progress.
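As a minimal sketch (illustrative names, not fabtests code), the send-side retry path would look something like:

#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

static ssize_t post_send_progress(struct fid_ep *ep, struct fid_cq *txcq,
				  void *buf, size_t len, fi_addr_t dest)
{
	struct fi_cq_entry entry;
	ssize_t ret;

	while ((ret = fi_send(ep, buf, len, NULL, dest, NULL)) == -FI_EAGAIN) {
		/* Consuming one completion is what releases a queue slot;
		 * simply retrying the post makes no progress. */
		ret = fi_cq_read(txcq, &entry, 1);
		if (ret < 0 && ret != -FI_EAGAIN)
			return ret;	/* real error, e.g. -FI_EAVAIL */
	}
	return ret;
}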

> With these modifications, the msg_bw test stays in the FT_POST macro
> until a post operation succeeds.
> The send/receive loop of the test is reduced to the following:
> - post a work request
> - make a non-blocking read on the completion queue
> 
> This is what I get:
> 
./picotti/fabtests/bin/fi_msg_bw -t queue -c spin -f verbs -S $((1024*512)) -I 10000 -W 10000 -w 0 -s 10.23.4.166
> 
> bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
> 512k    10k     4.8g       16.58s    316.16    1658.28       0.00
> 512k    10k     4.8g        6.43s    815.20     643.14       0.00
> 512k    10k     4.8g       16.86s    311.04    1685.61       0.00
> 512k    10k     4.8g       13.15s    398.61    1315.30       0.00
> 
> These results are in line with the bandwidth of our client/server test.
> 
> 
> 
> 	Currently, there is an issue of unpredictable transfer times if
> the sender overruns the receiver.
> 
> 
> So have I reproduced this issue with the msg_bw test? Is it a libfabric
> issue?
> I don't understand how bw_tx/rx_comp avoids this problem.

The bw tests rely on the window size being smaller than the size of the underlying queues.  Once that many operations are in progress, the test waits until it reads a completion before attempting to post more.
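Roughly, that flow control looks like the sketch below.  The counter names mirror fabtests' tx_seq/tx_cq_cntr, but this is a paraphrase of the idea, not the actual ft_post implementation:

#include <stdint.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

static ssize_t post_with_credits(struct fid_ep *ep, struct fid_cq *txcq,
				 void *buf, size_t len, fi_addr_t dest,
				 uint64_t *tx_seq, uint64_t *tx_cq_cntr,
				 uint64_t window)
{
	struct fi_cq_entry entry;
	ssize_t ret;

	/* Block here, before posting, once "window" operations are
	 * outstanding: reading a completion returns a credit. */
	while (*tx_seq - *tx_cq_cntr >= window) {
		ret = fi_cq_read(txcq, &entry, 1);
		if (ret == 1)
			(*tx_cq_cntr)++;
		else if (ret != -FI_EAGAIN)
			return ret;
	}

	ret = fi_send(ep, buf, len, NULL, dest, NULL);
	if (!ret)
		(*tx_seq)++;
	return ret;
}

With the window kept well below the queue sizes, fi_send should not itself hit -FI_EAGAIN, and completions are steadily drained as new operations are posted.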

- Sean

