[ewg] retry_cnt, rnr_retry and ibv_get_cq_event/ibv_req_notify_cq

Mon Apr 19 07:59:12 PDT 2010

Hello Roland

On Mon, Apr 19, 2010 at 4:16 PM, Roland Dreier <rdreier at cisco.com> wrote:
>  > If retry_cnt and rnr_retry are set to 7 (toggle the RETRY_COUNT
>  > #define at the top of the example code) as lots of documentation and
>  > examples suggest, I never get any events in my CQ. Setting it to
>  > another value, e.g. 6 causes the appropriate WCs to be generated.
> An rnr_retry of 7 means unlimited RNR retries -- i.e. if no receive is
> available at the receiver, then the sender will continue trying forever.
> For full details, you can look at the RNR NAK mechanism in the IB spec.

Ah, I suspected it was something like that. I'll check the spec.

>  > Furthermore, I was wondering if there is a way to use
>  > ibv_get_cq_event/ibv_req_notify_cq to do a non-blocking poll for CQ
>  > events? I have the option in the attached code, but I've found that I
>  > miss some CQ events, for example, when posting 100 sends, I will only
>  > get back about 3--5 WCs containing error statuses. If I poll, I get
>  > all the WCs.
> There seems to be some confusion here between CQ events (which are
> single-shot notifications generated when a completion is enqueued into
> an "armed" CQ -- if another completion is added before notification is
> requested again, then another event is not generated) and work
> completions (the entries themselves, which are enqueued on the CQ).
> So for 100 work requests, you may not always get 100 CQ events, but you
> should always get 100 work completions enqueued on the CQ eventually.

What still seems a bit tricky to me is exactly how many completions to
poll for while avoiding blocking in ibv_poll_cq.

Will num_entries=size of CQ be sufficient? It seems to me that it
could be, even in the case of multiple QPs, since if your CQ overflows
you are in trouble anyway, but consider the following sequence:

In initialization of CQ:

completion channel fd -> nonblocking
ibv_req_notify_cq

then when WCs start happening:

select/poll/whatever success
ibv_get_cq_event
// WCs can arrive here while the completion channel is disarmed
ibv_req_notify_cq
// WCs can arrive here too, all of which will also be picked up by the
// next ibv_poll_cq
n = ibv_poll_cq(num_entries=size of CQ)

select/poll/whatever success
ibv_get_cq_event
ibv_req_notify_cq
n = ibv_poll_cq(num_entries=size of CQ)

Now the last ibv_poll_cq blocks because we drained the CQ on the previous call.

Any thoughts?
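
One way to reason about the sequence above is a toy model of an armed CQ. This is only a sketch under stated assumptions: every name in it (toy_cq, toy_req_notify, toy_complete, toy_get_event, toy_poll, toy_handle_wakeup) is hypothetical and nothing here calls libibverbs. It assumes events are single-shot, as described in the quoted reply, and that a poll on an empty CQ returns 0 rather than waiting, which is how the ibv_poll_cq man page describes the return value (the number of completions found, possibly zero).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in libibverbs. */
struct toy_cq {
    int  queued;  /* completions waiting on the CQ */
    int  events;  /* notifications waiting on the completion channel */
    bool armed;   /* set by toy_req_notify, cleared when an event fires */
};

/* Models ibv_req_notify_cq: arm the CQ for exactly one future event. */
void toy_req_notify(struct toy_cq *cq) { cq->armed = true; }

/* Models the hardware enqueuing one completion. */
void toy_complete(struct toy_cq *cq)
{
    cq->queued++;
    if (cq->armed) {          /* single-shot: fire once, then disarm */
        cq->armed = false;
        cq->events++;
    }
}

/* Models ibv_get_cq_event on a nonblocking channel: true if one was consumed. */
bool toy_get_event(struct toy_cq *cq)
{
    if (cq->events == 0)
        return false;
    cq->events--;
    return true;
}

/* Models ibv_poll_cq: returns up to max completions, 0 if the CQ is empty. */
int toy_poll(struct toy_cq *cq, int max)
{
    assert(max > 0);
    int n = cq->queued < max ? cq->queued : max;
    cq->queued -= n;
    return n;
}

/* One wake-up of the event loop: consume the event, re-arm, then drain. */
int toy_handle_wakeup(struct toy_cq *cq, int batch)
{
    int total = 0, n;
    if (!toy_get_event(cq))
        return 0;
    toy_req_notify(cq);
    while ((n = toy_poll(cq, batch)) > 0)  /* loop until the CQ reports empty */
        total += n;
    return total;
}
```

In this model, 100 completions on a CQ armed once produce a single event but 100 polled entries, and a wake-up whose completions were already drained by the previous pass simply sees a poll result of 0. Draining in a loop until the poll returns 0 also means the batch size does not have to equal the CQ size.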

> In any case, it is certainly fine to do poll(), epoll, SIGIO, whatever
> on the completion channel file descriptor.  You can do fcntl to set it
> to nonblocking mode too.
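
A minimal sketch of the fcntl and poll() part, using plain POSIX calls. The helper names set_nonblocking and wait_readable are hypothetical; the descriptor passed in is assumed to be the fd member of the struct ibv_comp_channel, but any readable descriptor behaves the same way for the purposes of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Put a descriptor into nonblocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc;

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc == -1 && errno == EINTR);  /* restart if a signal interrupted us */

    if (rc <= 0)
        return rc;
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

Once wait_readable reports 1 on the channel descriptor, the loop would call ibv_get_cq_event, re-arm with ibv_req_notify_cq and then drain the CQ.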

Thanks.

Regards

Albert