[libfabric-users] verbs WAIT_FD signalling at all times
Carlo Alberto Gottardo
carlo.gottardo at cern.ch
Wed Jun 30 08:31:09 PDT 2021
Dear Sean,
thank you for your previous message.
I followed your advice and I created a structure that keeps track of all the completion queues and their fid.
Then I wrapped epoll_wait in an fi_trywait check:
while(1){
    if( fi_trywait(evloop->pfids.fabric, evloop->pfids.fid_set, evloop->pfids.count) == FI_SUCCESS ){
        nevents = epoll_wait(evloop->epollfd, evloop->events, MAX_EPOLL_EVENTS, EPOLL_TIMEOUT);
    }
    ...
    for(int i=0; i<nevents; i++){ process_event(...
}
However, fi_trywait calls ibv_get_cq_event via vrb_cq_trywait, and that call seems to block (i.e. interrupting the program in GDB shows that it is stuck there).
Following [1], I tried to make it non-blocking by setting the O_NONBLOCK flag on the FD before adding it to epoll, but this had no effect.
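For reference, a minimal sketch of how the O_NONBLOCK flag can be set and verified on a descriptor with fcntl. The helper names (set_nonblocking, is_nonblocking, read_errno_on_empty_pipe) are hypothetical, and an empty pipe stands in for the CQ fd; this only checks that the flag is applied correctly and says nothing about how the verbs provider handles the completion channel internally.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Adds O_NONBLOCK to fd while preserving its other status flags.
 * Returns 0 on success, -1 on failure (errno set by fcntl). */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    if (flags & O_NONBLOCK)
        return 0;                       /* already non-blocking */
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Returns 1 if O_NONBLOCK is set on fd, 0 if not, -1 on error. */
int is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    return flags < 0 ? -1 : (flags & O_NONBLOCK) != 0;
}

/* Self-check on an empty pipe (a stand-in for the CQ fd, an assumption):
 * returns the errno produced by read() once the read end is non-blocking,
 * or -1 if the setup failed. */
int read_errno_on_empty_pipe(void)
{
    int p[2];
    char c;
    int err = -1;

    if (pipe(p) < 0)
        return -1;
    if (set_nonblocking(p[0]) == 0 && is_nonblocking(p[0]) == 1) {
        ssize_t n = read(p[0], &c, 1);  /* nothing has been written yet */
        err = (n < 0) ? errno : 0;
    }
    close(p[0]);
    close(p[1]);
    return err;
}
```

Calling is_nonblocking(cqfd) right before the epoll registration is a cheap way to confirm that the flag really is set on the descriptor handed to epoll.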
My application is based on a single event loop (the epoll) running on a single thread and executing non-blocking functions so I wouldn't expect polling to happen elsewhere.
Is there something wrong in what I am doing?
Thank you very much for your help,
Carlo
[1] https://www.rdmamojo.com/2013/03/09/ibv_get_cq_event/
_____________________
Carlo A. Gottardo
Postdoc at Nikhef
On 29 Jun 2021, at 18:53, Hefty, Sean <sean.hefty at intel.com> wrote:
Background:
With verbs devices there are 2 queues in play here. The first is associated with the fd, where low-level events are in the kernel. This event is generated in response to an interrupt from the device. In order to limit how many interrupts the device generates, the device must be manually reset before it will generate another event. The event indicates which CQ had entries added to it.
The second queue contains the completion entries themselves. That is what fi_cq_read is accessing.
The reason the fd remains signaled is that the kernel event is never being read.
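The same level-triggered behaviour can be reproduced without any RDMA hardware. The sketch below is only an analogy: an eventfd stands in for the completion-channel fd (an assumption), and the names demo_result and run_demo are hypothetical. It shows that epoll keeps reporting the fd as readable until the pending event is actually consumed.

```c
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

struct demo_result {
    int before_read;   /* polls (out of 3) that reported the fd readable */
    int after_read;    /* epoll_wait result once the event was consumed */
};

/* Returns 0 on success, -1 on any setup failure. */
int run_demo(struct demo_result *res)
{
    struct epoll_event ev, out;
    uint64_t val = 1;
    int rc = -1;
    int efd = eventfd(0, EFD_NONBLOCK);   /* stand-in for the CQ fd */
    int epfd = epoll_create1(0);

    if (efd < 0 || epfd < 0)
        goto done;

    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;                   /* level-triggered by default */
    ev.data.fd = efd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) < 0)
        goto done;

    /* Queue one event, as the kernel would after a device interrupt. */
    if (write(efd, &val, sizeof val) != (ssize_t)sizeof val)
        goto done;

    /* Poll without consuming the event: the fd is reported every time. */
    res->before_read = 0;
    for (int i = 0; i < 3; i++)
        res->before_read += (epoll_wait(epfd, &out, 1, 0) == 1);

    /* Consume the event; only now does the fd stop signalling. */
    if (read(efd, &val, sizeof val) != (ssize_t)sizeof val)
        goto done;
    res->after_read = epoll_wait(epfd, &out, 1, 0);
    rc = 0;
done:
    if (efd >= 0)
        close(efd);
    if (epfd >= 0)
        close(epfd);
    return rc;
}
```

With the verbs provider the application does not read() the fd itself; per this thread, the kernel event is consumed inside the provider when fi_trywait is called.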
If an application wants to wait directly on a wait object (fd) using OS specific calls (select/poll), it needs to call fi_trywait() prior to blocking.
https://ofiwg.github.io/libfabric/v1.12.1/man/fi_poll.3.html
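A minimal sketch of the loop shape this implies, kept free of libfabric headers so that it compiles on its own. The loop_ops struct and its callbacks are hypothetical stand-ins: in a real program try_wait would wrap fi_trywait() (returning 0 for FI_SUCCESS) and drain would call fi_cq_read() until it returns -FI_EAGAIN. The point is that the blocking call is skipped when it is not safe to block, and the CQ is read in either case.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 16

/* Hypothetical stand-ins for the libfabric calls (assumptions, see above). */
struct loop_ops {
    int  (*try_wait)(void *ctx);  /* 0: safe to block; nonzero: do not block */
    void (*drain)(void *ctx);     /* read completions without blocking */
    void *ctx;
};

/* One loop iteration. Returns the number of fd events seen (0 if the
 * blocking call was skipped), or -1 if epoll_wait failed. */
int loop_once(int epfd, const struct loop_ops *ops, int timeout_ms)
{
    struct epoll_event events[MAX_EVENTS];
    int nevents = 0;  /* reset each time so stale events are never replayed */

    if (ops->try_wait(ops->ctx) == 0) {
        nevents = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
        if (nevents < 0)
            return -1;
    }
    /* Woken by the fd, timed out, or told not to block: read the CQ. */
    ops->drain(ops->ctx);
    return nevents;
}

/* Fakes used only to exercise the "do not block" path. */
static int fake_busy(void *ctx) { (void)ctx; return 1; }
static void fake_drain(void *ctx) { ++*(int *)ctx; }

/* Returns 1 if the blocking call was skipped and drain ran exactly once. */
int demo_skip_blocking(void)
{
    int drained = 0;
    struct loop_ops ops = { fake_busy, fake_drain, &drained };
    int epfd = epoll_create1(0);
    int n;

    if (epfd < 0)
        return -1;
    n = loop_once(epfd, &ops, -1);  /* -1 would block forever if reached */
    close(epfd);
    return n == 0 && drained == 1;
}
```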
- Sean
Using the verbs provider, I use the FI_WAIT_FD wait object for the completion queue (CQ).
The resulting file descriptor (FD), associated with the libibverbs completion channel, is added to an epoll instance that waits for EPOLLIN.
The FD signal triggers the callback, where fi_cq_read reads the completions in a non-blocking way.
The problem is that, as soon as a connection is established, the FD keeps signalling as fast as the CPU allows, even if there is no data transfer.
As a matter of fact, after the first call, fi_cq_read keeps returning EAGAIN, a sign that there is no CQ entry to read.
I would expect the fd not to signal if there's nothing to read in the CQ. Shouldn't this be the case?
I read some libibverbs documentation and this point still remains unclear to me.
Below I post the CQ attributes and some function calls.
Thank you very much for your help,
Carlo
_____________________
Carlo A. Gottardo
Postdoc at Nikhef
Skype: carlogottardo
System: Libfabric 1.12.1 / CentOS 7 / Mellanox ConnectX-5
CQ attributes
struct fi_cq_attr cq_attr;
cq_attr.size = MAX_CQ_ENTRIES;
cq_attr.flags = 0;
cq_attr.format = FI_CQ_FORMAT_DATA;
cq_attr.wait_obj= FI_WAIT_FD;
cq_attr.signaling_vector = 0;
cq_attr.wait_cond = FI_CQ_COND_NONE;
cq_attr.wait_set = NULL;
the queue is opened and bound with
fi_cq_open(rsocket->domain, &cq_attr, &rsocket->cq, NULL);
fi_ep_bind(rsocket->ep, &rsocket->cq->fid, FI_TRANSMIT|FI_RECV);
when the connection is established the wait object is retrieved with
fi_control(&socket->cq->fid, FI_GETWAIT, &socket->cqfd)
the file descriptor is assigned a callback
socket->cq_ev_ctx.fd = socket->cqfd;
socket->cq_ev_ctx.data = socket;
socket->cq_ev_ctx.cb = on_recv_socket_cq_event;
and finally the file descriptor is added to the main and only epoll of the application, which waits for EPOLLIN.
In the on_recv_socket_cq_event callback:
fi_cq_read(socket->cq, &completion_entries, N);
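The registration and dispatch steps described above can be sketched with plain epoll. The struct ev_ctx below is modelled on the fd/data/cb fields of cq_ev_ctx, but its layout, the callback signature, and the function names are assumptions; a pipe stands in for the CQ fd, and the callback reads one byte where the real one would call fi_cq_read.

```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-fd context, modelled on the cq_ev_ctx fields. */
struct ev_ctx {
    int   fd;
    void *data;
    void (*cb)(int fd, void *data);
};

/* Registers ctx->fd for EPOLLIN and stores ctx so the loop can find it. */
int register_ctx(int epfd, struct ev_ctx *ctx)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;
    ev.data.ptr = ctx;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, ctx->fd, &ev);
}

/* Waits once and invokes the callback of every ready descriptor. */
int dispatch_once(int epfd, int timeout_ms)
{
    struct epoll_event events[8];
    int n = epoll_wait(epfd, events, 8, timeout_ms);

    for (int i = 0; i < n; i++) {
        struct ev_ctx *ctx = events[i].data.ptr;
        ctx->cb(ctx->fd, ctx->data);
    }
    return n;
}

/* Stand-in callback: consumes one byte and counts the invocation. */
static void on_readable(int fd, void *data)
{
    char c;
    if (read(fd, &c, 1) == 1)
        ++*(int *)data;
}

/* Returns how many times the callback ran over two dispatch rounds. */
int demo_dispatch(void)
{
    int p[2], calls = 0, rc = -1;
    int epfd;

    if (pipe(p) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd >= 0) {
        struct ev_ctx ctx = { p[0], &calls, on_readable };
        if (register_ctx(epfd, &ctx) == 0 && write(p[1], "x", 1) == 1) {
            dispatch_once(epfd, 0);  /* one byte pending: callback runs */
            dispatch_once(epfd, 0);  /* pipe drained: nothing dispatched */
            rc = calls;
        }
        close(epfd);
    }
    close(p[0]);
    close(p[1]);
    return rc;
}
```

In this stand-in the second round dispatches nothing because the callback consumed the pending data; the question in this thread is why the equivalent does not happen with the CQ fd.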