[ofa-general] Help - RDMA event files remain open after acknowledging them

Nitin Mehrotra nmehrotra at riorey.com
Fri Aug 14 08:42:42 PDT 2009


Folks,

This may be a newbie question but I can't seem to find the answer and I'm hoping someone can point me in the right direction.

I'm building an IB application where the two ends are required to robustly connect when present. Either of the ends may fail for extended periods of time and the other needs to handle this and reconnect when the peer recovers. The server is trivial since it passively listens for connections but the client is giving me some trouble.

I have used a model similar to the one described in http://linux.die.net/man/7/rdma_cm. The general connection flow on the client is rdma_create_id/rdma_resolve_addr/rdma_create_qp/rdma_resolve_route/rdma_connect, handling the events as appropriate. This works when the peer (server) is present. However, when the server is not present, I have observed that rdma_resolve_addr and rdma_resolve_route succeed (since the local HCA and SM are present) and then I get an RDMA_CM_EVENT_REJECTED or an RDMA_CM_EVENT_UNREACHABLE event. At this point I delete the IB resources allocated between steps 1 & 2 (QP, CQE, CQ, etc.) and restart from rdma_resolve_addr. As an aside, I found that I could not just restart from rdma_resolve_route: that returned error EINVAL, so I had to restart from rdma_resolve_addr.
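To make the retry flow above concrete, here is a minimal sketch of it as a small state machine. It deliberately does not call librdmacm: the enum and function names (client_state, client_event, client_on_event) are hypothetical and only for illustration. A real client would switch on the RDMA_CM_EVENT_* value delivered by rdma_get_cm_event and issue the corresponding rdma_* call for each state.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the RDMA_CM_EVENT_* values a real client
 * would receive from rdma_get_cm_event(). */
enum client_event {
    EV_ADDR_RESOLVED,
    EV_ROUTE_RESOLVED,
    EV_ESTABLISHED,
    EV_REJECTED,
    EV_UNREACHABLE
};

/* What the client is currently waiting for. */
enum client_state {
    WAIT_ADDR,     /* rdma_resolve_addr issued */
    WAIT_ROUTE,    /* QP created, rdma_resolve_route issued */
    WAIT_CONNECT,  /* rdma_connect issued */
    CONNECTED
};

struct client_action {
    enum client_state next;
    bool release_resources;  /* destroy QP, CQ, etc. before moving on */
};

/* Decide the next state for one event. Any event that is not the expected
 * one for the current state (including REJECTED and UNREACHABLE) sends the
 * client back to address resolution, as described in the text. Resources
 * only need releasing once the client is past WAIT_ADDR, because the QP is
 * created after the address has been resolved. */
struct client_action client_on_event(enum client_state cur,
                                     enum client_event ev)
{
    struct client_action act = { WAIT_ADDR, cur != WAIT_ADDR };

    if (cur == WAIT_ADDR && ev == EV_ADDR_RESOLVED) {
        act.next = WAIT_ROUTE;
        act.release_resources = false;
    } else if (cur == WAIT_ROUTE && ev == EV_ROUTE_RESOLVED) {
        act.next = WAIT_CONNECT;
        act.release_resources = false;
    } else if (cur == WAIT_CONNECT && ev == EV_ESTABLISHED) {
        act.next = CONNECTED;
        act.release_resources = false;
    }
    return act;
}
```

Keeping the decision in one function makes it easier to check that every failure path releases what the attempt allocated before a new attempt begins.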

The problem I am facing is that every RDMA event I receive (from uverbs, it appears) seems to create a special file that is linked to "infinibandevent:". See below. Even though I am careful to acknowledge every RDMA event I receive (one rdma_ack_cm_event for every rdma_get_cm_event), these files don't get closed or deleted. Eventually the application fails with error EMFILE (too many open files) when trying to create the completion event queue (as part of creating the QP).

What am I doing wrong? Is there something more I need to do than calling rdma_ack_cm_event after every rdma_get_cm_event to get these event files closed? As an FYI, I have even tried closing the rdma_id and destroying the event channel when the connection fails, to force the event files to be closed, without success.

Btw, this is a user-space application and I am using OFED 1.4.1 on Linux 2.6.27 (Gentoo). It should be irrelevant, but just in case: this is using a Mellanox HCA, and both peers are on a local subnet with only one IB interface per peer.

Thanks,

Nitin

filter-1 ib # ls -l /proc/8072/fd
total 0
lrwx------ 1 root root 64 Aug 14 06:44 0 -> /dev/pts/0
lrwx------ 1 root root 64 Aug 14 06:44 1 -> /dev/pts/0
lr-x------ 1 root root 64 Aug 14 06:44 10 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 11 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 12 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 13 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 14 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 15 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 16 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 17 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 18 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 19 -> infinibandevent:
lrwx------ 1 root root 64 Aug 14 06:44 2 -> /dev/pts/0
lr-x------ 1 root root 64 Aug 14 06:44 20 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 21 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 22 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 23 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 24 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 25 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 26 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 27 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 28 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 29 -> infinibandevent:
lrwx------ 1 root root 64 Aug 14 06:44 3 -> socket:[223603]
lr-x------ 1 root root 64 Aug 14 06:44 30 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 31 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 32 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 33 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 34 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 35 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 36 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 37 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 38 -> infinibandevent:
lr-x------ 1 root root 64 Aug 14 06:44 39 -> infinibandevent:

These grow until 999 files and then the app fails.


