[ofa-general] rping / librdmacm deadlock question

Clark Tucker clark.tucker at gmail.com
Wed Jul 18 10:42:11 PDT 2007


Hello all,

First, the background: I am writing a Linux device driver to provide iWARP
device support for our hardware.  I'm currently running kernel 2.6.20-rc4 and
OFED-1.2-rc2.  I realize these are somewhat old, but I have examined newer
source and haven't found any changes that seem immediately relevant.

I am experiencing the following behavior:

rping -s ....  (server starts fine, loads proper user-space library, etc)

rping -c ... (client starts fine, ... connects to server, and exchanges data
successfully)
So far so good.

If I interrupt the rping client with CTRL-C, then the client hangs hard.

I believe I have traced this to a deadlock between ib_destroy_qp() and
ucma_close(). It looks like librdmacm defines an __attribute__((destructor))
function that results in a call to ibv_close_device() and ultimately in
<device>::destroy_qp().  That seems reasonable, and it all happens as the
OS unloads the application.

However, it is (I believe) happening before the "rdma_cm" device file
descriptor is 'closed' by the OS as the application terminates.
[rdma_destroy_event_channel() would normally do this, but it doesn't get
called when the application is interrupted by SIGINT.]
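
For illustration, one way an application could get its normal teardown to
run on CTRL-C is to catch SIGINT and only set a flag that the main loop
checks. This is a minimal sketch, not what rping does: the names
stop_requested, on_sigint and install_sigint_handler are hypothetical, and
whether this actually avoids the hang described here is untested.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical flag that the application's main loop would poll. */
volatile sig_atomic_t stop_requested = 0;

/* Signal handler: do only async-signal-safe work, i.e. set the flag. */
void on_sigint(int signo)
{
    (void)signo;
    stop_requested = 1;
}

/* Install the handler for SIGINT. Returns 0 on success, -1 on error. */
int install_sigint_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    /*
     * SA_RESTART is deliberately not set, so a blocking system call in
     * the main loop can fail with EINTR and give the loop a chance to
     * look at stop_requested.
     */
    return sigaction(SIGINT, &sa, NULL);
}
```

Once the main loop sees stop_requested set, it could leave the loop and run
the usual cleanup (for example rdma_destroy_qp(), rdma_destroy_id() and
rdma_destroy_event_channel()) before returning from main().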

Our driver (like every other driver I've seen) does an
atomic_dec(&qp->refcount) in destroy_qp() and then a wait_event() until the
refcount drops to zero.  Because the rdma_cm device hasn't been closed (i.e.,
ucma_close() hasn't yet been called), a cm_id still holds an active reference
to the qp, so the wait_event() never returns.

So, the application cleanup process is blocked, essentially waiting for
kernel::ucma_close() to be called ... which won't happen because the
application unload code is blocked in destroy_qp()  ==> deadlock.
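
The ordering dependency described above can be modelled in user space. The
following is only a sketch under stated assumptions: the model_* types and
functions are hypothetical stand-ins, not the kernel structures, and instead
of actually sleeping the way wait_event() would, the destroy function just
reports whether it would block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a QP: 1 reference for the QP itself, plus one
 * for every cm_id that is bound to it. */
struct model_qp {
    int refcount;
};

/* Hypothetical model of a cm_id holding a reference on a QP. */
struct model_cm_id {
    struct model_qp *qp;   /* NULL once the reference has been dropped */
};

void model_qp_init(struct model_qp *qp)
{
    qp->refcount = 1;
}

void model_cm_id_bind(struct model_cm_id *id, struct model_qp *qp)
{
    assert(qp != NULL);
    id->qp = qp;
    qp->refcount++;
}

/* Stands in for the cleanup that closing the rdma_cm fd would trigger:
 * the cm_id gives up its reference on the QP. */
void model_cm_id_release(struct model_cm_id *id)
{
    if (id->qp != NULL) {
        id->qp->refcount--;
        id->qp = NULL;
    }
}

/* Stands in for the driver's destroy_qp(): drop the QP's own reference,
 * then report whether a real driver would sleep waiting for the count
 * to reach zero. Returns 1 if it would block, 0 otherwise. */
int model_destroy_qp_would_block(struct model_qp *qp)
{
    qp->refcount--;
    return qp->refcount != 0;
}

/* Order seen in the hang: QP destroyed while the cm_id still holds it. */
int teardown_verbs_first(void)
{
    struct model_qp qp;
    struct model_cm_id id;

    model_qp_init(&qp);
    model_cm_id_bind(&id, &qp);
    return model_destroy_qp_would_block(&qp);
}

/* Alternative order: the cm_id releases its reference first. */
int teardown_cm_first(void)
{
    struct model_qp qp;
    struct model_cm_id id;

    model_qp_init(&qp);
    model_cm_id_bind(&id, &qp);
    model_cm_id_release(&id);
    return model_destroy_qp_would_block(&qp);
}
```

In this model teardown_verbs_first() reports that destroy would block, while
teardown_cm_first() does not, which is the asymmetry the analysis above
relies on.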

First, does my analysis make sense?

Perhaps my device driver should do additional work in ib_destroy_qp() that
will trigger the destruction of the cm_id... [but that doesn't seem
consistent with other drivers I've seen.]

Perhaps the application (i.e., librdmacm) should make sure the "rdma_cm"
device is closed before calling ibv_close_device()?

I'm just not sure if this is a driver issue, an application issue, or
something in between.
Also, I don't have access to any other iWARP hardware, so I can't test this
scenario in a different environment...

Any help/advice would be greatly appreciated!

Thanks for your time,
--Clark Tucker