Hello all, <br><br>First, the background: I am writing a Linux device driver to provide
iWARP device support for our hardware. I'm currently running kernel
2.6.20-rc4 and OFED-1.2-rc2. I realize these are somewhat old, but I
have examined newer source, and haven't found any changes that seem
immediately relevant.
<br>
<br>I am experiencing the following behavior:
<br>
<br>rping -s .... (server starts fine, loads proper user-space library, etc)
<br>
<br>rping -c ... (client starts fine, ... connects to server, and exchanges
data successfully)
<br>So far so good.
<br>
<br>If I interrupt the rping client with CTRL-C, then the client hangs hard.
<br><br>I have, I believe, traced this to a deadlock between ib_destroy_qp() and
ucma_close().
It looks like librdmacm defines an __attribute__((destructor)) function that
results in a call to ibv_close_device() and ultimately in
<device>::destroy_qp(). That seems reasonable, and it all happens as
the OS unloads the application.
<br><br>However, it is (I believe) happening before the "rdma_cm" device file
descriptor is 'closed' by the OS as the application terminates.
[rdma_destroy_event_channel() would normally do this, but it doesn't get
called when the application is interrupted by SIGINT.]
<br>
<br>Our driver (like every other driver I've seen) does an
atomic_dec(&qp->refcount) in 'destroy_qp()' and then calls wait_event() to
sleep until the refcount drops to zero.
Because the rdma_cm device hasn't been closed (i.e., ucma_close() hasn't
yet been called), a cm_id still holds a reference to the qp, so the
wait_event() blocks.
<br>
<br>So, the application cleanup process is blocked, essentially waiting for
kernel::ucma_close() to be called ... which won't happen because the
application unload code is blocked in destroy_qp() ==> deadlock.
<br>
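<br>To make the ordering problem concrete, here is a minimal user-space sketch of the reference counting described above. It is not driver code: the names model_qp, model_cm_id, model_cm_bind_qp(), model_ucma_close(), model_destroy_qp() and the two teardown functions are all hypothetical, and instead of sleeping in wait_event() the model simply reports whether the wait would complete. It assumes the qp starts with one reference for its creator and that a bound cm_id holds one more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these are real kernel types. */
struct model_qp {
    int refcount;            /* 1 for the creator + 1 per cm_id bound to it */
};

struct model_cm_id {
    struct model_qp *qp;     /* reference held until the cm_id is destroyed */
};

/* Models a cm_id taking a reference on the qp when the connection is set up. */
static void model_cm_bind_qp(struct model_cm_id *id, struct model_qp *qp)
{
    id->qp = qp;
    qp->refcount++;
}

/* Models ucma_close(): destroying the cm_id drops its qp reference. */
static void model_ucma_close(struct model_cm_id *id)
{
    if (id->qp != NULL) {
        assert(id->qp->refcount > 0);
        id->qp->refcount--;
        id->qp = NULL;
    }
}

/*
 * Models destroy_qp(): drop the creator's reference, then "wait" for zero.
 * Instead of sleeping, return true if the wait would finish immediately
 * and false if it would block.
 */
static bool model_destroy_qp(struct model_qp *qp)
{
    assert(qp->refcount > 0);
    qp->refcount--;
    return qp->refcount == 0;
}

/* Teardown order I believe I am seeing: verbs cleanup before rdma_cm close. */
bool model_teardown_verbs_first(void)
{
    struct model_qp qp = { .refcount = 1 };
    struct model_cm_id id = { .qp = NULL };

    model_cm_bind_qp(&id, &qp);
    return model_destroy_qp(&qp);   /* cm_id still holds a reference */
}

/* Opposite order: the rdma_cm side is closed first, then the qp is destroyed. */
bool model_teardown_cm_first(void)
{
    struct model_qp qp = { .refcount = 1 };
    struct model_cm_id id = { .qp = NULL };

    model_cm_bind_qp(&id, &qp);
    model_ucma_close(&id);
    return model_destroy_qp(&qp);
}
```

<br>In this model, model_teardown_verbs_first() returns false (the wait could only finish if something else dropped the cm_id's reference), while model_teardown_cm_first() returns true. That matches the behavior I think I'm seeing, but it is only a model of my understanding, not a trace of the real code paths.
<br>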
<br>First, does my analysis make sense? <br> <br>Perhaps my device driver should do additional work in ib_destroy_qp()
that will trigger the destruction of the cm_id... [but that doesn't seem
consistent with other drivers I've seen.]
<br><br>Perhaps the application (i.e., librdmacm) should make sure the "rdma_cm"
device is closed before calling ibv_close_device()?
<br><br>I'm just not sure if this is a driver issue, an application issue, or
something in between.
<br>Also, I don't have access to any other iWARP hardware, so I can't test
this scenario in a different environment...
<br>
<br>Any help/advice would be greatly appreciated!
<br>
<br>Thanks for your time,
<br>--Clark Tucker
<br>
<br>