[ofa-general] rping / librdmacm deadlock question

Steve Wise swise at opengridcomputing.com
Wed Jul 18 12:33:33 PDT 2007


Clark Tucker wrote:
> Hello all,
> 
> First, the background: I am writing a Linux device driver to provide 
> iWARP device support for our hardware.  I'm currently running kernel 
> 2.6.20-rc4 and OFED-1.2-rc2.  I realize these are somewhat old, but I 
> have examined newer source, and haven't found any changes that seem 
> immediately relevant.
> 
> I am experiencing the following behavior:
> 
> rping -s ....  (server starts fine, loads proper user-space library, etc)
> 
> rping -c ... (client starts fine, ... connects to server, and exchanges 
> data successfully)
> So far so good.
> 
> If I interrupt the rping client with CTRL-C, then the client hangs hard.
> 
> I have, I believe, traced this to a deadlock between ib_destroy_qp() and 
> ucma_close(). It looks like librdmacm has an __attribute__((destructor)) 
> function defined that results in a call to ibv_close_device() and 
> ultimately in <device>::destroy_qp().  That seems reasonable, and it all 
> happens as the OS unloads the application.
> 
> However, it is (I believe) happening before the "rdma_cm" device file 
> descriptor is 'closed' by the OS as the application terminates.  
> [rdma_destroy_event_channel() would normally do this, but it doesn't get 
> called when the application is interrupted by SIGINT.]
> 
> Our driver (as do all drivers I've seen) performs an 
> atomic_dec(&qp->refcount) and wait_event(&qp->refcount) in 
> 'destroy_qp()'.   Because the rdma_cm device hasn't been closed (i.e., 
> ucma_close() hasn't yet been called), a cm_id still has an active 
> reference to the qp, and the wait_event() will end up 'wait'ing.
> 

Your destroy_qp() method must tear down the active RDMA connection, which 
forces the iwcm to release its reference on the qp.  If you look at the 
Chelsio driver, you'll see this is done before waiting for the refcnt to 
go to zero:

From iwch_destroy_qp():

>         attrs.next_state = IWCH_QP_STATE_ERROR;
>         iwch_modify_qp(rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 0);
>         wait_event(qhp->wait, !qhp->ep);


Once the qhp->ep handle has been disassociated from the qp, the driver 
knows the iwcm has been given the CLOSE event and has removed its 
reference on the qp.  Here is the iwcm close event handler.  Note that it 
drops the ref:

From cm_close_handler():

>         if (cm_id_priv->qp) {
>                 cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp);
>                 cm_id_priv->qp = NULL;
>         }


The driver can then drop its own reference and wait for any remaining 
references, such as those held by interrupt handlers, to go away:

>         atomic_dec(&qhp->refcnt);
>         wait_event(qhp->wait, !atomic_read(&qhp->refcnt));
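
To make the ordering concrete, here is a rough user-space sketch of the 
reference counting only.  Everything named model_* is hypothetical and 
made up for illustration; it is not kernel, iwcm, or Chelsio code, and 
the real wait_event() calls are replaced by a return value that says 
whether the final wait condition would already be true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; not the kernel structures. */
struct model_ep { int unused; };

struct model_qp {
    int refcnt;           /* 1 for the driver + 1 per outside holder */
    struct model_ep *ep;  /* connection endpoint, NULL once closed   */
};

struct model_cm_id {
    struct model_qp *qp;  /* the reference the connection manager holds */
};

static void model_rem_ref(struct model_qp *qp)
{
    assert(qp->refcnt > 0);
    qp->refcnt--;
}

/* Mirrors the shape of the quoted cm_close_handler() fragment. */
static void model_cm_close_handler(struct model_cm_id *id)
{
    if (id->qp) {
        model_rem_ref(id->qp);
        id->qp = NULL;
    }
}

/* Stand-in for moving the QP to ERROR: the connection is torn down
 * and the connection manager is handed the CLOSE event. */
static void model_qp_to_error(struct model_qp *qp, struct model_cm_id *id)
{
    if (qp->ep) {
        qp->ep = NULL;
        model_cm_close_handler(id);
    }
}

/* Returns true if the final "refcnt == 0" wait would complete,
 * false if destroy would block on a reference nobody will drop. */
static bool model_destroy_qp(struct model_qp *qp, struct model_cm_id *id,
                             bool close_first)
{
    if (close_first)
        model_qp_to_error(qp, id);
    model_rem_ref(qp);    /* drop the driver's own reference */
    return qp->refcnt == 0;
}
```

With a QP that starts at refcnt 2 (driver plus cm_id), calling 
model_destroy_qp() with close_first set reaches zero, while skipping the 
close step leaves the cm_id's reference outstanding, which is the hang 
described above.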


> Perhaps my device driver should do additional work in ib_destroy_qp() 
> that will trigger the destruction of the cm_id... [but that doesn't seem 
> consistent with other drivers I've seen.]
>

Are you looking at the Chelsio or Ammasso iWARP drivers?  This code is 
all iWARP specific...


Hope this helps...


Steve


