[ofa-general] rping / librdmacm deadlock question

Clark Tucker clark.tucker at gmail.com
Wed Jul 18 11:18:10 PDT 2007


Thanks for the quick reply.  Comments below.

On 7/18/07, Roland Dreier <rdreier at cisco.com> wrote:
>
> > Our driver (as do all drivers I've seen) performs an
> > atomic_dec(&qp->refcount) and wait_event(&qp->refcount) in 'destroy_qp()'.
> > Because the rdma_cm device hasn't been closed (i.e., ucma_close() hasn't yet
> > been called), a cm_id still has an active reference to the qp, and the
> > wait_event() will end up 'wait'ing.
>
> In the other drivers I know well (basically mthca and mlx4, since I
> wrote them), the qp->refcount being waited for is an internal driver
> refcount, and is used to make sure that the destroy QP operation waits
> until any active interrupt handlers are done with the QP.  So I think
> the problem is that you are letting a cm_id bump the QP's reference
> count somehow.


I guess this is really only relevant for iWARP.  Other iWARP drivers I've
seen do an atomic_inc(&qp->refcount) in <device>::qp_add_ref(), which I
believe is called via cm_id->device->iwcm->add_ref() [for example, see
iwcm.c::iw_cm_connect()].
This reference is removed by a call to cm_id->device->iwcm->rem_ref() [for
example, see iwcm.c::destroy_cm_id()].

And, to avoid a deadlock, I still believe that the rem_ref() call must
happen _before_ ib_uverbs_close() [and ultimately ib_destroy_qp()] is called.
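
To make the ordering problem concrete, here is a toy user-space model of the
refcount (not kernel code, and not taken from any driver).  The toy_* names,
the initial refcount of 1, and the "would block" return value are all made up
for illustration; a real driver would use atomic ops and sleep in wait_event()
rather than return a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a driver's QP object. */
struct toy_qp {
    int refcount;   /* assumed: 1 for the QP itself, +1 per add_ref() */
};

static void toy_qp_init(struct toy_qp *qp)
{
    qp->refcount = 1;
}

/* Models <device>::qp_add_ref(), reached through iwcm->add_ref(). */
static void toy_qp_add_ref(struct toy_qp *qp)
{
    qp->refcount++;
}

/* Models the matching rem_ref(), e.g. when the cm_id is destroyed. */
static void toy_qp_rem_ref(struct toy_qp *qp)
{
    assert(qp->refcount > 0);
    qp->refcount--;
}

/*
 * Models destroy_qp(): drop the QP's own reference, then wait for the
 * count to reach zero.  Instead of sleeping, report whether the wait
 * would block.
 */
static bool toy_destroy_qp_would_block(struct toy_qp *qp)
{
    toy_qp_rem_ref(qp);
    return qp->refcount != 0;
}

/* QP destroyed while the cm_id still holds its reference. */
static bool destroy_before_cm_id_release(void)
{
    struct toy_qp qp;

    toy_qp_init(&qp);
    toy_qp_add_ref(&qp);                    /* cm_id takes a reference */
    return toy_destroy_qp_would_block(&qp); /* cm_id ref still held */
}

/* cm_id reference dropped first, then the QP is destroyed. */
static bool destroy_after_cm_id_release(void)
{
    struct toy_qp qp;

    toy_qp_init(&qp);
    toy_qp_add_ref(&qp);                    /* cm_id takes a reference */
    toy_qp_rem_ref(&qp);                    /* cm_id drops it */
    return toy_destroy_qp_would_block(&qp);
}
```

In this model the first ordering reports that the wait would block and the
second does not, which is the ordering dependency I am describing.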

> > Perhaps my device driver should do additional work in ib_destroy_qp() that
> > will trigger the destruction of the cm_id... [but that doesn't seem
> > consistent with other drivers I've seen.]
>
> That doesn't make sense.  I think it's OK if upper layers are left
> with a stale pointer to your QP -- let them worry about it.  Maybe
> it's an iWARP thing that I don't really understand (I'm much more
> familiar with the IB driver interface) but I don't think that the
> cxgb3 driver runs into this issue.
>
> > Perhaps the application (i.e., librdmacm) should make sure the "rdma_cm"
> > device is closed before calling ibv_device_close()?
>
> No, because then some other (possibly malicious) app could still cause
> the deadlock and potentially create a bunch of unkillable processes.


Very true...good point.

> - R.