[ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroying listen requests

Sean Hefty sean.hefty at intel.com
Wed Oct 10 23:48:56 PDT 2007


>cma_process_remove() -> cma_remove_id_dev() generates the event for
>device removal. This is ok to do as long as it can be guaranteed that a
>racing rdma_destroy_id() has not returned back to caller, correct?
>
>IE, the caller must be willing to accept device removal events until its
>rdma_destroy_id() returns.

Correct - rdma_destroy_id() blocks until all callbacks from the rdma_cm have
completed.

>If so, why is cma_remove_id_dev() trying so hard to not generate the
>event when rdma_destroy_id() has gotten to the point of setting
>CMA_DESTROYING? Could it not just generate the event, happy in the
>knowledge that the refcount bump done by cma_process_remove() will
>prevent the rdma_destroy_id() call from returning?

There are two ways for the user to destroy an rdma_cm_id.  They can either call
rdma_destroy_id() directly or return a non-zero value from a callback.  In order
to support the latter, all callbacks to a user on the same rdma_cm_id must be
serialized, and once the user has returned a non-zero value no further callbacks
can occur.  (Otherwise the user wouldn't know when it was safe to deallocate
their connection context.)
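
As a rough illustration of the rule above, here is a small user-space model
(not kernel code).  Every name in it (model_id, model_deliver,
model_stop_on_two) is made up for the sketch and is not part of the rdma_cm
API.  It only shows that once a callback returns non-zero, nothing else is
delivered on that id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are rdma_cm API. */
struct model_id;
typedef int (*model_handler)(struct model_id *id, int event);

struct model_id {
    model_handler handler;  /* user's event callback */
    void *context;          /* user's connection context */
    bool destroyed;         /* set once the user asks for destruction */
    int delivered;          /* number of callbacks actually invoked */
};

/*
 * Deliver one event.  Returns true if the callback ran.  A non-zero
 * return from the callback destroys the id, so every later event is
 * dropped instead of being delivered.
 */
static bool model_deliver(struct model_id *id, int event)
{
    assert(id != NULL && id->handler != NULL);
    if (id->destroyed)
        return false;
    id->delivered++;
    if (id->handler(id, event) != 0)
        id->destroyed = true;   /* user may now free its context */
    return true;
}

/* Example callback: asks for destruction when it sees event 2. */
static int model_stop_on_two(struct model_id *id, int event)
{
    (void)id;
    return event == 2 ? 1 : 0;
}
```

In this model, feeding events 1, 2, 3 to an id using model_stop_on_two runs
the callback for 1 and 2 only; event 3 is dropped because the id is already
marked destroyed.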
 
Since a device removal can occur at any point, the device removal callback must
be serialized with any other callback in progress.  The removal code does this
by first marking that the device has been removed.  This prevents any new
callbacks from being invoked, but a callback may already be in progress, so the
removal code waits for that callback to complete.  After it completes, the
removal code needs to see if the user wants to destroy the rdma_cm_id, meaning
they returned a non-zero value from that in-progress callback.  If so, then the
device removal callback cannot be invoked.
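
The sequence described above can be sketched as a single-threaded state
model.  This is a simplification under stated assumptions, not the actual
cma_remove_id_dev() code: the rm_* names and the three states are invented
for the example, and the real code blocks while waiting where the model
merely asserts that no callback is still running.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states for a simplified model, not the kernel's enum. */
enum rm_state { RM_CONNECTED, RM_DEVICE_REMOVAL, RM_DESTROYING };

struct rm_id {
    enum rm_state state;
    int callbacks_in_progress;   /* callbacks currently running */
    int removal_events;          /* removal callbacks delivered */
};

/* Step 1: mark the device as removed, which blocks new callbacks. */
static bool rm_begin_removal(struct rm_id *id)
{
    if (id->state == RM_DESTROYING)
        return false;            /* user is already destroying the id */
    id->state = RM_DEVICE_REMOVAL;
    return true;
}

/* Step 2: a callback that was already running returns `ret`. */
static void rm_callback_returns(struct rm_id *id, int ret)
{
    assert(id->callbacks_in_progress > 0);
    id->callbacks_in_progress--;
    if (ret != 0)
        id->state = RM_DESTROYING;   /* user asked to destroy the id */
}

/* Step 3: deliver the removal event only if nobody asked to destroy. */
static bool rm_finish_removal(struct rm_id *id)
{
    assert(id->callbacks_in_progress == 0);  /* real code would wait here */
    if (id->state != RM_DEVICE_REMOVAL)
        return false;            /* destroyed meanwhile: no removal event */
    id->removal_events++;
    return true;
}
```

If the in-progress callback returns non-zero between steps 1 and 3,
rm_finish_removal() reports that no removal event was delivered; if it
returns zero, the removal event goes out exactly once.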

One other point is that all event callbacks for a given rdma_cm_id end up being
serialized by default.  Only the device removal event requires special handling,
since that thread can run at any time.  If you look at some of the callback
handlers (named *_handler), you'll see calls to disable/enable remove, which
provide this serialization.
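
To make the handler side concrete, here is a minimal model of that bracket.
It assumes the disable/enable pair behaves roughly like a state check plus a
counter; the br_* names, the two states, and the counter field are
illustrative only and are not copied from the cma code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states and fields, chosen only for this sketch. */
enum br_state { BR_CONNECT, BR_DEVICE_REMOVAL };

struct br_id {
    enum br_state state;
    int dev_remove;      /* handlers currently holding off removal */
};

/* Enter a handler: fails if the id has left the expected state. */
static int br_disable_remove(struct br_id *id, enum br_state expected)
{
    if (id->state != expected)
        return -1;
    id->dev_remove++;
    return 0;
}

/* Leave a handler: returns true when a waiting removal may proceed. */
static bool br_enable_remove(struct br_id *id)
{
    assert(id->dev_remove > 0);
    return --id->dev_remove == 0;
}

/* Shape of a *_handler: bracket the user callback with the two calls. */
static int br_handler(struct br_id *id, int *calls)
{
    if (br_disable_remove(id, BR_CONNECT))
        return 0;                /* removal already marked: skip callback */
    (*calls)++;                  /* stands in for the user's callback */
    br_enable_remove(id);
    return 1;
}
```

A handler that arrives after the removal has been marked fails the state
check and never calls the user; one that arrives earlier holds the counter
above zero until its callback has returned, which is what the removal path
would wait on.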

- Sean


