[ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroying listen requests

Kanoj Sarcar kanoj at netxen.com
Thu Oct 11 12:21:27 PDT 2007


Sean Hefty wrote:

>>cma_process_remove() -> cma_remove_id_dev() generates the event for
>>device removal. This is ok to do as long as it can be guaranteed that a
>>racing rdma_destroy_id() has not returned to the caller, correct?
>>
>>IE, the caller must be willing to accept device removal events until its
>>rdma_destroy_id() returns.
>>    
>>
>
>Correct - rdma_destroy_id() blocks until all callbacks from the rdma_cm have
>completed.
>
>  
>
>>If so, why is cma_remove_id_dev() trying so hard to not generate the
>>event when rdma_destroy_id() has gotten to the point of setting
>>CMA_DESTROYING? Could it not just generate the event, happy in the
>>knowledge that the refcount bump done by cma_process_remove() will
>>prevent the rdma_destroy_id() call from returning?
>>    
>>
>
>There are two ways for the user to destroy an rdma_cm_id.  They can either call
>rdma_destroy_id() directly or return a non-zero value from a callback.  In order
>to support the latter, all callbacks to a user on the same rdma_cm_id must be
>serialized, and once the user has returned a non-zero value no further callbacks
>can occur.  (Otherwise the user wouldn't know when it was safe to deallocate
>their connection context.)
> 
>Since a device removal can occur at any point, the device removal callback must
>be serialized with any other callback in progress.  It does this by marking that
>the device has been removed.  This prevents any new callbacks from being
>invoked, but a callback may already be in progress.  The device removal code
>waits for that callback to complete.  After it completes, it needs to see if the
>user wants to destroy the rdma_cm_id - meaning they returned a non-zero value
>from the first callback.  If so, then the device removal callback cannot be
>invoked.
>
>One other point is that all event callbacks for a given rdma_cm_id end up being
>serialized by default.  Only the device removal event requires special handling,
>since that thread can run at any time.  If you look at some of the callback
>handlers (named *_handler), you'll see calls to disable/enable remove, which
>provides this serialization.
>
>- Sean
>
>  
>
OK, thanks. I see how CMA_DESTROYING is used to correctly implement 
callback-initiated destruction.
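A minimal sketch of the serialization Sean describes, written as userspace C with POSIX threads (compile with `-pthread`). It is a simplified model under stated assumptions, not the rdma_cm code: every name in it (`struct model_id`, `deliver_event()`, `device_remove()`, `model_destroy_id()`, `handler_mutex`, the `ID_*` states) is hypothetical. `handler_mutex` stands in for the disable/enable-remove serialization, `ID_DESTROYING` plays the role of CMA_DESTROYING, and the `id_get()` in `device_remove()` stands in for the refcount bump that cma_process_remove() takes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one rdma_cm_id; none of these names are rdma_cm API. */
enum id_state { ID_ACTIVE, ID_DEVICE_REMOVAL, ID_DESTROYING };
enum event_type { EV_CONNECT_REQUEST, EV_DEVICE_REMOVAL };
struct model_id;
typedef int (*event_handler)(struct model_id *id, enum event_type ev); /* non-zero: destroy me */

struct model_id {
    pthread_mutex_t handler_mutex; /* serializes all user callbacks on this id */
    pthread_mutex_t lock;          /* protects state and refcount */
    pthread_cond_t released;       /* signalled when refcount reaches zero */
    enum id_state state;
    int refcount;
    event_handler handler;
};

void model_id_init(struct model_id *id, event_handler handler)
{
    pthread_mutex_init(&id->handler_mutex, NULL);
    pthread_mutex_init(&id->lock, NULL);
    pthread_cond_init(&id->released, NULL);
    id->state = ID_ACTIVE;
    id->refcount = 1; /* the user's reference, dropped by model_destroy_id() */
    id->handler = handler;
}

static void id_get(struct model_id *id)
{
    pthread_mutex_lock(&id->lock);
    id->refcount++;
    pthread_mutex_unlock(&id->lock);
}

static void id_put(struct model_id *id)
{
    pthread_mutex_lock(&id->lock);
    assert(id->refcount > 0);
    if (--id->refcount == 0)
        pthread_cond_broadcast(&id->released);
    pthread_mutex_unlock(&id->lock);
}

/* Ordinary event path; returns true if the user asked for destruction. */
bool deliver_event(struct model_id *id, enum event_type ev)
{
    int ret = 0;
    id_get(id);
    pthread_mutex_lock(&id->handler_mutex);
    pthread_mutex_lock(&id->lock);
    bool active = (id->state == ID_ACTIVE); /* no new callbacks after removal/destroy */
    pthread_mutex_unlock(&id->lock);
    if (active)
        ret = id->handler(id, ev);
    if (ret) {
        /* Mark while still serialized, so a waiting removal skips its callback. */
        pthread_mutex_lock(&id->lock);
        id->state = ID_DESTROYING;
        pthread_mutex_unlock(&id->lock);
    }
    pthread_mutex_unlock(&id->handler_mutex);
    id_put(id);
    return ret != 0;
}

/* Device removal may run at any time; the caller must ensure id is still alive. */
void device_remove(struct model_id *id)
{
    id_get(id); /* keeps a racing model_destroy_id() from returning */
    pthread_mutex_lock(&id->lock);
    if (id->state == ID_ACTIVE)
        id->state = ID_DEVICE_REMOVAL; /* block any new callbacks */
    pthread_mutex_unlock(&id->lock);
    pthread_mutex_lock(&id->handler_mutex); /* wait out a callback in progress */
    pthread_mutex_lock(&id->lock);
    bool notify = (id->state == ID_DEVICE_REMOVAL); /* false once destroy was requested */
    pthread_mutex_unlock(&id->lock);
    if (notify)
        (void)id->handler(id, EV_DEVICE_REMOVAL); /* return value ignored in this sketch */
    pthread_mutex_unlock(&id->handler_mutex);
    id_put(id);
}

/* Blocks until no other reference remains; never call it from inside a callback. */
void model_destroy_id(struct model_id *id)
{
    pthread_mutex_lock(&id->lock);
    id->state = ID_DESTROYING;
    id->refcount--; /* drop the user's reference */
    while (id->refcount > 0)
        pthread_cond_wait(&id->released, &id->lock);
    pthread_mutex_unlock(&id->lock);
    pthread_cond_destroy(&id->released);
    pthread_mutex_destroy(&id->lock);
    pthread_mutex_destroy(&id->handler_mutex);
}
```

The key ordering is that the removal path rechecks the state only after acquiring `handler_mutex`: if a callback in progress returned non-zero, it has already marked the id, so no removal callback follows the user's request to destroy. The same check also skips the callback when a racing `model_destroy_id()` got there first, while the reference taken by `device_remove()` still keeps that destroy from returning early.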

I don't understand the reason for callback-initiated destruction in the 
first place, but that's too far off topic...

With this new information, I will revisit the thread posted at 
http://lists.openfabrics.org/pipermail/general/2007-September/040614.html 
to see whether the problem discussed there really is nonexistent.

Kanoj


