[ofa-general] [PATCH] Fix racy deadlock in cma

Kanoj Sarcar kanoj at netxen.com
Wed Oct 3 11:44:25 PDT 2007


Roland Dreier wrote:

>I'll leave it to Sean and others who know the cma locking better than
>I do to comment on the patch in detail, but a few notes:
>
> - your patch is completely whitespace mangled so it would have to be
>   applied by hand.  please look into configuring your mail client so
>   that it can send patches without corrupting them.
>
> - patches should be generated so they apply with 'patch -p1', so
>   rather than what you have:
>
> > --- drivers/infiniband/core/cma.c       2006-12-13 17:14:23.000000000 -0800
> > +++ /tmp/cma.c  2007-10-03 00:48:32.000000000 -0700
>
>   the paths should be more like
>
>    --- a/drivers/infiniband/core/cma.c
>    +++ b/drivers/infiniband/core/cma.c
>  
>

Sorry, its been a while since I posted a patch. I am attaching a copy of 
it (not sure if patch attachments are ok), generated with "diff -Naurp".

> - I wonder which tree you generated your patch against: it seems to
>   be modifying cma_destroy_listen(), but you have the header:
>
> > @@ -624,6 +624,7 @@
>
>   and cma_destroy_listen() is nowhere near line 624 in my tree.  (And
>   it would be nice to use the '-p' option of diff to put the function
>   name there for easier reviewing)
>  
>
This was against 2.6.20.

> - Your comment doesn't make it clear to me that dropping and
>   reacquiring the lock is safe; can you explain why nothing else
>   could come along while the lock is dropped and mess things up?
>
>   It seems rdma_destroy_id() has the same pattern, but it's not clear
>   to me in the code:
>
>	mutex_lock(&lock);
>	if (id_priv->cma_dev) {
>		mutex_unlock(&lock);
>		// why can't the device be hot-unplugged here??
>		switch (rdma_node_get_transport(id->device->node_type)) {
>
>   what guarantees that the device does not disappear before it is
>   dereferenced in the switch statement.  This would be a separate bug
>   but we probably shouldn't introduce another instance of it
>   (assuming I'm correct).
>
> - R.
>
>  
>
Yes, rdma_destroy_id() has the same thing, where the lock is dropped; 
that was the inspiration for this fix too. I believe that by setting the 
cmid state to CMA_DESTROYING and still keeping it on the device's list, 
it should be ok to drop the lock and have a racing cma_process_remove() 
silently ignore this cmid.

Also notice that the destroying thread has to do a cma_detach_from_dev() 
to dec the refcount on the device before the device structure can be 
freed up.

By no means do I understand all the intricacies of the cma code, 
hopefully Sean/others will review and comment.

Thanks.

Kanoj


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pp
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20071003/197042b1/attachment.ksh>


More information about the general mailing list