[ofa-general] [PATCH] Fix racy deadlock in cma
Kanoj Sarcar
kanoj at netxen.com
Wed Oct 3 11:44:25 PDT 2007
Roland Dreier wrote:
>I'll leave it to Sean and others who know the cma locking better than
>I do to comment on the patch in detail, but a few notes:
>
> - your patch is completely whitespace mangled so it would have to be
> applied by hand. please look into configuring your mail client so
> that it can send patches without corrupting them.
>
> - patches should be generated so they apply with 'patch -p1', so
> rather than what you have:
>
> > --- drivers/infiniband/core/cma.c 2006-12-13 17:14:23.000000000 -0800
> > +++ /tmp/cma.c 2007-10-03 00:48:32.000000000 -0700
>
> the paths should be more like
>
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
>
>
Sorry, its been a while since I posted a patch. I am attaching a copy of
it (not sure if patch attachments are ok), generated with "diff -Naurp".
> - I wonder which tree you generated your patch against: it seems to
> be modifying cma_destroy_listen(), but you have the header:
>
> > @@ -624,6 +624,7 @@
>
> and cma_destroy_listen() is nowhere near line 624 in my tree. (And
> it would be nice to use the '-p' option of diff to put the function
> name there for easier reviewing)
>
>
This was against 2.6.20.
> - Your comment doesn't make it clear to me that dropping and
> reacquiring the lock is safe; can you explain why nothing else
> could come along while the lock is dropped and mess things up?
>
> It seems rdma_destroy_id() has the same pattern, but it's not clear
> to me in the code:
>
> mutex_lock(&lock);
> if (id_priv->cma_dev) {
> mutex_unlock(&lock);
> // why can't the device be hot-unplugged here??
> switch (rdma_node_get_transport(id->device->node_type)) {
>
> what guarantees that the device does not disappear before it is
> dereferenced in the switch statement. This would be a separate bug
> but we probably shouldn't introduce another instance of it
> (assuming I'm correct).
>
> - R.
>
>
>
Yes, rdma_destroy_id() has the same thing, where the lock is dropped;
that was the inspiration for this fix too. I believe that by setting the
cmid state to CMA_DESTROYING and still keeping it on the device's list,
it should be ok to drop the lock and have a racing cma_process_remove()
silently ignore this cmid.
Also notice that the destroying thread has to do a cma_detach_from_dev()
to dec the refcount on the device before the device structure can be
freed up.
By no means do I understand all the intricacies of the cma code,
hopefully Sean/others will review and comment.
Thanks.
Kanoj
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pp
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20071003/197042b1/attachment.ksh>
More information about the general
mailing list