[openib-general] [RFC] [PATCH 2/7] ib_multicast 2.6.20: add ib_multicast module to track join requests from the same port

Thu Oct 12 11:13:32 PDT 2006

Quoting r. Sean Hefty <sean.hefty at intel.com>:
> Subject: RE: [openib-general] [RFC] [PATCH 2/7] ib_multicast 2.6.20: add ib_multicast module to track join requests from the same port
> 
> >But what is the issue? some kind of race?
> 
> If we look at just the ib_multicast patches as an example...
> 
> Calling ib_join_multicast allocates a struct ib_multicast that must be freed.
> Here's the relevant portion of ipoib's join callback:
> 
> @@ -325,11 +328,10 @@ ipoib_mcast_sendonly_join_complete(int s
>  		/* Clear the busy flag so we try again */
> +		status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY,
> +					    &mcast->flags);
>  	}
> +	return status;
>  }
> 
> The callback clears the busy flag, and frees the structure by returning a
> non-zero value from the callback.  (This is convenient for error handling.)  Let
> the callback thread hang around right at the return statement for a while.
> 
> When ipoib is unloaded, one of the calls it makes during cleanup is
> ipoib_mcast_leave(), which does:
> 
> 	if(test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
> 		ib_free_multicast(mcast->mc);
> 
> If ipoib_mcast_leave() is called at the same time that an error is reported
> through the callback, it's possible that the struct ib_multicast will be freed
> by the callback thread.  But there's nothing to prevent the callback thread from
> executing in the ipoib code after unload has occurred.
> 
> Similar issues can apply to ib_cm and rdma_cm.
> 
> - Sean
> 

But unlike the sa races which were unfixable without API changes,
here users can synchronize the removal of the mc object.
So I think what you describe is a user error.

For example SDP does 
lock
cm_id = ssk->cm_id
ssk->cm_id = NULL
unlock

if (cm_id)
destroy(cm_id)

and in callback

ssk = context
lock
if (!ssk->cm_id) {
	unlock
	return 0;
}

...

-- 
MST