[ewg] Re: OFED 1.2 beta blocking bugs

Roland Dreier rdreier at cisco.com
Thu Mar 8 10:34:56 PST 2007


 > Is there a way to reproduce this with the standard linux build?  (I
 > didn't closely follow the original IPOIB HA threads, so I will look
 > back over those.)

Not sure what you're asking, but just to be clear, this IPoIB HA is
entirely in userspace (it's a crazy perl script that ups and downs
ports in response to various events).

From a quick look at the code, it does look like there are some races
in ipoib_multicast.c.  The place where a QP is actually attached to a
group is essentially (trimming debug prints):

		if (test_and_set_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags))
			return 0;

		ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid),
					 &mcast->mcmember.mgid);

and the place where a QP is detached is:

	if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
		ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid),
					 &mcast->mcmember.mgid);

with no further locking.  So it looks entirely possible for one thread
to do the test_and_set_bit(), and then have another thread come in and
do the test_and_clear_bit() (which will show the bit as set) and call
ipoib_mcast_detach() before the first thread has reached the actual
call to ipoib_mcast_attach().
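
To make that interleaving concrete, here is a small userspace model
that replays it step by step in a single thread.  It is only a sketch:
struct race_model, its fields, the model_*() helpers and replay_race()
are made-up names for illustration, not the real ipoib structures or
kernel bitops.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver state (not the real ipoib structs). */
struct race_model {
	bool flag_attached;	/* models IPOIB_MCAST_FLAG_ATTACHED */
	bool hw_attached;	/* models whether the QP is really attached */
	int bogus_detaches;	/* detach calls made while the QP was not attached */
};

/* Non-atomic imitations of test_and_set_bit()/test_and_clear_bit(); this is
 * fine here only because the replay below runs in a single thread. */
static bool model_test_and_set(bool *bit)
{
	bool old = *bit;

	*bit = true;
	return old;
}

static bool model_test_and_clear(bool *bit)
{
	bool old = *bit;

	*bit = false;
	return old;
}

/* Replay the problematic ordering and return the resulting state. */
struct race_model replay_race(void)
{
	struct race_model m = { false, false, 0 };

	/* Thread A: sets the flag, but has not attached the QP yet. */
	bool a_saw_flag = model_test_and_set(&m.flag_attached);

	assert(!a_saw_flag);	/* A is the first to set the flag */

	/* Thread B: sees the flag set, clears it, and detaches a QP
	 * that was never attached. */
	if (model_test_and_clear(&m.flag_attached)) {
		if (!m.hw_attached)
			m.bogus_detaches++;
		m.hw_attached = false;
	}

	/* Thread A resumes and performs the attach it had committed to. */
	m.hw_attached = true;

	return m;
}
```

In this model the replay ends with the QP attached while the flag is
clear, after one detach of a QP that was not attached, so the flag and
the real attachment state no longer agree.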

Maybe the solution is just to take the mcast_mutex around the full
operation.  There's some hokey and very old stuff around the multicast
attach and detach verbs calls too.  I'll post a patch later today if I
get a chance.  Unfortunately I haven't really kept up with all the
OFED build stuff -- does anyone know an easy way for Scott to take a
kernel patch and rebuild his OFED install?
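
As a rough userspace illustration of the "take the mutex around the
full operation" idea (not the actual patch), the flag update and the
attach or detach can be done under one lock so they always change
together.  Everything here is an assumption for the sketch: a pthread
mutex plays the role of mcast_mutex, and struct locked_model,
locked_join(), locked_leave() and run_locked_stress() are hypothetical
names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; the pthread mutex stands in for mcast_mutex. */
struct locked_model {
	pthread_mutex_t lock;
	bool flag_attached;	/* models IPOIB_MCAST_FLAG_ATTACHED */
	bool hw_attached;	/* models the real QP attachment */
};

/* Set the flag and "attach" inside one critical section. */
void locked_join(struct locked_model *m)
{
	pthread_mutex_lock(&m->lock);
	if (!m->flag_attached) {
		m->flag_attached = true;
		m->hw_attached = true;	/* stands in for ipoib_mcast_attach() */
	}
	assert(m->flag_attached == m->hw_attached);
	pthread_mutex_unlock(&m->lock);
}

/* Clear the flag and "detach" inside one critical section. */
void locked_leave(struct locked_model *m)
{
	pthread_mutex_lock(&m->lock);
	if (m->flag_attached) {
		m->flag_attached = false;
		m->hw_attached = false;	/* stands in for ipoib_mcast_detach() */
	}
	assert(m->flag_attached == m->hw_attached);
	pthread_mutex_unlock(&m->lock);
}

static void *join_loop(void *arg)
{
	for (int i = 0; i < 10000; i++)
		locked_join(arg);
	return NULL;
}

static void *leave_loop(void *arg)
{
	for (int i = 0; i < 10000; i++)
		locked_leave(arg);
	return NULL;
}

/* Run a joiner and a leaver concurrently; return true if the flag and
 * the modeled attachment state agree once both threads have finished. */
bool run_locked_stress(void)
{
	struct locked_model m = { .flag_attached = false, .hw_attached = false };
	pthread_t a, b;
	bool ok;

	pthread_mutex_init(&m.lock, NULL);
	pthread_create(&a, NULL, join_loop, &m);
	pthread_create(&b, NULL, leave_loop, &m);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	ok = (m.flag_attached == m.hw_attached);
	pthread_mutex_destroy(&m.lock);
	return ok;
}
```

With the lock held across both steps, no thread can observe the flag
set while the attach is still pending, which is the window described
above.  Whether the real code can sleep on a mutex at those call sites
is a separate question this sketch does not address.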

 - R.
