[ewg] Re: OFED 1.2 beta blocking bugs
Roland Dreier
rdreier at cisco.com
Thu Mar 8 10:34:56 PST 2007
> Is there a way to reproduce this with the standard linux build? (I
> didn't closely follow the original IPOIB HA threads, so I will look
> back over those.)
Not sure what you're asking, but just to be clear, this IPoIB HA is
entirely in userspace (it's a crazy perl script that ups and downs
ports in response to various events).
From a quick look at the code, it does look like there are some races
in ipoib_multicast.c. The place where a QP is actually attached to a
group is essentially (trimming debug prints):
    if (test_and_set_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags))
        return 0;

    ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid),
                             &mcast->mcmember.mgid);
and the place where a QP is detached is:
    if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
        ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid),
                                 &mcast->mcmember.mgid);
with no further locking. So it looks entirely possible for one thread
to do the test_and_set_bit(), and then have another thread come in and
do the test_and_clear_bit (which will show the bit as set) and call
ipoib_mcast_detach() before the first thread has reached the actual
call to ipoib_mcast_attach.
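
The interleaving described above can be replayed deterministically in a small userspace model. This is only a sketch, not the driver code: struct race_model, model_test_and_set(), model_test_and_clear() and replay_race() are hypothetical names, C11 atomics stand in for the kernel bit operations, and a plain bool stands in for the real QP attachment.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define FLAG_ATTACHED 1u   /* stand-in for IPOIB_MCAST_FLAG_ATTACHED */

/* Hypothetical model of the state involved in the race. */
struct race_model {
	atomic_uint flags;   /* stand-in for mcast->flags */
	bool qp_attached;    /* true while the QP is "really" attached */
};

/* Sets the bit and returns its old value, like test_and_set_bit(). */
static bool model_test_and_set(struct race_model *m)
{
	return atomic_fetch_or(&m->flags, FLAG_ATTACHED) & FLAG_ATTACHED;
}

/* Clears the bit and returns its old value, like test_and_clear_bit(). */
static bool model_test_and_clear(struct race_model *m)
{
	return atomic_fetch_and(&m->flags, ~FLAG_ATTACHED) & FLAG_ATTACHED;
}

/*
 * Replays the bad interleaving step by step on a single thread.
 * Returns true if the flag and the real attachment state disagree
 * at the end.
 */
bool replay_race(struct race_model *m)
{
	atomic_init(&m->flags, 0);
	m->qp_attached = false;

	/* Thread A (join path): sets the bit but has not attached yet. */
	bool a_was_set = model_test_and_set(m);

	/* Thread B (leave path): sees the bit set, so it "detaches"
	 * a QP that was never attached. */
	if (model_test_and_clear(m))
		m->qp_attached = false;

	/* Thread A resumes and performs the attach it committed to. */
	if (!a_was_set)
		m->qp_attached = true;

	bool flag_set = atomic_load(&m->flags) & FLAG_ATTACHED;
	return flag_set != m->qp_attached;
}
```

In this model the replay ends with the flag clear while the QP is still attached, so a later leave would skip the detach entirely.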
Maybe the solution is just to take the mcast_mutex around the full
operation. There's some hokey and very old stuff around the multicast
attach and detach verbs calls too. I'll post a patch later today if I
get a chance. Unfortunately I haven't really kept up with all the
OFED build stuff -- does anyone know an easy way for Scott to take a
kernel patch and rebuild his OFED install?
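
For illustration only, the idea of holding a mutex around the full operation might look roughly like the following userspace sketch. It is not the promised patch: a pthread mutex stands in for mcast_mutex, and struct mcast_model, fake_attach(), fake_detach(), model_join() and model_leave() are hypothetical names invented for this example.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the multicast group state. */
struct mcast_model {
	pthread_mutex_t lock;   /* plays the role of mcast_mutex */
	bool flag_attached;     /* plays the role of IPOIB_MCAST_FLAG_ATTACHED */
	bool qp_attached;       /* true while the QP is "really" attached */
	int attach_calls;
	int detach_calls;
};

void model_init(struct mcast_model *m)
{
	pthread_mutex_init(&m->lock, NULL);
	m->flag_attached = false;
	m->qp_attached = false;
	m->attach_calls = 0;
	m->detach_calls = 0;
}

/* Stand-ins for ipoib_mcast_attach() and ipoib_mcast_detach(). */
static int fake_attach(struct mcast_model *m)
{
	m->qp_attached = true;
	m->attach_calls++;
	return 0;
}

static int fake_detach(struct mcast_model *m)
{
	if (!m->qp_attached)
		return -1;      /* detaching something never attached */
	m->qp_attached = false;
	m->detach_calls++;
	return 0;
}

/* The flag test and the attach happen under one lock, so a leave
 * cannot slip in between them. */
int model_join(struct mcast_model *m)
{
	int ret = 0;

	pthread_mutex_lock(&m->lock);
	if (!m->flag_attached) {
		m->flag_attached = true;
		ret = fake_attach(m);
	}
	pthread_mutex_unlock(&m->lock);
	return ret;
}

/* Likewise, the flag test and the detach form one critical section. */
int model_leave(struct mcast_model *m)
{
	int ret = 0;

	pthread_mutex_lock(&m->lock);
	if (m->flag_attached) {
		m->flag_attached = false;
		ret = fake_detach(m);
	}
	pthread_mutex_unlock(&m->lock);
	return ret;
}
```

With the lock held across both steps, the flag no longer needs to be an atomic bit in this model; whether that is acceptable in the real driver depends on which contexts call these paths.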
- R.