[ewg] Re: OFED 1.2 beta blocking bugs

Sean Hefty sean.hefty at intel.com
Thu Mar 8 10:56:11 PST 2007


>Not sure what you're asking, but just to be clear, this IPoIB HA is
>entirely in userspace (it's a crazy perl script that ups and downs
>ports in response to various events).

Thanks - this helps.

>From a quick look at the code, it does look like there are some races
>in ipoib_multicast.c.  The place where a QP is actually attached to a
>group is essentially (trimming debug prints):
>
>		if (test_and_set_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags))
>			return 0;
>
>		ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid),
>					 &mcast->mcmember.mgid);
>
>and the place where a QP is detached is:
>
>	if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
>		ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid),
>					 &mcast->mcmember.mgid);
>
>with no further locking.  So it looks entirely possible for one thread
>to do the test_and_set_bit(), and then have another thread come in and
>do the test_and_clear_bit (which will show the bit as set) and call
>ipoib_mcast_detach() before the first thread has reached the actual
>call to ipoib_mcast_attach.

I was looking at this part of the code as well, and this explains the error
messages in the bug report:

	ib1: ib_detach_mcast failed (result = -22)
	ib1: ipoib_mcast_detach failed (result = -22)

But it seems like this would leave the QP incorrectly attached to the multicast
group.  It's still not clear to me why we see the following message:

	ib1: dev_queue_xmit failed to requeue packet

or why traffic stops.
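
To make that interleaving concrete, here is a rough userspace replay of it in
one thread.  This is only a model, not the driver code: struct race_model and
replay_race() are made-up names, atomic_exchange() stands in for
test_and_set_bit()/test_and_clear_bit(), and the -22 returned by the detach
stand-in is an assumption based on the log above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the two pieces of state involved in the race. */
struct race_model {
	atomic_bool flag_attached;	/* models IPOIB_MCAST_FLAG_ATTACHED */
	bool hw_attached;		/* models "QP is really attached" */
};

/*
 * Replay the suspected interleaving step by step and return what the
 * detach stand-in reported.
 */
static int replay_race(struct race_model *m)
{
	int detach_ret = 0;
	bool was_set;

	atomic_init(&m->flag_attached, false);
	m->hw_attached = false;

	/* Join path: test-and-set sees the bit clear and sets it ... */
	was_set = atomic_exchange(&m->flag_attached, true);
	/* ... but is preempted before it reaches the attach call. */

	/* Leave path: test-and-clear sees the bit set, so it detaches. */
	if (atomic_exchange(&m->flag_attached, false)) {
		/* Assumed: detaching a QP that is not attached gives -22. */
		detach_ret = m->hw_attached ? 0 : -22;
		m->hw_attached = false;
	}

	/* Join path resumes and performs the attach. */
	if (!was_set)
		m->hw_attached = true;

	return detach_ret;
}
```

In this model the detach fails, the flag ends up clear, and the QP is still
attached, so nothing would detach it later.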

>Maybe the solution is just to take the mcast_mutex around the full
>operation.  There's some hokey and very old stuff around the multicast
>attach and detach verbs calls too.

ipoib_mcast_attach() is only called from the multicast module callback.  If
ib_sa_free_multicast() were called earlier in ipoib_mcast_leave(), it would
block until the callback completed, which should avoid the race as well.  I need
to spend more time studying the code to see if this works in all cases.
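
For comparison, the "take the mutex around the full operation" idea could be
sketched in userspace roughly as below.  This is an illustration under stated
assumptions, not a patch: a pthread mutex stands in for mcast_mutex, struct
mcast_model and the model_*() helpers are made-up names, and the stand-ins for
the attach/detach calls are assumed to behave as simply as shown.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the per-group state. */
struct mcast_model {
	pthread_mutex_t lock;	/* plays the role of mcast_mutex */
	bool flag_attached;	/* plays the role of IPOIB_MCAST_FLAG_ATTACHED */
	bool hw_attached;	/* whether the "QP" is really attached */
};

static void model_init(struct mcast_model *m)
{
	pthread_mutex_init(&m->lock, NULL);
	m->flag_attached = false;
	m->hw_attached = false;
}

/* Join: the flag test and the attach happen under one lock hold. */
static int model_join(struct mcast_model *m)
{
	pthread_mutex_lock(&m->lock);
	if (!m->flag_attached) {
		m->flag_attached = true;
		m->hw_attached = true;	/* stand-in for the attach call */
	}
	pthread_mutex_unlock(&m->lock);
	return 0;
}

/* Leave: the flag test and the detach happen under one lock hold. */
static int model_leave(struct mcast_model *m)
{
	int ret = 0;

	pthread_mutex_lock(&m->lock);
	if (m->flag_attached) {
		m->flag_attached = false;
		/* Assumed: detaching when not attached would give -22. */
		ret = m->hw_attached ? 0 : -22;
		m->hw_attached = false;
	}
	pthread_mutex_unlock(&m->lock);
	return ret;
}
```

With the lock held across both steps, a leave can never observe the flag set
while the attach is still pending, so the flag and the attach state stay in
step in this model.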

- Sean



