[ofa-general] [BUG report / PATCH] fix race in the core multicast management

Thu Sep 20 06:52:43 PDT 2007

Sean Hefty wrote:
> Yes - this is possible.  Note that although the group reference count is 
> 2, joins are tracked in different lists: active_list or pending_list. 
> The second join doesn't move to the active_list until it's processed by 
> the callback thread, to synchronize against errors and leaves.

I see. However, there is no second join here; it's a leave followed by a 
join, and the group refcount climbs to 2 because the join code increments 
it in its synchronous part, which is executed before the thread 
processes the leave request.

>> Following that the leave work-element causes the thread to just dec the
>> reference count to 1 in release_group() and do nothing else, and the join
>> work-element causes the thread to return the cached address-handle attributes
>> to the consumer. So no sa query is being sent to the SA.

> This sounds like the correct behavior.

I am not sure this is what we want from the core design. Say the 
consumer has some flexibility in the join request (e.g. through a future 
API change), such that they can join a group, leave it, then join the 
same group again with different "attributes". If the join then crosses 
the leave in a way that causes the core code not to issue SA leave/join 
queries, it's a bug from the perspective of the user.
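
To make the crossing concrete, here is a rough toy model in Python (not 
kernel C). All names in it (Group, run_worker, sa_queries, the string 
"attributes") are made up for illustration and do not correspond to the 
actual core multicast code; it only assumes what is described in this 
thread: the join takes its reference synchronously, and the leave drops 
its reference only when the thread processes it.

```python
from collections import deque


class Group:
    """Toy stand-in for a multicast group tracked by the core (hypothetical)."""

    def __init__(self):
        self.refcount = 0
        self.joined_at_sa = False  # does the SA consider us a member?
        self.cached_attrs = None   # attributes returned by the last SA join
        self.work = deque()        # work elements waiting for the thread
        self.sa_queries = []       # log of queries that would be sent to the SA

    def join(self, attrs):
        # Synchronous part of the join: take the reference right away,
        # then queue the rest for the thread.
        self.refcount += 1
        self.work.append(("join", attrs))

    def leave(self):
        # The reference is only dropped later, when the thread runs.
        self.work.append(("leave", None))

    def run_worker(self):
        """Process all queued work; return what each join handed back."""
        delivered = []
        while self.work:
            op, attrs = self.work.popleft()
            if op == "leave":
                self.refcount -= 1
                if self.refcount == 0:
                    # Last reference gone: really leave at the SA.
                    self.sa_queries.append(("leave", None))
                    self.joined_at_sa = False
                    self.cached_attrs = None
            elif self.joined_at_sa:
                # Group still looks joined: answer from the cache.
                delivered.append(self.cached_attrs)
            else:
                self.sa_queries.append(("join", attrs))
                self.joined_at_sa = True
                self.cached_attrs = attrs
                delivered.append(attrs)
        return delivered


def crossing_scenario():
    """Join, then leave + join again before the thread gets to run."""
    g = Group()
    g.join("A")
    g.run_worker()                # one SA join is sent
    g.leave()
    g.join("B")                   # refcount is now 2
    delivered = g.run_worker()    # leave only drops to 1, join hits the cache
    return g, delivered
```

In this model the consumer that asked for "B" is handed the cached "A" 
and the SA sees neither the leave nor the second join. If the thread 
runs between the leave and the join, all three queries go out.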

> Does the SA remove the node from the multicast group?  If the HCA port 
> goes down, the multicast code will transition all existing multicast 
> groups to the error state.  An error will be reported on active joins. 
> Pending joins will be processed normally after error handling has 
> completed.

OK, but on this specific host system there was no port down event! The 
only event that the multicast and ipoib code got was port active. This 
is why the patch I sent solves (hides) the problem: it causes the 
multicast code to transition the group into the error state, so the 
ipoib join that follows causes an SA join query to actually be sent.

> I'm wondering if the problem isn't in ipoib.  When an error occurs on a 
> multicast group, the group transitions into the error state, and the 
> user is called back to let them know that they need to rejoin the group. 
>  Since ipoib responds directly to port events and not multicast callback 
> errors, is there a chance ipoib missed the error notification?

I don't think there's a problem in ipoib; it just does not rely on 
multicast error notifications but rather on port events. Do you think 
it's less robust, and if so, why?

We decided to use port notifications in a user space multicast app as 
well. The multicast error notification is also delivered on port error, 
and as long as the port is down we can't really join, so instead of 
implementing timers/retries, we join on port events (active, sm/my lid 
change, client re-register, etc).
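
As a rough sketch of that approach (Python for brevity; PortEvent, 
McastClient and send_join are hypothetical names for illustration, not 
the verbs API or our actual app code), the consumer simply re-issues its 
joins on every event that indicates the port is usable again or that the 
SM may have lost its state:

```python
from enum import Enum, auto


class PortEvent(Enum):
    """Hypothetical event names mirroring the ones listed above."""
    PORT_ACTIVE = auto()
    PORT_ERR = auto()
    LID_CHANGE = auto()
    SM_CHANGE = auto()
    CLIENT_REREGISTER = auto()


# Events after which a (re)join is worth attempting.
REJOIN_EVENTS = {
    PortEvent.PORT_ACTIVE,
    PortEvent.LID_CHANGE,
    PortEvent.SM_CHANGE,
    PortEvent.CLIENT_REREGISTER,
}


class McastClient:
    def __init__(self, groups, send_join):
        self.groups = list(groups)   # groups this consumer wants to be in
        self.send_join = send_join   # callback that issues one join request
        self.port_up = False

    def on_port_event(self, event):
        """Return the list of groups for which a join was issued."""
        if event is PortEvent.PORT_ERR:
            # Port is down: joining cannot succeed, so do nothing and
            # wait for the next event instead of arming a retry timer.
            self.port_up = False
            return []
        if event in REJOIN_EVENTS:
            self.port_up = True
            for group in self.groups:
                self.send_join(group)
            return list(self.groups)
        return []
```

No timer is needed in this scheme because the next join attempt is 
triggered by the next port event.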

Or.