[ofa-general] [BUG report / PATCH] fix race in the core multicast management

Tue Sep 18 10:12:21 PDT 2007

> It is possible for the multicast consumer to call ib_sa_free_multicast() where
> this leave request is queued to be later processed by the workqueue thread, and
> then call ib_sa_join_multicast() which calls acquire_group() --before-- the leave
> request was executed by the thread. So the lookup done by acquire_group() succeeds,
> the code goes to the found: label and the group reference count climbs to (e.g.) 2.

Yes - this is possible.  Note that although the group reference count is 
2, joins are tracked in different lists: active_list or pending_list. 
The second join doesn't move to the active_list until it's processed by 
the callback thread, to synchronize against errors and leaves.

> Following that the leave work-element causes the thread to just dec the
> reference count to 1 in release_group() and do nothing else, and the join
> work-element causes the thread to return the cached address-handle attributes
> to the consumer. So no sa query is being sent to the SA.

This sounds like the correct behavior.
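
To make the sequence concrete, here is a toy model of the reference counting
described above. It is only a sketch, not the kernel code: every toy_* name is
hypothetical, and it assumes that a leave reaches the SA only when the last
reference is dropped and that a join on an already-joined group is answered
from cached attributes.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: names are hypothetical, not taken from the kernel source. */
struct toy_group {
    int refcount;    /* references held by joined or joining members */
    bool cached;     /* true once the SA has answered a join */
    int sa_queries;  /* join queries sent to the SA */
    int sa_leaves;   /* leave requests sent to the SA */
};

/* Consumer context: a join takes its reference immediately
 * (the role acquire_group() plays in the discussion). */
static void toy_join_request(struct toy_group *g)
{
    g->refcount++;
}

/* Worker context: process a queued join. */
static void toy_process_join(struct toy_group *g)
{
    if (!g->cached) {
        g->sa_queries++;   /* first join: ask the SA */
        g->cached = true;
    }
    /* otherwise: answer from cached attributes, no SA query */
}

/* Worker context: process a queued leave
 * (the role release_group() plays in the discussion). */
static void toy_process_leave(struct toy_group *g)
{
    assert(g->refcount > 0);
    if (--g->refcount == 0) {
        g->sa_leaves++;    /* assumed: last reference gone, tell the SA */
        g->cached = false;
    }
}

/* The reported ordering: the leave is queued, the new join takes its
 * reference before the worker runs, then the worker handles both. */
static struct toy_group toy_crossed_leave_join(void)
{
    struct toy_group g = {0};

    toy_join_request(&g);   /* original membership */
    toy_process_join(&g);   /* one SA query */

    toy_join_request(&g);   /* rejoin arrives first: refcount is 2 */
    toy_process_leave(&g);  /* queued leave: refcount drops to 1 */
    toy_process_join(&g);   /* served from cache */
    return g;
}
```

In this model the crossed leave/join ends with one reference, one SA query in
total, and no leave sent, which matches the behavior described in the report.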

> We saw the bug on a uniprocessor system running the ipath driver, where the
> consumer is ipoib and the group being the IPv4 broadcast. When we take down
> the link of the switch port connected to the device across the cable, ipoib
> rushes to leave the group and then join it. On this system the join "crosses
> the leave" and the SA does not take into account the node when computing the
> multicast routing of the group --> the node does not get the broadcast traffic.

Does the SA remove the node from the multicast group?  If the HCA port 
goes down, the multicast code will transition all existing multicast 
groups to the error state.  An error will be reported on active joins. 
Pending joins will be processed normally after error handling has completed.
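
As a rough illustration of that ordering, the sketch below models a port-down
event over a set of joins. It is a toy model with hypothetical names, not the
multicast code itself, and it assumes that a join which receives the error
drops off the active list until the user rejoins.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: names are hypothetical, not taken from the kernel source. */
enum toy_list { TOY_PENDING, TOY_ACTIVE, TOY_REMOVED };

struct toy_member {
    enum toy_list list;  /* which list the join currently sits on */
    int errors_seen;     /* error callbacks delivered to this join */
};

/* Handle a port-down event: report an error on every active join first,
 * then process the pending joins normally. Returns the number of errors
 * reported. */
static int toy_port_down(struct toy_member *m, size_t n)
{
    int reported = 0;
    size_t i;

    assert(m != NULL || n == 0);

    /* Error handling: only joins that were already active are notified. */
    for (i = 0; i < n; i++) {
        if (m[i].list == TOY_ACTIVE) {
            m[i].errors_seen++;
            m[i].list = TOY_REMOVED;  /* assumed: user must rejoin */
            reported++;
        }
    }

    /* After error handling completes, pending joins proceed as usual. */
    for (i = 0; i < n; i++) {
        if (m[i].list == TOY_PENDING)
            m[i].list = TOY_ACTIVE;
    }
    return reported;
}
```

A join that was still pending when the port went down sees no error in this
model; it simply becomes active once error handling is done.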

> For now we have applied a work around which causes the multicast code to
> call release_group() from ib_sa_free_multicast(). The workaround is
> implemented by using the patch below which causes mcast_groups_lost()
> to be called also when the port actually goes up, and set the group state
> to MCAST_ERROR such that the call to release_group() is not deferred (ipoib
> does leave/join for every event, namely both on link down and up).

I'm wondering if the problem isn't in ipoib.  When an error occurs on a 
multicast group, the group transitions into the error state, and the 
user is called back to let them know that they need to rejoin the group.
Since ipoib responds directly to port events and not multicast
callback errors, is there a chance ipoib missed the error notification?

In short, I'm still not sure where the problem lies.

- Sean