[ofa-general] [BUG report / PATCH] fix race in the core multicast management
Sean Hefty
mshefty at ichips.intel.com
Thu Sep 20 11:48:04 PDT 2007
> We saw the bug on a uni processor system running the ipath driver, where the
> consumer is ipoib and the group being the IPv4 broadcast. When we take down
> the link of the switch port connected to the device across the cable, ipoib
> rushes to leave the group and then join it. On this system the join "crosses
> the leave" and the SA does not take into account the node when computing the
> multicast routing of the group --> the node does not get the broadcast traffic.
I've read back over this description a few times, and I still don't
fully grok the problem. Can you clarify if the following sequence is
what's happening?
1. The node has joined the multicast group, meaning that the SA has
routed multicast traffic to the node.
2. You take down the link of the switch port that connects the node. Is
this done via a program?
3. The port is brought back online. This generates a PORT_ACTIVE event,
but the previous event was also PORT_ACTIVE.
4. ipoib leaves the group.
5. ipoib re-joins the group.
6. The multicast module isn't aware that any errors have occurred on the
multicast group, so it simply completes the join request from step 5 without
SA involvement.
If I'm understanding this, somewhere in the above sequence the multicast
routing to this node is lost. Either the SA removed the node from the
group, or the switch lost its routing tables, or ...?
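
To make the question concrete, here is a toy model of steps 1-6. It is only
a sketch of my reading of the sequence, not the actual multicast code: the
class and method names (ToySA, ToyMulticastModule, drop, and so on) are all
made up for illustration, and it simply assumes that the SA removes the node
from the group when the link goes down, which is exactly the part I'm asking
about.

```python
class ToySA:
    """Hypothetical stand-in for the SA's view of group membership."""

    def __init__(self):
        self.members = set()

    def join(self, node):
        self.members.add(node)

    def drop(self, node):
        # Assumption: the SA forgets the node when its link goes down.
        self.members.discard(node)


class ToyMulticastModule:
    """Hypothetical module that caches the join state of a single group."""

    def __init__(self, sa, node):
        self.sa = sa
        self.node = node
        self.cached_joined = False   # what the module believes
        self.pending_leave = False   # leave queued but not yet sent to the SA
        self.sa_requests = 0         # number of requests actually sent

    def join(self):
        if self.cached_joined:
            # The join "crosses the leave": the queued leave is cancelled
            # and the join completes from cached state, with no SA request.
            self.pending_leave = False
            return "completed locally"
        self.sa.join(self.node)
        self.sa_requests += 1
        self.cached_joined = True
        return "sent to SA"

    def leave(self):
        # Assumption: the leave is deferred rather than sent immediately.
        self.pending_leave = True


def run_sequence():
    sa = ToySA()
    mod = ToyMulticastModule(sa, "node1")
    mod.join()                 # step 1: initial join reaches the SA
    sa.drop("node1")           # steps 2-3: link down/up, SA drops the node
    mod.leave()                # step 4: ipoib leaves the group
    result = mod.join()        # steps 5-6: re-join completes locally
    return result, "node1" in sa.members, mod.sa_requests


if __name__ == "__main__":
    print(run_sequence())
```

Under those assumptions the re-join completes locally, the SA no longer
lists the node as a member, and only the original join request was ever
sent. If the SA does not drop the node on link down, the model does not
reproduce the problem, which is why I'd like to know where the routing is
actually lost.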
I'm also trying to understand how the problem would apply to a different
setup:
node 1 <-> switch A <-> switch B <-> switch C <-> SA
Suppose the same link down/up occurred between switch A and switch B.
What happens to the multicast members to the left of switch B? Will
node 1 see a PORT_ACTIVE event in this case as well?
- Sean