[openib-general] Failed multicast join with new multicast module
Sean Hefty
mshefty at ichips.intel.com
Fri Jun 9 14:18:58 PDT 2006
Hal Rosenstock wrote:
> What does mesh mean in this instance ? How do you know the multicast
> routing tables are indeed valid and that the SM didn't corrupt them ?
> (Why did the SM need restarting ?)
I meant that the values agree with each other, and there are no conflicts.
> The MLID is supplied by the SA in response to a group request from the
> end node, not the other way around. The end node doesn't tell the SA
> what MLID to use for a group.
One of the ideas is for the end nodes to provide this data, even if that means
extending the architecture.
The problem is that the SA lost its state, but the network is working fine. The
end nodes know which groups they have joined and the mapping of MGIDs to MLIDs.
And the switches are already programmed correctly.
Even if we have the ability for an SM to transparently fail over to another SM,
because of the architecture, the end nodes are being forced to assume that all
multicast group information has been lost.
How about this? What if the end nodes only re-joined their groups on LID_CHANGE
or CLIENT_REREGISTER events? That is, an SM_CHANGE would not result in clients
needing to rejoin any groups. This puts the burden on the SM to generate a
CLIENT_REREGISTER event only if it's needed. SMs that can fail over and
maintain multicast state in the process would be able to do so.
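
As a rough illustration of the rule proposed above, here is a minimal sketch of the end-node decision. The names PortEvent, REJOIN_EVENTS and groups_to_rejoin are hypothetical and not taken from any existing module, and the MGID strings are placeholders; the only thing it tries to capture is that SM_CHANGE alone does not trigger a rejoin.

```python
from enum import Enum, auto


class PortEvent(Enum):
    # Hypothetical stand-ins for the port events named in the proposal.
    LID_CHANGE = auto()
    SM_CHANGE = auto()
    CLIENT_REREGISTER = auto()


# Under the proposal, only these events make the end node rejoin its groups.
REJOIN_EVENTS = {PortEvent.LID_CHANGE, PortEvent.CLIENT_REREGISTER}


def groups_to_rejoin(event, joined_mgids):
    """Return the MGIDs the end node should rejoin after `event`."""
    if event in REJOIN_EVENTS:
        return list(joined_mgids)
    # SM_CHANGE: keep the existing joins and the known MGID-to-MLID mapping.
    return []
```

With this rule, an SM that fails over while preserving multicast state simply never raises CLIENT_REREGISTER, and the end nodes leave their joins alone.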
- Sean