[openib-general] Failed multicast join with new multicast module

Fri Jun 9 14:18:58 PDT 2006

Hal Rosenstock wrote:
> What does mesh mean in this instance ? How do you know the multicast
> routing tables are indeed valid and that the SM didn't corrupt them ?
> (Why did the SM need restarting ?)

I meant that the values agree with each other, and there are no conflicts.

> The MLID is supplied by the SA in response to a group request from the
> end node, not the other way around. The end node doesn't tell the SA
> what MLID to use for a group.

One of the ideas is for the end nodes to provide this data, even if that means 
extending the architecture.

The problem is that the SA lost its state, but the network is working fine.  The 
end nodes know which groups they have joined and the mapping of MGIDs to MLIDs. 
And the switches are already programmed correctly.

Even if we have the ability for an SM to transparently fail over to another SM, 
because of the architecture, the end nodes are being forced to assume that all 
multicast group information has been lost.

How about this?  What if the end nodes only re-joined their groups on LID_CHANGE 
or CLIENT_REREGISTER events?  That is, an SM_CHANGE would not result in clients 
needing to rejoin any groups.  This puts the burden on the SM to generate a 
CLIENT_REREGISTER event only if it's needed.  SMs that can fail over and 
maintain multicast state in the process would be able to do so.
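
To make the proposed policy concrete, here is a rough sketch of the end-node 
side.  It is a toy model, not kernel code: the class and method names are 
invented for illustration, and only the three event names come from the 
discussion above.

```python
from enum import Enum, auto


class PortEvent(Enum):
    """Port events named in this thread (modelled here, not the kernel enum)."""
    LID_CHANGE = auto()
    SM_CHANGE = auto()
    CLIENT_REREGISTER = auto()


# Proposed policy: only these events make the end node rejoin its groups.
# SM_CHANGE is deliberately absent.
REJOIN_EVENTS = frozenset({PortEvent.LID_CHANGE, PortEvent.CLIENT_REREGISTER})


class EndNode:
    """Hypothetical end node that remembers its own multicast joins."""

    def __init__(self):
        # MGID -> MLID, as handed out by the SA when the group was joined.
        self.groups = {}

    def join(self, mgid, mlid):
        self.groups[mgid] = mlid

    def handle_event(self, event):
        """Return the MGIDs that must be rejoined for this event."""
        if event not in REJOIN_EVENTS:
            # An SM that failed over with its multicast state intact sends
            # nothing further, so existing joins are left alone.
            return []
        return sorted(self.groups)


node = EndNode()
node.join("ff12:401b::1", 0xC001)
print(node.handle_event(PortEvent.SM_CHANGE))          # nothing to rejoin
print(node.handle_event(PortEvent.CLIENT_REREGISTER))  # rejoin every group
```

An SM that cannot preserve multicast state across a failover would follow its 
SM_CHANGE with a CLIENT_REREGISTER, which gets today's behavior back.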

- Sean