[openib-general] Failed multicast join with new multicast module

Hal Rosenstock halr at voltaire.com
Fri Jun 9 15:01:15 PDT 2006


On Fri, 2006-06-09 at 17:18, Sean Hefty wrote:
> Hal Rosenstock wrote:
> > What does mesh mean in this instance ? How do you know the multicast
> > routing tables are indeed valid and that the SM didn't corrupt them ?
> > (Why did the SM need restarting ?)
> 
> I meant that the values agree with each other, and there are no conflicts.

How are conflicts determined? The SA has no way of querying the end
nodes for their multicast information; currently it works the other way
around.

> > The MLID is supplied by the SA in response to a group request from the
> > end node, not the other way around. The end node doesn't tell the SA
> > what MLID to use for a group.
> 
> One of the ideas is for the end nodes to provide this data, even if that means 
> extending the architecture.

OK. What if the SM already put the MLID to use for something else ?
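
A minimal sketch of that conflict, assuming a hypothetical SA-side table
indexed by MLID (the names mlid_table and mlid_claim are illustrative only
and are not taken from any SM implementation). An end node reporting an
(MGID, MLID) pair is accepted only if the MLID is unused or already bound to
the same MGID. The sketch does not handle the same MGID being reported under
two different MLIDs.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Multicast LID range defined by the InfiniBand architecture. */
#define MLID_MIN   0xC000u
#define MLID_MAX   0xFFFEu
#define MLID_COUNT (MLID_MAX - MLID_MIN + 1u)

/* Hypothetical SA-side record: which MGID, if any, owns each MLID. */
struct mlid_slot {
    bool    in_use;
    uint8_t mgid[16];
};

struct mlid_table {
    struct mlid_slot slot[MLID_COUNT];
};

enum claim_result { CLAIM_OK, CLAIM_BAD_MLID, CLAIM_CONFLICT };

/*
 * Try to record an end-node-supplied (MGID, MLID) pair.
 * Succeeds if the MLID is free or already mapped to the same MGID;
 * reports a conflict if the MLID is already in use for another group.
 */
enum claim_result mlid_claim(struct mlid_table *t,
                             const uint8_t mgid[16], uint16_t mlid)
{
    struct mlid_slot *s;

    if (mlid < MLID_MIN || mlid > MLID_MAX)
        return CLAIM_BAD_MLID;

    s = &t->slot[mlid - MLID_MIN];
    if (s->in_use)
        return memcmp(s->mgid, mgid, 16) == 0 ? CLAIM_OK : CLAIM_CONFLICT;

    s->in_use = true;
    memcpy(s->mgid, mgid, 16);
    return CLAIM_OK;
}
```

What the SA should do on CLAIM_CONFLICT is exactly the open question here.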

> The problem is that the SA lost its state, but the network is working fine.

How does the SM know that the network is working fine ?

> The end nodes know which groups they have joined and the mapping of MGIDs to
> MLIDs. And the switches are already programmed correctly.

I'm not sure what constitutes a correctness criterion here.

> Even if we have the ability for an SM to transparently fail over to another SM, 
> because of the architecture, the end nodes are being forced to assume that all 
> multicast group information has been lost.

An SM which replicated its database would also replicate the
registrations, which include multicast, so this reregistration
shouldn't be necessary. But I don't know of a way for the end node
to tell whether the SM is doing this database replication.

> How about this?  What if the end nodes only re-joined their groups on LID_CHANGE 
> or CLIENT_REREGISTER events?  That is, an SM_CHANGE would not result in clients 
> needing to rejoin any groups.  This puts the burden on the SM to generate a 
> CLIENT_REREGISTER event only if it's needed.  SMs that can fail over and 
> maintain multicast state in the process would be able to do so.

I think more than this is needed.
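
For reference, the policy quoted above could be sketched roughly as follows.
The enum and function names are hypothetical, chosen to mirror the event
names used in this thread rather than any existing header, and the sketch
covers only the rejoin decision, not the additional handling that may be
needed.

```c
#include <stdbool.h>

/* Hypothetical event names for this sketch, mirroring the thread. */
enum port_event {
    EVT_SM_CHANGE,
    EVT_LID_CHANGE,
    EVT_CLIENT_REREGISTER,
    EVT_OTHER
};

/*
 * Proposed policy: an end node rejoins its multicast groups only on
 * LID_CHANGE or CLIENT_REREGISTER. An SM_CHANGE alone does not trigger
 * a rejoin; the SM is expected to generate CLIENT_REREGISTER if its
 * multicast state was actually lost.
 */
bool should_rejoin_groups(enum port_event evt)
{
    switch (evt) {
    case EVT_LID_CHANGE:
    case EVT_CLIENT_REREGISTER:
        return true;
    case EVT_SM_CHANGE:
    default:
        return false;
    }
}
```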

-- Hal

> - Sean




