[ofa-general] Re: multicast join failed for...

Michael S. Tsirkin mst at dev.mellanox.co.il
Thu Apr 12 07:08:43 PDT 2007


> Quoting Hal Rosenstock <halr at voltaire.com>:
> Subject: Re: multicast join failed for...
> 
> On Wed, 2007-04-11 at 23:38, Michael S. Tsirkin wrote:
> > > Quoting Hal Rosenstock <halr at voltaire.com>:
> > > Subject: Re: multicast join failed for...
> > > 
> > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote:
> > > > > Quoting Hal Rosenstock <halr at voltaire.com>:
> > > > > Subject: Re: multicast join failed for...
> > > > > 
> > > > > On Wed, 2007-04-11 at 14:12, Michael S. Tsirkin wrote:
> > > > > > > > If yes, I'm actually not too happy with this.
> > > > > > > > 
> > > > > > > > Would something like the following heuristic work better?
> > > > > > > > - select the max rate between all participants
> > > > > > > 
> > > > > > > The issue is that one doesn't know all the participants in a group as
> > > > > > > they are joined dynamically.
> > > > > > > 
> > > > > > > (I think we've been over this aspect on the list several times in the
> > > > > > > past.)
> > > > > > 
> > > > > > That's why I suggest the fix, so that the rate is adapted
> > > > > > dynamically.
> > > > > > 
> > > > > > > > - when a host with lower rate joins, destroy the group
> > > > > > > 
> > > > > > > I don't think a group can be destroyed like this "underneath" its
> > > > > > > existing members.
> > > > > > > 
> > > > > > 
> > > > > > Of course it can. That's what happens when SM is restarted.
> > > > > 
> > > > > Client reregistration ? I don't like using that big hammer as a solution
> > > > > to this. Seems a little harsh to me.
> > > > 
> > > > I think it's not too bad
> > > 
> > > It requires all subscriptions to reregister. This affects more things
> > > than just multicast or even the groups affected which might not be all
> > > of the multicast groups. Hence BIG hammer.
> > 
> > Changing an option in opensm config requires restarting
> > opensm. Isn't that right?
> 
> Yes but that doesn't have to be the case going forward in terms of
> OpenSM reconfig.
> >
> >  So it's an even bigger hammer.
> 
> Restarting opensm is a slightly bigger hammer right now (than client
> reregistration) in the case the admin wants it "dynamic" but I suspect
> this only needs to be done once.

I think you forgot that currently one has to edit the config file;
just restarting opensm isn't enough :).
"Let the user decide for us" is a *HUGE* hammer - it usually solves
all problems, but at what cost?

> > > There could be a more
> > > graceful way to deal with this. I don't like using client reregister
> > > unless absolutely needed.
> > 
> > What are the other options that have the same functionality?
> 
> Perhaps a spec enhancement is possible to make this better.

Sure. Meanwhile, opensm will have to support legacy networks
too so I think we can start with the reregister solution.

> > > >  - previously we had some client failing join
> > > > which is worse.
> > > 
> > > Maybe not. Maybe that's what the admin wants (to keep the higher rate
> > > rather than degrade the group due to some link issue).
> > 
> > Rate could be an option, but I think generally people prefer
> > things working even if at a slower rate.
> 
> I think it's a coin flip.

I disagree. I think people that want the join to fail basically
just want to make debugging easy. We can help them without failing joins.

> I've seen it both ways and either way there
> are support questions.

I think we can solve this relatively easily: compare the bcast group
rate with local rate and have IPoIB produce a warning in log if these
do not match.
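
A rough sketch of the check I have in mind, in Python for brevity. The
rate table, logger name and function name are made up for illustration;
this is not IPoIB code, and the real check would live in the driver:

```python
import logging

log = logging.getLogger("ipoib-sketch")  # hypothetical logger name

# 4x link signalling rates in Gb/s, used only to compare rates.
RATE_GBPS = {"SDR": 10, "DDR": 20}


def check_bcast_rate(local_rate, bcast_rate):
    """Warn if the broadcast group rate differs from the local port rate.

    Returns True when the rates match and False otherwise. The join is
    never failed here; a mismatch only produces a log message.
    """
    if RATE_GBPS[local_rate] == RATE_GBPS[bcast_rate]:
        return True
    log.warning(
        "broadcast group rate %s does not match local port rate %s",
        bcast_rate,
        local_rate,
    )
    return False
```

The point is that a mismatch is reported but the join still goes through.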

This is similar to what we get with a USB 2.0 device in a slower USB slot,
and people seem to be happy with that.

> In the current scenario, it is join failures. In
> the proposed scenario, it is more subtle: performance implications and
> perhaps SA network storms.

I don't believe we'll see network storms: rate has to drop from DDR to SDR
only once.
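
To illustrate, here is a toy model of the heuristic as I read it, in
Python. The Group class and its fields are invented for this sketch and
do not correspond to opensm code; it assumes only SDR and DDR hosts and
that a group is recreated only when a joiner is slower than the current
group rate:

```python
# 4x link signalling rates in Gb/s, used only to order the rates.
RATE_GBPS = {"SDR": 10, "DDR": 20}


class Group:
    """Toy multicast group whose rate only ever moves downwards."""

    def __init__(self):
        self.rate = None       # set by the first join
        self.members = []
        self.recreations = 0   # times the group was destroyed and recreated

    def join(self, host, rate):
        if self.rate is None:
            # First member creates the group at its own rate.
            self.rate = rate
        elif RATE_GBPS[rate] < RATE_GBPS[self.rate]:
            # Slower host joins: destroy the group and recreate it at the
            # lower rate; existing members are expected to rejoin.
            self.rate = rate
            self.recreations += 1
        # A joiner at the same or a higher rate changes nothing.
        self.members.append(host)
```

With joins in the order DDR, DDR, SDR, DDR, SDR the group ends up at SDR
with a single recreation; later joins at either rate cause no further
churn.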

-- 
MST


