[ofa-general] Re: multicast join failed for...

Thu Apr 12 14:54:14 PDT 2007

On Thu, 12 Apr 2007 20:16:32 +0300
"Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote:

> > Quoting Ira Weiny <weiny2 at llnl.gov>:
> > Subject: Re: [ofa-general] Re: multicast join failed for...
> > 
> > On Thu, 12 Apr 2007 07:21:55 +0300
> > "Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote:
> > 
> > > > Quoting Ira Weiny <weiny2 at llnl.gov>:
> > > > Subject: Re: [ofa-general] Re: multicast join failed for...
> > > > 
> > > > On 11 Apr 2007 17:45:54 -0400
> > > > Hal Rosenstock <halr at voltaire.com> wrote:
> > > > 
> > > > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote:
> > > > > 
> > > > > >  - previously we had some client failing join
> > > > > > which is worse.
> > > > > 
> > > > > Maybe not. Maybe that's what the admin wants (to keep the higher rate
> > > > > rather than degrade the group due to some link issue).
> > > > > 
> > > > 
> > > > Indeed, on a big cluster it would be better to have a few nodes dropped out
> > > > than to limit the speed of the entire cluster.
> > > 
> > > Why are you joining these nodes then?
> > > Anyway, could always be an option.
> > > 
> > 
> > We have seen a specific example where a node's 4X link comes up at 1X.
> 
> I think the way to do it is to make it possible to force the endnode link to
> a specific rate. You can already do this with a simple script
> from userspace, by testing the link rate once it comes up,
> and downing the link if it's lower than what you want.
> 
> If you think it's important, it's also quite trivial to
> make it possible to disable 1x support through a sysfs interface.
> This way, the link will come up as 4x or not come up at all.
> Would that be useful?

Yes it would be useful.  Is this something I can do right now with OFED 1.1?
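
For reference, a minimal sketch of the userspace check described above, in
Python. It assumes the port rate is readable from
/sys/class/infiniband/<device>/ports/<port>/rate and looks like
"10 Gb/sec (4X)". The function names and the take_down callback are
hypothetical; how the link actually gets downed (for example, by an
administrative tool) is left to the caller.

```python
import re
from pathlib import Path

# Assumed sysfs layout: /sys/class/infiniband/<device>/ports/<port>/rate
SYSFS_IB = Path("/sys/class/infiniband")


def parse_link_width(rate_text):
    """Return the lane count from a string such as '10 Gb/sec (4X)'.

    Returns None when no width can be found in the text.
    """
    match = re.search(r"\((\d+)X", rate_text)
    return int(match.group(1)) if match else None


def link_is_degraded(rate_text, min_width=4):
    """True when the reported width is known and below min_width."""
    width = parse_link_width(rate_text)
    return width is not None and width < min_width


def check_port(device, port, take_down, min_width=4):
    """Read the port rate and call take_down(device, port) if degraded.

    take_down is a hypothetical callback supplied by the caller; this
    sketch does not itself disable the link.
    """
    rate_path = SYSFS_IB / device / "ports" / str(port) / "rate"
    rate_text = rate_path.read_text()
    if link_is_degraded(rate_text, min_width):
        take_down(device, port)
        return True
    return False
```

Such a check would be run once the link comes up, e.g.
check_port("mthca0", 1, my_down_function), where both the device name and
my_down_function are placeholders.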

> 
> 
> > In this
> > case we would want the join to fail.  Basically a single hardware error,
> > isolated to 1 node, should not affect the other 1150 nodes,
> 
> As far as I know, there are *a lot* of ways in which a problem at
> 1 node will affect others on the same subnet. Do I have to give examples?
> I don't see why we have to choose a specific instance (incorrect
> link rate at endnode) and handle it differently.
> 
> > which could very well be running a users job.
> 
> The job will continue running though, and when you diagnose the problem
> and disconnect the bad node, rate will be back to high.
> So what's the problem?

The performance impact between the time it happens and the time the problem is
diagnosed.  Yes, disabling the node is a better solution; however, the current
behavior is not bad for us.

> 
> > 
> > Certainly if there is a heterogeneous network we would want different behavior
> > but we don't operate any of our clusters like that.  After reading today's posts
> > I think it should be an option.
> 
> Yes. I think the option belongs at the endnodes, as outlined above.

Yes, that would be a good solution as well.

> 
> > If someone has a mixture they can configure
> > it.  I am not sure what the default should be though.  I know we would want
> > the join to fail, but I understand the argument to allow it to work.
> 
> This likely means that you have a sideband interconnect infrastructure
> beside IPoIB. Otherwise, if the join fails, you don't even have a
> way to debug the problem.
> 

Yes, we do have this.  Like I said, I can see where this would be beneficial to
some users.

Ira