[ofa-general] Re: multicast join failed for...

Thu Apr 12 10:16:32 PDT 2007

> Quoting Ira Weiny <weiny2 at llnl.gov>:
> Subject: Re: [ofa-general] Re: multicast join failed for...
> 
> On Thu, 12 Apr 2007 07:21:55 +0300
> "Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote:
> 
> > > Quoting Ira Weiny <weiny2 at llnl.gov>:
> > > Subject: Re: [ofa-general] Re: multicast join failed for...
> > > 
> > > On 11 Apr 2007 17:45:54 -0400
> > > Hal Rosenstock <halr at voltaire.com> wrote:
> > > 
> > > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote:
> > > > 
> > > > >  - previously we had some client failing join
> > > > > which is worse.
> > > > 
> > > > Maybe not. Maybe that's what the admin wants (to keep the higher rate
> > > > rather than degrade the group due to some link issue).
> > > > 
> > > 
> > > Indeed, on a big cluster it would be better to have a few nodes dropped out
> > > than to limit the speed of the entire cluster.
> > 
> > Why are you joining these nodes then?
> > Anyway, could always be an option.
> > 
> 
> We have seen a specific example where a node's 4X link comes up at 1X.

I think the way to do it is to make it possible to force the endnode link to
a specific rate. You can already do this with a simple script
from userspace, by testing the link rate once the link comes up,
and downing the link if the rate is lower than what you want.
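
A rough sketch of such a script, in Python. It assumes the port's sysfs
"rate" file reads like "10 Gb/sec (4X)". The function names and the
down_link hook are made up for illustration; how the link is actually taken
down is left to the caller.

```python
import re
from pathlib import Path

SYSFS_ROOT = "/sys/class/infiniband"  # usual location of the IB sysfs tree


def parse_width(rate_text):
    """Extract the link width from text such as '10 Gb/sec (4X)'."""
    match = re.search(r"\((\d+)X", rate_text)
    if match is None:
        raise ValueError(f"unrecognized rate string: {rate_text!r}")
    return int(match.group(1))


def link_is_acceptable(rate_text, min_width=4):
    """True if the negotiated width is at least min_width."""
    return parse_width(rate_text) >= min_width


def enforce(device, port, down_link, min_width=4, sysfs_root=SYSFS_ROOT):
    """Read the port's rate file; call down_link() if the link is too narrow.

    down_link is a caller-supplied (hypothetical) callable that takes the
    link down by whatever means the site prefers.
    Returns True if the link was left up, False if down_link was called.
    """
    rate_file = Path(sysfs_root) / device / "ports" / str(port) / "rate"
    if link_is_acceptable(rate_file.read_text(), min_width):
        return True
    down_link(device, port)
    return False
```

You would call something like enforce("mthca0", 1, my_down_link) once the
port is active; the device name here is only an example.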

If you think it's important, it would also be quite trivial to
make it possible to disable 1x support through a sysfs interface.
This way, the link will come up as 4x or not come up at all.
Would that be useful?

> In this
> case we would want the join to fail.  Basically a single hardware error,
> isolated to 1 node, should not affect the other 1150 nodes,

As far as I know, there are *a lot* of ways in which a problem at
1 node will affect others on the same subnet. Do I have to give examples?
I don't see why we have to single out a specific instance (incorrect
link rate at endnode) and handle it differently.

> which could very well be running a user's job.

The job will continue running though, and when you diagnose the problem
and disconnect the bad node, the rate will go back to high.
So what's the problem?

> 
> Certainly if there is a heterogeneous network we would want different behavior
> but we don't operate any of our clusters like that.  After reading today's posts
> I think it should be an option.

Yes. I think the option belongs at the endnodes, as outlined above.

> If someone has a mixture they can configure
> it.  I am not sure what the default should be though.  I know we would want
> the join to fail, but I understand the argument to allow it to work.

This likely means that you have a sideband interconnect infrastructure
besides IPoIB. Otherwise, if the join fails, you don't even have a
way to debug the problem.

-- 
MST