[ofa-general] Re: multicast join failed for...

Ira Weiny weiny2 at llnl.gov
Thu Apr 12 08:46:23 PDT 2007


On Thu, 12 Apr 2007 07:21:55 +0300
"Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote:

> > Quoting Ira Weiny <weiny2 at llnl.gov>:
> > Subject: Re: [ofa-general] Re: multicast join failed for...
> > 
> > On 11 Apr 2007 17:45:54 -0400
> > Hal Rosenstock <halr at voltaire.com> wrote:
> > 
> > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote:
> > > 
> > > >  - previously we had some client failing join
> > > > which is worse.
> > > 
> > > Maybe not. Maybe that's what the admin wants (to keep the higher rate
> > > rather than degrade the group due to some link issue).
> > > 
> > 
> > Indeed, on a big cluster it would be better to have a few nodes dropped out
> > than to limit the speed of the entire cluster.
> 
> Why are you joining these nodes then?
> Anyway, could always be an option.
> 

We have seen a specific example where a node's 4X link comes up at 1X.  In
this case we would want the join to fail.  Basically, a single hardware error
isolated to one node should not affect the other 1150 nodes, which could very
well be running a user's job.

Certainly, if there is a heterogeneous network we would want different
behavior, but we don't operate any of our clusters like that.  After reading
today's posts I think it should be an option.  If someone has a mixture of
link rates, they can configure it.  I am not sure what the default should be,
though.  I know we would want the join to fail, but I understand the argument
for allowing it to succeed.
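
A rough sketch of what such an option might look like, in Python and purely
for illustration.  The names RatePolicy and handle_join are made up here and
are not taken from OpenSM or any other SM; the example rates assume SDR
signaling, where a 4X link is 10 Gb/s and a 1X link is 2.5 Gb/s.

```python
from enum import Enum


class RatePolicy(Enum):
    """Hypothetical admin option for joins from ports slower than the group."""
    FAIL_JOIN = "fail_join"          # reject the slow port, keep the group rate
    DEGRADE_GROUP = "degrade_group"  # admit the slow port, lower the group rate


def handle_join(group_rate_gbps, port_rate_gbps, policy):
    """Decide a join request; returns (accepted, resulting_group_rate_gbps)."""
    if port_rate_gbps >= group_rate_gbps:
        # The port can sustain the group rate, so nothing changes.
        return True, group_rate_gbps
    if policy is RatePolicy.FAIL_JOIN:
        # One bad link must not slow down everyone else.
        return False, group_rate_gbps
    # DEGRADE_GROUP: the whole group drops to the slow port's rate.
    return True, port_rate_gbps


# A 4X link that came up at 1X, under each policy.
print(handle_join(10.0, 2.5, RatePolicy.FAIL_JOIN))
print(handle_join(10.0, 2.5, RatePolicy.DEGRADE_GROUP))
```

With FAIL_JOIN the broken node is the only one affected; with DEGRADE_GROUP
a mixed-rate fabric still gets every member into the group.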

Ira


