[ofa-general] Re: multicast join failed for...
Ira Weiny
weiny2 at llnl.gov
Thu Apr 12 08:46:23 PDT 2007
On Thu, 12 Apr 2007 07:21:55 +0300
"Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote:
> > Quoting Ira Weiny <weiny2 at llnl.gov>:
> > Subject: Re: [ofa-general] Re: multicast join failed for...
> >
> > On 11 Apr 2007 17:45:54 -0400
> > Hal Rosenstock <halr at voltaire.com> wrote:
> >
> > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote:
> > >
> > > > - previously we had some client failing join
> > > > which is worse.
> > >
> > > Maybe not. Maybe that's what the admin wants (to keep the higher rate
> > > rather than degrade the group due to some link issue).
> > >
> >
> > Indeed, on a big cluster it would be better to have a few nodes dropped out
> > than to limit the speed of the entire cluster.
>
> Why are you joining these nodes then?
> Anyway, could always be an option.
>
We have seen a specific example where a node's 4X link comes up at 1X. In this
case we would want the join to fail. Basically, a single hardware error
isolated to one node should not affect the other 1150 nodes, which could very
well be running a user's job.
Certainly, if there is a heterogeneous network we would want different behavior,
but we don't operate any of our clusters like that. After reading today's posts,
I think it should be an option. If someone has a mixture they can configure
it. I am not sure what the default should be, though. I know we would want
the join to fail, but I understand the argument for allowing it to work.
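To make the two behaviors concrete, here is a rough sketch of what such an
option could look like. It is only an illustration: the names (JoinPolicy,
handle_join) are hypothetical and are not taken from OpenSM or any existing
SM code, and link width stands in for the full rate comparison.

```python
from enum import Enum


class JoinPolicy(Enum):
    """Hypothetical admin-selectable policy for a port slower than the group."""
    REJECT_SLOW_PORT = "reject"  # keep the group rate; the slow port's join fails
    DEGRADE_GROUP = "degrade"    # lower the group rate so the slow port can join


def handle_join(group_width, port_width, policy):
    """Decide a multicast join for a port with the given link width.

    Widths are plain integers (4 for 4X, 1 for 1X). This is a simplification:
    a real comparison would use the full link rate, not just the width.
    Returns (join_accepted, resulting_group_width).
    """
    if port_width >= group_width:
        # The port can sustain the group rate, so nothing changes.
        return True, group_width
    if policy is JoinPolicy.DEGRADE_GROUP:
        # Admit the slow port and drop the whole group to its width.
        return True, port_width
    # Refuse the slow port and leave the rest of the group untouched.
    return False, group_width
```

With REJECT_SLOW_PORT, the node whose 4X link came up at 1X is refused and the
other nodes stay at 4X. With DEGRADE_GROUP, that node joins and the whole group
runs at 1X.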
Ira