[ofa-general] Re: multicast join failed for...
weiny2 at llnl.gov
Fri Apr 13 09:08:39 PDT 2007
On 13 Apr 2007 07:37:04 -0400
Hal Rosenstock <halr at voltaire.com> wrote:
> On Fri, 2007-04-13 at 00:17, Michael S. Tsirkin wrote:
> > Quoting Ira Weiny <weiny2 at llnl.gov>:
> > Subject: Re: [ofa-general] Re: multicast join failed for...
> > >
> > > On Thu, 12 Apr 2007 20:16:32 +0300
> > > "Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote:
> > > >
> > > > The job will continue running though, and when you diagnose the problem
> > > > and disconnect the bad node, rate will be back to high.
> > > > So what's the problem?
> > >
> > > Performance impact between the time it happens and diagnosing the problem.
> > > Yes, disabling the node is a better solution, however, the current behavior is
> > > not bad for us.
> > Hal, here we have a use case that I think shows that the right thing
> > is by default to make joins succeed. Convinced?
> Didn't Ira say that "the current behavior is not bad for us" ? The
> current behavior is default 4x SDR rate which makes slower joins fail.
> Are you saying change the default rate to 1x SDR ? I've been concerned
> about masking performance issues when doing this as we've discussed
> several times before.
Indeed I said "NOT" bad. We do NOT want the performance to come down. If this
happens silently on a Friday night the cluster could run all weekend at a
reduced rate.

I am thinking that a check on the node's link is a good idea. It would also be
able to better diagnose the problem.
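
As a rough illustration of the kind of link check meant here, the sketch below compares each node's active link rate with the rate the multicast group expects and lists the nodes that fall short. It is only a sketch under stated assumptions, not OpenSM or IPoIB code: the names RATE_GBPS, can_join and slow_links and the sample node list are hypothetical, and the rates are the nominal SDR signaling rates (2.5 Gb/s per lane).

```python
# Hypothetical sketch; not taken from OpenSM or the IPoIB driver.
# Nominal InfiniBand SDR signaling rates in Gb/s (2.5 Gb/s per lane).
RATE_GBPS = {"1x SDR": 2.5, "4x SDR": 10.0, "12x SDR": 30.0}


def can_join(port_rate, group_rate):
    """Current behavior as described in the thread: a join fails when the
    port's link is slower than the group's rate."""
    return RATE_GBPS[port_rate] >= RATE_GBPS[group_rate]


def slow_links(links, expected_rate):
    """Return the names of nodes whose active link rate is below expected_rate."""
    return sorted(
        node for node, rate in links.items()
        if RATE_GBPS[rate] < RATE_GBPS[expected_rate]
    )


# Made-up example data: node2 has come up at 1x instead of 4x.
links = {"node1": "4x SDR", "node2": "1x SDR", "node3": "4x SDR"}
print(slow_links(links, "4x SDR"))
```

Run periodically, a check like this would flag the bad node by name instead of leaving a failed join (or a silently lowered group rate) as the only symptom.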