[ofa-general] Re: multicast join failed for...

Fri Apr 13 09:08:39 PDT 2007

On 13 Apr 2007 07:37:04 -0400
Hal Rosenstock <halr at voltaire.com> wrote:

> On Fri, 2007-04-13 at 00:17, Michael S. Tsirkin wrote:
>
> > Quoting Ira Weiny <weiny2 at llnl.gov>:
> > Subject: Re: [ofa-general] Re: multicast join failed for...
> > > 
> > > On Thu, 12 Apr 2007 20:16:32 +0300
> > > "Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote:
> > > > 
> > > > The job will continue running though, and when you diagnose the problem
> > > > and disconnect the bad node, rate will be back to high.
> > > > So what's the problem?
> > > 
> > > Performance impact between the time it happens and diagnosing the problem.
> > > Yes, disabling the node is a better solution, however, the current behavior is
> > > not bad for us.
> > 
> > Hal, here we have a use case that I think shows that the right thing
> > is by default to make joins succeed. Convinced?
> 
> Didn't Ira say that "the current behavior is not bad for us" ? The
> current behavior is default 4x SDR rate which makes slower joins fail.
> 
> Are you saying change the default rate to 1x SDR ? I've been concerned
> about masking performance issues when doing this as we've discussed
> several times before.
> 

Indeed, I said "NOT" bad.  We do NOT want the performance to come down.  If this
happens silently on a Friday night, the cluster could run all weekend at a
reduced rate.

I think a check on the node's link is a good idea.  It would also do a better
job of diagnosing the problem.
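
A rough sketch of what such a link check might look like, assuming the group
rate stays at the 4x SDR default mentioned above.  The function names and the
node name are made up for illustration; this is not an existing OpenSM or
diagnostic-tool interface, and a real check would read the width and speed from
the port itself.

```python
# Per-lane signalling rate in Gb/s for each InfiniBand link speed.
LANE_GBPS = {"SDR": 2.5, "DDR": 5.0}


def link_rate_gbps(width, speed):
    """Aggregate signalling rate of a link, e.g. 4x SDR -> 10 Gb/s."""
    return width * LANE_GBPS[speed]


def check_link(node, width, speed, group_width=4, group_speed="SDR"):
    """Return None if the node's link meets the group rate, otherwise a
    diagnostic string.  'node' is a hypothetical label used for reporting."""
    have = link_rate_gbps(width, speed)
    need = link_rate_gbps(group_width, group_speed)
    if have >= need:
        return None
    return (f"{node}: link is {width}x {speed} ({have:g} Gb/s), "
            f"below group rate {group_width}x {group_speed} ({need:g} Gb/s)")


if __name__ == "__main__":
    # A link that trained down to 1x SDR would be reported here,
    # instead of only showing up as a failed join.
    print(check_link("node42", 1, "SDR"))
```

Run periodically, something like this would flag a degraded link by name rather
than leaving only a failed join to go on.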

Thanks,
Ira