[ofa-general] Re: multicast join failed for...

Hal Rosenstock halr at voltaire.com
Fri Apr 13 05:29:17 PDT 2007


On Fri, 2007-04-13 at 00:17, Michael S. Tsirkin wrote:
> > Quoting Ira Weiny <weiny2 at llnl.gov>:
> > Subject: Re: [ofa-general] Re: multicast join failed for...
> > 
> > On Thu, 12 Apr 2007 20:16:32 +0300
> > "Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote:
> > 
> > > > Quoting Ira Weiny <weiny2 at llnl.gov>:
> > > > Subject: Re: [ofa-general] Re: multicast join failed for...
> > > > 
> > > > On Thu, 12 Apr 2007 07:21:55 +0300
> > > > "Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote:
> > > > 
> > > > > > Quoting Ira Weiny <weiny2 at llnl.gov>:
> > > > > > Subject: Re: [ofa-general] Re: multicast join failed for...
> > > > > > 
> > > > > > On 11 Apr 2007 17:45:54 -0400
> > > > > > Hal Rosenstock <halr at voltaire.com> wrote:
> > > > > > 
> > > > > > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote:
> > > > > > > 
> > > > > > > >  - previously we had some client failing join
> > > > > > > > which is worse.
> > > > > > > 
> > > > > > > Maybe not. Maybe that's what the admin wants (to keep the higher rate
> > > > > > > rather than degrade the group due to some link issue).
> > > > > > > 
> > > > > > 
> > > > > > Indeed, on a big cluster it would be better to have a few nodes dropped out
> > > > > > than to limit the speed of the entire cluster.
> > > > > 
> > > > > Why are you joining these nodes then?
> > > > > Anyway, could always be an option.
> > > > > 
> > > > 
> > > > We have seen a specific example where a node's 4X link comes up at 1X.
> > > 
> > > I think that the way to do it, is to make it possible to force endnode link to
> > > a specific rate. You can already do this with a simple script
> > > from userspace, by testing the link rate once it comes up,
> > > and downing the link if it's lower than what you want.
> > > 
> > > If you think it's important, it's also quite trivial to
> > > make it possible to disable 1x support through sysfs interface.
> > > This way, the link will come up as 4x or not come up at all.
> > > Would that be useful?
> > 
> > Yes it would be useful.
> 
> OK, I'll work on a patch for OFED 1.2.
> 
> > Is this something I can do right now with OFED 1.1?
> 
> With OFED 1.1 (without patches) you can do what I wrote above:
> write a script that tests link width.
> Disable ipoib, or the device, if it is 1x:
> 
> For example
> 
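> #!/bin/bash
> # Wait until a port on mthca0 reports ACTIVE.
> until
> 	grep -q ACTIVE /sys/class/infiniband/mthca0/ports/*/state
> do
> 	sleep 1
> done
> 
> # The width may be reported in upper case (e.g. "1X"), so ignore case.
> if grep -qi 1x /sys/class/infiniband/mthca0/ports/1/rate
> then
> 	rmmod ib_mthca
> fi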
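
The width test above can also be written as a numeric comparison, so that the same check works for a minimum other than 4x. The following is only a sketch: it assumes the rate file holds a string such as "10 Gb/sec (4X)", and the helper names link_width and width_ok are made up for illustration.

```shell
# Sketch only. Assumes the rate string ends in a "(NX)" width suffix,
# e.g. "10 Gb/sec (4X)"; link_width and width_ok are hypothetical helpers.

# Print the width multiplier from a rate string, e.g. 4 for "10 Gb/sec (4X)".
# Returns non-zero if the string carries no "(NX)" suffix.
link_width() {
    width=$(printf '%s\n' "$1" | sed -n 's/.*(\([0-9][0-9]*\)[Xx]).*/\1/p')
    [ -n "$width" ] && printf '%s\n' "$width"
}

# Succeed only if the reported width is at least the wanted minimum.
width_ok() {
    w=$(link_width "$1") || return 1
    [ "$w" -ge "$2" ]
}

# Possible use once the port is ACTIVE (paths as in the script above):
#   rate=$(cat /sys/class/infiniband/mthca0/ports/1/rate)
#   width_ok "$rate" 4 || rmmod ib_mthca
```

An unparsable rate string is treated as a failure here, so the device would be taken down rather than left running at an unknown width. Whether that is the right default is a site policy choice.
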
> 
> > > 
> > > 
> > > > In this
> > > > case we would want the join to fail.  Basically a single hardware error,
> > > > isolated to 1 node, should not affect the other 1150 nodes,
> > > 
> > > As far as I know, there are *a lot* of reasons where a problem at
> > > 1 node will affect others on the same subnet. Do I have to give examples?
> > > I don't see why we have to choose a specific instance (incorrect
> > > link rate at endnode) and handle it differently.
> > > 
> > > > which could very well be running a users job.
> > > 
> > > The job will continue running though, and when you diagnose the problem
> > > and disconnect the bad node, rate will be back to high.
> > > So what's the problem?
> > 
> > Performance impact between the time it happens and diagnosing the problem.
> > Yes, disabling the node is a better solution, however, the current behavior is
> > not bad for us.
> 
> Hal, here we have a use case that I think shows that the right thing
> is by default to make joins succeed. Convinced?

Didn't Ira say that "the current behavior is not bad for us" ? The
current behavior is default 4x SDR rate which makes slower joins fail.

Are you saying change the default rate to 1x SDR ? I've been concerned
about masking performance issues when doing this as we've discussed
several times before.

-- Hal

> > > 
> > > > 
> > > > Certainly if there is a heterogeneous network we would want different behavior
> > > > but we don't operate any of our clusters like that.  After reading today's posts
> > > > I think it should be an option.
> > > 
> > > Yes. I think the option belongs at the endnodes, as outlined above.
> > 
> > Yes that would be a good solution as well.
> > 
> > > 
> > > > If someone has a mixture they can configure
> > > > it.  I am not sure what the default should be though.  I know we would want
> > > > the join to fail, but I understand the argument to allow it to work.
> > > 
> > > This likely means that you have a sideband interconnect infrastructure
> > > beside IPoIB. Otherwise, if the join fails, you don't even have a
> > > way to debug the problem.
> > > 
> > 
> > Yes we do have this.  Like I said I could see where this would be beneficial to
> > some users.
> 



