[ofa-general] Re: multicast join failed for...

Michael S. Tsirkin mst at dev.mellanox.co.il
Thu Apr 12 21:17:38 PDT 2007


> Quoting Ira Weiny <weiny2 at llnl.gov>:
> Subject: Re: [ofa-general] Re: multicast join failed for...
> 
> On Thu, 12 Apr 2007 20:16:32 +0300
> "Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote:
> 
> > > Quoting Ira Weiny <weiny2 at llnl.gov>:
> > > Subject: Re: [ofa-general] Re: multicast join failed for...
> > > 
> > > On Thu, 12 Apr 2007 07:21:55 +0300
> > > "Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote:
> > > 
> > > > > Quoting Ira Weiny <weiny2 at llnl.gov>:
> > > > > Subject: Re: [ofa-general] Re: multicast join failed for...
> > > > > 
> > > > > On 11 Apr 2007 17:45:54 -0400
> > > > > Hal Rosenstock <halr at voltaire.com> wrote:
> > > > > 
> > > > > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote:
> > > > > > 
> > > > > > >  - previously we had some client failing join
> > > > > > > which is worse.
> > > > > > 
> > > > > > Maybe not. Maybe that's what the admin wants (to keep the higher rate
> > > > > > rather than degrade the group due to some link issue).
> > > > > > 
> > > > > 
> > > > > Indeed, on a big cluster it would be better to have a few nodes drop out
> > > > > than to limit the speed of the entire cluster.
> > > > 
> > > > Why are you joining these nodes then?
> > > > Anyway, it could always be an option.
> > > > 
> > > 
> > > We have seen a specific example where a node's 4X link comes up at 1X.
> > 
> > I think the way to do it is to make it possible to force an endnode link
> > to a specific rate. You can already do this with a simple script from
> > userspace: test the link rate once the link comes up, and down the link
> > if it's lower than what you want.
> > 
> > If you think it's important, it's also quite trivial to
> > make it possible to disable 1x support through a sysfs interface.
> > This way, the link will come up as 4x or not come up at all.
> > Would that be useful?
> 
> Yes it would be useful.

OK, I'll work on a patch for OFED 1.2.

> Is this something I can do right now with OFED 1.1?

With OFED 1.1 (without patches) you can do what I wrote above:
write a script that tests the link width, and disable ipoib,
or the device, if the link came up at 1x.

For example:

#!/bin/bash

# Wait until a port on mthca0 reports ACTIVE.
until grep -q ACTIVE /sys/class/infiniband/mthca0/ports/*/state
do
	sleep 1
done

# If port 1 trained at 1x width, unload the driver to take the link down.
if grep -qi 1x /sys/class/infiniband/mthca0/ports/1/rate
then
	rmmod ib_mthca
fi
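
A sketch of the same idea generalized to every port on every HCA, in
case you have more than one (the ib_mthca module name is an assumption
here; substitute whatever driver your HCA uses):

#!/bin/bash

# Wait until every port reports ACTIVE; before link training completes,
# the rate file may still show a default value.
for state in /sys/class/infiniband/*/ports/*/state
do
	until grep -q ACTIVE "$state"
	do
		sleep 1
	done
done

# Unload the driver if any port trained at 1x width.
for rate in /sys/class/infiniband/*/ports/*/rate
do
	if grep -qi 1x "$rate"
	then
		echo "$rate reports a 1x link, unloading driver" >&2
		rmmod ib_mthca
		break
	fi
done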

> > 
> > 
> > > In this
> > > case we would want the join to fail.  Basically a single hardware error,
> > > isolated to 1 node, should not affect the other 1150 nodes,
> > 
> > As far as I know, there are *a lot* of cases where a problem at
> > one node will affect others on the same subnet. Do I have to give examples?
> > I don't see why we have to choose a specific instance (an incorrect
> > link rate at an endnode) and handle it differently.
> > 
> > > which could very well be running a user's job.
> > 
> > The job will continue running though, and when you diagnose the problem
> > and disconnect the bad node, the rate will go back up.
> > So what's the problem?
> 
> The performance impact between the time it happens and the time the problem
> is diagnosed. Yes, disabling the node is a better solution; however, the
> current behavior is not bad for us.

Hal, here we have a use case that I think shows that the right thing
is to make joins succeed by default. Convinced?

> > 
> > > 
> > > Certainly, if there were a heterogeneous network we would want different
> > > behavior, but we don't operate any of our clusters like that.  After reading
> > > today's posts I think it should be an option.
> > 
> > Yes. I think the option belongs at the endnodes, as outlined above.
> 
> Yes that would be a good solution as well.
> 
> > 
> > > If someone has a mixture they can configure
> > > it.  I am not sure what the default should be though.  I know we would want
> > > the join to fail, but I understand the argument to allow it to work.
> > 
> > This likely means that you have a sideband interconnect infrastructure
> > besides IPoIB. Otherwise, if the join fails, you don't even have a
> > way to debug the problem.
> > 
> 
> Yes we do have this.  Like I said I could see where this would be beneficial to
> some users.


-- 
MST


