[openib-general] got ipoib up once but not twice :-)

shaharf shaharf at voltaire.com
Mon Jan 17 09:23:14 PST 2005


Roland,
	Can you please run ibnetdiscover on one of the openib gen2 hosts
and send me the output file?
	Please verify that there are no errors during the ibnetdiscover run.
Without errors, I believe ibnetdiscover will scan the net within
less than 30 sec (maybe even 10 sec). If there are errors, it means that
you have bad links. Gazelle systems are known to have such problems. If
you encounter such errors (timeouts or others), please send me the error
text too, and I will try to guide you in finding the bad links.
OpenSM may be very sensitive to such bad links, and they may
significantly delay its operation.

	Shahar

> -----Original Message-----
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of Hal Rosenstock
> Sent: Monday, January 17, 2005 5:27 PM
> To: Matt L. Leininger
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] got ipoib up once but not twice :-)
> 
> On Sat, 2005-01-15 at 18:18, Matt Leininger wrote:
> > On Sat, 2005-01-15 at 07:27 -0500, Hal Rosenstock wrote:
> > > On Fri, 2005-01-14 at 23:10, Ronald G. Minnich wrote:
> > > > Hmm, it's back. I guess I was not patient enough. Not sure when it
> > > > all got back. I will have to time it next time, I assume it won't
> > > > take 6 hours each time :-)
> > > >
> > > > I'm working on making this 256-node cluster work over infiniband
> > > > only, same as our myrinet clusters which are myrinet-only.
> > >
> > > How many 96 port switches ? I'd be curious how long it does take to
> > > initialize this (as I do not have access to a large cluster). Also,
> > > right now I'm pretty sure things are being done without pipelining on
> > > so it is likely slower. More on this later.
> > >
> >
> >   Ron has 9 96 port switches, 3 in the spine and 6 leaf switches all
> > based on the InfiniScale II switch ASIC, to make a 288 port fabric.
> > It's not quite a true fat-tree network since there are no spine bypass
> > cards for the older 96 port switch (needed on the leaf switches). This
> > is an interesting test case because Ron's network has a total of 612
> > switch chips.  A 1152 port fat-tree fabric based on InfiniScale III
> > would have 240 switch chips.
> 
> That's a big cluster :-) There are several things that can be done which
> should improve this but I would be curious as to how long the
> initialization really does take.
> 
> 1. Currently OpenSM is compiled with debug and no optimization. This
> should be changed to at least -O2 (and perhaps -O4) but I would start
> with -O2. This results in a 2x speedup for some code paths.
> 
> 2. OpenSM supports a pipelining mode for SMPs. The default is 1
> outstanding SMP. -maxsmps <#> indicates the number of outstanding SMPs
> allowed and should speed up the initialization. Useful values of this
> are 16 and 32.
> 
> Beyond this, there may be some issue with a link which is causing
> timeout and retries to kick in. The OSM log should have some messages in
> there indicating this.
> 
> I will add this info to the management README.
> 
> -- Hal
> 
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
> 
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
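
A rough way to see why Hal's -maxsmps suggestion (point 2 above) should
help: if the SM has to complete N SMP round trips and only W of them may
be outstanding at once, the sweep takes about ceil(N / W) round-trip
times. The sketch below is a toy model, not OpenSM code. The function
name, the SMP count and the 1 ms round-trip time are made-up illustration
values, not measurements, and the model ignores timeouts and retries,
which on a fabric with bad links can dominate everything else.

```python
import math


def sweep_time(num_smps, rtt_s, max_outstanding=1):
    """Estimate seconds needed to complete num_smps request/response pairs.

    Toy model: requests are sent in batches of max_outstanding, and each
    batch costs one round-trip time (rtt_s). No timeouts, no retries.
    """
    if max_outstanding < 1:
        raise ValueError("max_outstanding must be at least 1")
    rounds = math.ceil(num_smps / max_outstanding)
    return rounds * rtt_s


if __name__ == "__main__":
    # Hypothetical workload, chosen only to show the scaling.
    NUM_SMPS = 6400   # assumed number of SMPs in one sweep
    RTT_S = 0.001     # assumed 1 ms per SMP round trip

    # 1 is the default Hal describes; 16 and 32 are his suggested values.
    for window in (1, 16, 32):
        estimate = sweep_time(NUM_SMPS, RTT_S, window)
        print(f"outstanding={window:2d}  estimated sweep: {estimate:.2f} s")
```

In this model the estimate scales as 1/W, so it says nothing useful once
retries start: a single SMP that waits out several timeouts can cost more
than the whole error-free sweep, which is why checking for bad links comes
first.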


