[ofa-general] RE: running opensm 3.0.3 on 4000+ node system

Hal Rosenstock hrosenstock at xsigo.com
Wed Apr 9 14:17:50 PDT 2008


On Wed, 2008-04-09 at 15:13 -0600, Maestas, Christopher Daniel wrote:
> I think we may have fixed it:
> ---
>  3998 pts/0    Sl     1:47 /usr/sbin/opensm -maxsmps 15 -t 200 -f /var/log/osm.log -g 0
> --
> 
> I changed maxsmps to 15 (from 0, which means unlimited) and it seems to be working now.
> That is the same value we use for the Cisco host-based SM.

Yes, an unlimited number of outstanding SMPs could overrun the VL15 buffers
in the switches, since VL15 is not flow controlled. This should probably be
noted somewhere in the documentation/man pages.
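
As a rough illustration of why capping outstanding SMPs helps, here is a
toy model. It is not OpenSM code. The function name, the buffer size, and
the drain rate are assumptions picked for the example, not values from any
real switch. A sender pushes requests into a fixed-size buffer that has no
flow control, so anything that does not fit is simply dropped.

```python
def simulate(max_outstanding, total_requests=1000, buffer_slots=16, drain_per_tick=2):
    """Return how many requests a toy, non-flow-controlled buffer drops.

    max_outstanding == 0 means "unlimited", mirroring -maxsmps 0.
    buffer_slots and drain_per_tick are illustrative assumptions
    (drain_per_tick must be > 0 or the loop never finishes).
    """
    queued = 0      # requests currently sitting in the buffer
    sent = 0        # requests the sender has put on the wire
    completed = 0   # requests the switch has processed
    dropped = 0     # requests lost because the buffer was full

    while completed + dropped < total_requests:
        remaining = total_requests - sent
        if max_outstanding == 0:
            burst = remaining  # no cap: send everything at once
        else:
            # Only top the window back up to max_outstanding.
            burst = min(remaining, max_outstanding - queued)
        sent += burst

        # No flow control: whatever does not fit is dropped.
        accepted = min(burst, buffer_slots - queued)
        dropped += burst - accepted
        queued += accepted

        # The switch processes a few requests per tick.
        done = min(queued, drain_per_tick)
        queued -= done
        completed += done

    return dropped


if __name__ == "__main__":
    print("unlimited:", simulate(0))
    print("capped at 15:", simulate(15))
```

In this model the uncapped sender loses everything beyond the first
buffer-full, while a cap that fits inside the buffer loses nothing. On a
real fabric a dropped SMP shows up as a timeout and a retry rather than a
counted drop, so the sketch only shows the direction of the effect.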

> ---
> Apr  9 14:43:17 HOST OpenSM[3998]: /var/log/osm.log log file opened
> Apr  9 14:43:17 HOST OpenSM[3998]: OpenSM Rev:openib-3.0.13
> Apr  9 14:43:17 HOST kernel: user_mad: process opensm did not enable P_Key index support.
> Apr  9 14:43:17 HOST kernel: user_mad:   Documentation/infiniband/user_mad.txt has info on the new ABI.
> Apr  9 14:43:30 HOST OpenSM[3998]: Entering MASTER state
> Apr  9 14:43:54 HOST OpenSM[3998]: SUBNET UP
> ---
> 
> The log file is not growing like crazy anymore ...

So it is the SM which caused this by mismatching peer port OpVLs.

-- Hal

> I did forget to mention that we are also running new Mellanox firmware on the HCAs and switches. It has been about 2 years since we last tested. :)

> I'm looking for how it was run previously, and I don't recall making this change before.  It could be due to all the other changes since then.  But now I know how to get it going, and my work is hopefully archived in this mailing list. ;)
> 
> Thanks,
> -cdm
> 
> -----Original Message-----
> From: Hal Rosenstock [mailto:hrosenstock at xsigo.com]
> Sent: Wednesday, April 09, 2008 2:36 PM
> To: Maestas, Christopher Daniel
> Cc: general at lists.openfabrics.org
> Subject: Re: running opensm 3.0.3 on 4000+ node system
> 
> On Wed, 2008-04-09 at 12:26 -0600, Maestas, Christopher Daniel wrote:
> 
> > I have ofed 1.1 and 1.2 drivers loaded on the system.  I've done this in the past using opensm 3.0.0 svn tag 10188 from ofed 1.0 clients and had no issues before.  Here's how opensm is running:
> 
> Which OpenSM was run before? Also, which kernel is being used, and what is meant by both OFED 1.1 and 1.2 drivers?
> 
> >  6079 pts/0    Sl     0:08 /usr/sbin/opensm -d 3 -maxsmps 0 -s 300 -t 1000 -f /var/log/osm.log -V -g 0
> > ---
> 
> Can you try without infinite SMPs? Is this how it was run before?
> 
> -- Hal
> 
> > -cdm
> >