[ewg] OpenSM from ofed-1.2 and ofed-1.3 clients

Hal Rosenstock hrosenstock at xsigo.com
Wed Jun 4 15:03:05 PDT 2008


Hi Jim,

On Wed, 2008-06-04 at 15:49 -0600, Jim Schutt wrote:
> Hi Hal,
> 
> On Wed, 2008-06-04 at 14:40 -0600, Hal Rosenstock wrote:
> > Hi Jim,
> > 
> > On Wed, 2008-06-04 at 13:59 -0600, Jim Schutt wrote:
> > > Hi Hal,
> > >
> > > I've just discovered that what I thought was ofed-1.2 opensm
> > > is really ofed-1.2-rc2, if it matters.
> > 
> > I don't recall what the differences were but let's assume it's not
> > significant for now.
> 
> OK.
> 
> [snip]
> 
> > OK; this clearly shows there's only 1 SM at a time here.
> > 
> > Did this exact cluster work fine (with OFED 1.2 (rc2) OpenSM) when the
> > end nodes were OFED 1.2 rather than 1.3 and that was the only change ?
> 
> We ran for months with the 1.2-rc2 opensm and 
> 1.2.5 clients on RHEL4.
> 
> Then we updated the clients to RHEL5 and ofed-1.3
> a little while ago.
> 
> > 
> > Did the cluster size change too by any chance ? 
> 
> Nope.
> 
> > How large a cluster is
> > this ? (There were some fixes here which should help for OFED 1.3).
> 
> 128 nodes, 2 CPUs each.
> 
> > 
> > What opensm command line and config file options are being used to start
> > the OpenSMs ?
> 
> The 1.3 OpenSM is using "opensm -t 200 -g 0" with these config
> file options:
> 
> DEBUG=none
> LMC=0
> MAXSMPS=4
> REASSIGN_LIDS="no"
> SWEEP=10
> TIMEOUT=200
> OSM_LOG=/var/log/opensm.log
> VERBOSE="none"
> UPDN="off"
> GUID_FILE="none"
> GUID=0
> OSM_HOSTS=""
> OSM_CACHE_DIR=/var/cache/opensm
> CACHE_OPTIONS="none"
> HONORE_GUID2LID="none"
> RCP=/usr/bin/scp
> RSH=/usr/bin/ssh
> RESCAN_TIME=60
> PORT_NUM=1
> ONBOOT=no
> 
> while the 1.2-rc2 is using "opensm -maxsmps 0 -t 200 -g 0" with
> these config file options:
> 
> DEBUG=none
> LMC=0
> MAXSMPS=0
> REASSIGN_LIDS="no"
> SWEEP=10
> TIMEOUT=200
> OSM_LOG=/tmp/osm.log
> VERBOSE="none"
> UPDN="off"
> GUID_FILE="none"
> GUID=0
> OSM_HOSTS=""
> OSM_CACHE_DIR=/var/cache/osm
> CACHE_OPTIONS="none"
> HONORE_GUID2LID="none"
> RCP=/usr/bin/scp
> RSH=/usr/bin/ssh
> RESCAN_TIME=60
> PORT_NUM=1
> ONBOOT=no
> 
> So from your other email, the "-maxsmps 0" is problematic?

It can be in a large subnet. I'm not sure yours is large enough for that
to cause this, but you can dial it back to, say, 15 and see whether that
makes a difference.
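
To see exactly which settings differ between the two setups, the two
option lists quoted above can be compared mechanically. Below is a minimal
Python sketch, not anything shipped with OpenSM or OFED. The helper names
parse_conf and diff_conf are made up for illustration, the two strings are
copied from the options quoted earlier in this thread, and the parser
assumes plain KEY=VALUE lines with optional double quotes.

```python
def parse_conf(text):
    """Parse KEY=VALUE lines into a dict, dropping surrounding double quotes."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def diff_conf(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Options as quoted in this thread for the OFED 1.3 OpenSM.
OFED_1_3 = """
DEBUG=none
LMC=0
MAXSMPS=4
REASSIGN_LIDS="no"
SWEEP=10
TIMEOUT=200
OSM_LOG=/var/log/opensm.log
VERBOSE="none"
UPDN="off"
GUID_FILE="none"
GUID=0
OSM_HOSTS=""
OSM_CACHE_DIR=/var/cache/opensm
CACHE_OPTIONS="none"
HONORE_GUID2LID="none"
RCP=/usr/bin/scp
RSH=/usr/bin/ssh
RESCAN_TIME=60
PORT_NUM=1
ONBOOT=no
"""

# Options as quoted in this thread for the OFED 1.2-rc2 OpenSM.
OFED_1_2_RC2 = """
DEBUG=none
LMC=0
MAXSMPS=0
REASSIGN_LIDS="no"
SWEEP=10
TIMEOUT=200
OSM_LOG=/tmp/osm.log
VERBOSE="none"
UPDN="off"
GUID_FILE="none"
GUID=0
OSM_HOSTS=""
OSM_CACHE_DIR=/var/cache/osm
CACHE_OPTIONS="none"
HONORE_GUID2LID="none"
RCP=/usr/bin/scp
RSH=/usr/bin/ssh
RESCAN_TIME=60
PORT_NUM=1
ONBOOT=no
"""

differences = diff_conf(parse_conf(OFED_1_3), parse_conf(OFED_1_2_RC2))
for key, (new, old) in differences.items():
    print(f"{key}: 1.3={new!r}  1.2-rc2={old!r}")
```

With the options as quoted, the only keys that differ are MAXSMPS (4
versus 0), OSM_LOG, and OSM_CACHE_DIR. The sketch does not look at the
command lines, where the 1.2-rc2 side also passes "-maxsmps 0".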

> > Are you trying to stick with the OFED 1.2 (rc2) OpenSM or would the OFED
> > 1.3 OpenSM be OK if it worked in your environment ?
> 
> The 1.3 is OK.  We run diskless, and in the process of updating
> the clients it was easiest to leave the admin node, which was 
> running the OpenSM, alone at 1.2-rc2.  I can run the sm on one of
> the compute nodes until I have a chance to update the admin
> node as well.
> 
> It's just that it was unexpected to see the logs spammed with
> those errors, and it seemed like a good idea to report it.

OpenSM doesn't do a good job of filtering out repetitive errors like
this, but this type of error is unusual.

> Maybe there's some lurking issue that hasn't shown up any
> other way yet?

I'm hoping it's the maxsmps issue. Can you change it from unlimited
("-maxsmps 0") to a finite value and see?

> In any event, the OpenSM from 1.3 seems to be working
> just fine for us.

That's good to know. There were many improvements made for this (from
1.2).

-- Hal

> > Sorry for all the questions but I'm trying to come up with a theory on
> > what's not right.
> 
> No worries.  Thanks for checking into it.
> 
> -- Jim
> 
> > 
> > -- Hal
> 
> 
> 



