[ewg] OpenSM from ofed-1.2 and ofed-1.3 clients
Hal Rosenstock
hrosenstock at xsigo.com
Wed Jun 4 15:03:05 PDT 2008
Hi Jim,
On Wed, 2008-06-04 at 15:49 -0600, Jim Schutt wrote:
> Hi Hal,
>
> On Wed, 2008-06-04 at 14:40 -0600, Hal Rosenstock wrote:
> > Hi Jim,
> >
> > On Wed, 2008-06-04 at 13:59 -0600, Jim Schutt wrote:
> > > Hi Hal,
> > >
> > > I've just discovered that what I thought was ofed-1.2 opensm
> > > is really ofed-1.2-rc2, if it matters.
> >
> > I don't recall what the differences were but let's assume it's not
> > significant for now.
>
> OK.
>
> [snip]
>
> > OK; this clearly shows there's only 1 SM at a time here.
> >
> > Did this exact cluster work fine (with OFED 1.2 (rc2) OpenSM) when the
> > end nodes were OFED 1.2 rather than 1.3 and that was the only change ?
>
> We ran for months with the 1.2-rc2 opensm and
> 1.2.5 clients on RHEL4.
>
> Then we updated the clients to RHEL5 and ofed-1.3
> a little while ago.
>
> >
> > Did the cluster size change too by any chance ?
>
> Nope.
>
> > How large a cluster is
> > this ? (There were some fixes here which should help for OFED 1.3).
>
> 128 nodes, 2 CPUs each.
>
> >
> > What opensm command line and config file options are being used to start
> > the OpenSMs ?
>
> The 1.3 OpenSM is using "opensm -t 200 -g 0" with these config
> file options:
>
> DEBUG=none
> LMC=0
> MAXSMPS=4
> REASSIGN_LIDS="no"
> SWEEP=10
> TIMEOUT=200
> OSM_LOG=/var/log/opensm.log
> VERBOSE="none"
> UPDN="off"
> GUID_FILE="none"
> GUID=0
> OSM_HOSTS=""
> OSM_CACHE_DIR=/var/cache/opensm
> CACHE_OPTIONS="none"
> HONORE_GUID2LID="none"
> RCP=/usr/bin/scp
> RSH=/usr/bin/ssh
> RESCAN_TIME=60
> PORT_NUM=1
> ONBOOT=no
>
> while the 1.2-rc2 is using "opensm -maxsmps 0 -t 200 -g 0" with
> these config file options:
>
> DEBUG=none
> LMC=0
> MAXSMPS=0
> REASSIGN_LIDS="no"
> SWEEP=10
> TIMEOUT=200
> OSM_LOG=/tmp/osm.log
> VERBOSE="none"
> UPDN="off"
> GUID_FILE="none"
> GUID=0
> OSM_HOSTS=""
> OSM_CACHE_DIR=/var/cache/osm
> CACHE_OPTIONS="none"
> HONORE_GUID2LID="none"
> RCP=/usr/bin/scp
> RSH=/usr/bin/ssh
> RESCAN_TIME=60
> PORT_NUM=1
> ONBOOT=no
>
> So from your other email, the "-maxsmps 0" is problematic?
Only in a large subnet. I'm not sure yours is large enough for that to
cause this, but you can dial it back to, say, 15 and see if that makes a
difference.
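
For reference, the two option lists quoted above can be compared mechanically. The following is only a sketch: it assumes the options are plain KEY=value lines exactly as quoted, and the helper names (parse_options, diff_options) are made up for illustration, not part of OpenSM or OFED.

```python
# Option lists copied verbatim from the quoted message.
OFED_1_3 = """
DEBUG=none
LMC=0
MAXSMPS=4
REASSIGN_LIDS="no"
SWEEP=10
TIMEOUT=200
OSM_LOG=/var/log/opensm.log
VERBOSE="none"
UPDN="off"
GUID_FILE="none"
GUID=0
OSM_HOSTS=""
OSM_CACHE_DIR=/var/cache/opensm
CACHE_OPTIONS="none"
HONORE_GUID2LID="none"
RCP=/usr/bin/scp
RSH=/usr/bin/ssh
RESCAN_TIME=60
PORT_NUM=1
ONBOOT=no
"""

OFED_1_2_RC2 = """
DEBUG=none
LMC=0
MAXSMPS=0
REASSIGN_LIDS="no"
SWEEP=10
TIMEOUT=200
OSM_LOG=/tmp/osm.log
VERBOSE="none"
UPDN="off"
GUID_FILE="none"
GUID=0
OSM_HOSTS=""
OSM_CACHE_DIR=/var/cache/osm
CACHE_OPTIONS="none"
HONORE_GUID2LID="none"
RCP=/usr/bin/scp
RSH=/usr/bin/ssh
RESCAN_TIME=60
PORT_NUM=1
ONBOOT=no
"""


def parse_options(text):
    """Parse KEY=value lines into a dict, dropping surrounding double quotes."""
    options = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition("=")
        options[key] = value.strip('"')
    return options


def diff_options(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


differences = diff_options(parse_options(OFED_1_3), parse_options(OFED_1_2_RC2))
for key, (new, old) in differences.items():
    print(f"{key}: 1.3={new!r}  1.2-rc2={old!r}")
```

Apart from the log and cache paths, MAXSMPS is the only setting that differs between the two lists. Assuming only that value changes, the suggested test on the 1.2-rc2 side would presumably be started as "opensm -maxsmps 15 -t 200 -g 0".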
> > Are you trying to stick with the OFED 1.2 (rc2) OpenSM or would the OFED
> > 1.3 OpenSM be OK if it worked in your environment ?
>
> The 1.3 is OK. We run diskless, and in the process of updating
> the clients it was easiest to leave the admin node, which was
> running the OpenSM, alone at 1.2-rc2. I can run the sm on one of
> the compute nodes until I have a chance to update the admin
> node as well.
>
> It's just that it was unexpected to see the logs spammed with
> those errors, and it seemed like a good idea to report it.
OpenSM doesn't do a good job of filtering out repetitive errors like
this, but this type of error is unusual.
> Maybe there's some lurking issue that hasn't shown up any
> other way yet?
I'm hoping it's the maxsmps issue. Can you change it from infinite
("-maxsmps 0") to a finite value and see?
> In any event, the OpenSM from 1.3 seems to be working
> just fine for us.
That's good to know. Many improvements in this area have been made since
1.2.
-- Hal
> > Sorry for all the questions but I'm trying to come up with a theory on
> > what's not right.
>
> No worries. Thanks for checking into it.
>
> -- Jim
>
> >
> > -- Hal
>
>
>