[ofa-general] Re: pkey.sim.tcl

Sasha Khapyorsky sashak at voltaire.com
Sat Jul 28 14:55:27 PDT 2007


Hi Eitan,

On 07:56 Fri 27 Jul     , Eitan Zahavi wrote:
> > 
> > On 09:26 Thu 26 Jul     , Eitan Zahavi wrote:
> > > 
> > > I am happy you actually use the simulator.
> > > Please provide more info regarding the failure. You should tar 
> > > compress the /tmp/ibmgtsim.XXXX of your run.
> > 
> > I can send this to you if you want, but the failure is trivial.
> No need if you already know where the bug is...
> > 
> > Yes, and it is due to (6), where the default Pkey is removed 
> > "externally". I'm not sure that OpenSM should handle the case 
> > when the pkey table is modified externally by something which is not the SM.
> > 
> 
> For a few years it just worked fine. So I wonder why this functionality
> was removed?
> It is a real BAD case where Pkeys are altered but I think it would be wise
> to "refresh" these tables on heavy sweep.

We discussed how and when the port tables refresh should be done just a
few days ago in this thread. My impression was that we are "in sync"
about this.

> In general it seems OpenSM has lost its "heavy sweep" concept. Now it
> does not refresh the fabric setup even on heavy sweep.

Not on each heavy sweep, but it does when it is needed or when data could
change. I don't think the concept was changed, just optimized. Let's just
look at the numbers:

$ time ./opensm/opensm -e -f ./osm.log -o
...
SUBNET UP
Exiting SM

real    0m7.995s
user    0m4.488s
sys     0m6.072s

$ time ./opensm/opensm -e -f ./osm.log -o --qos
...
SUBNET UP
Exiting SM

real    0m22.521s
user    0m10.921s
sys     0m17.173s


These are simulated runs (with ibsim); the fabric is ~1300 nodes.

The difference is the '--qos' flag: OpenSM skips the SL2VL and VLArb
update in the first run and does it in the second. Sweep times are 8
against 22 seconds.
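
To make the comparison explicit, here is a small sketch that works out the
gap from the "real" times printed above. The function and constant names
are made up for illustration; only the two timings come from the runs.

```c
#include <assert.h>

/* Wall-clock ("real") times copied from the two runs above, in seconds. */
static const double sweep_real_no_qos = 7.995;
static const double sweep_real_qos = 22.521;

/* Extra wall-clock time the second run needs compared to the first. */
static double sweep_extra_seconds(double base, double with_update)
{
	assert(base > 0.0);
	return with_update - base;
}

/* How many times longer the second run takes than the first. */
static double sweep_slowdown(double base, double with_update)
{
	assert(base > 0.0);
	return with_update / base;
}
```

With the numbers above this gives about 14.5 extra seconds, a slowdown of
roughly 2.8 times.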

> This is assuming a "perfect" HW and software and I really think we
> should have preserved that capability.

What about an option? Now with the subn->need_update flag (which always
enforces updates) it is trivial to implement.
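
A minimal sketch of how such an option could look. Only the name of the
subn->need_update flag comes from this thread; the option name
refresh_on_heavy_sweep, the struct layouts and the helper function are
hypothetical and are not OpenSM's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the real OpenSM structures. */
struct sketch_opt {
	int refresh_on_heavy_sweep;	/* hypothetical user option (0 or 1) */
};

struct sketch_subn {
	struct sketch_opt opt;
	int need_update;		/* flag named in this thread */
};

/*
 * Hypothetical hook run before a heavy sweep starts: when the option is
 * set, raise need_update so the sweep rewrites the tables unconditionally.
 * When the option is clear, need_update is left exactly as it was.
 */
static void sketch_heavy_sweep_prepare(struct sketch_subn *subn)
{
	assert(subn != NULL);
	if (subn->opt.refresh_on_heavy_sweep)
		subn->need_update = 1;
}
```

Keeping the option off by default would preserve the optimized behaviour
and make the full refresh opt-in.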

> Note that a "heavy sweep" does not happen unless something changed or
> trapped.

Yes, for example some port was connected/disconnected, some node
rebooted, etc. OpenSM starts a huge heavy sweep, it takes a while, SA
is not responsive most of the time, TCP connections over IPoIB time out,
applications fail. This is production experience... :(

Sasha
