[openib-general] RE: [PATCH] Opensm - exiting issues

Yael Kalka yael at mellanox.co.il
Mon Nov 7 22:37:44 PST 2005


Hi Hal,

Just another comment - when running:
% while test $? = 0; do opensm -V -o; done
Try to run from a different port:
% osmtest -f f 
This floods opensm with MADs, and that is usually
the cause of the exiting problem.

Yael
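
[Editor's sketch, not part of the original message.] The one-line loop above
reruns opensm for as long as the previous run exited with status 0. A bounded
variant that also reports which run failed could look like the following. The
function name run_until_failure and the iteration cap are hypothetical and
chosen only for illustration. The sketch does not detect the "stuck" case
described further down the thread, since a process that never returns simply
blocks the loop.

```shell
#!/bin/sh
# Rerun a command until it fails or a maximum number of runs is reached.
# Usage: run_until_failure MAX_RUNS COMMAND [ARGS...]
# Returns 0 if every run succeeded, otherwise the failing run's exit status.
# (run_until_failure is a hypothetical helper, not part of OpenSM.)
run_until_failure() {
    max_runs=$1
    shift
    i=0
    while [ "$i" -lt "$max_runs" ]; do
        i=$((i + 1))
        "$@"
        status=$?
        if [ "$status" -ne 0 ]; then
            echo "run $i failed with exit status $status"
            return "$status"
        fi
    done
    echo "all $max_runs runs succeeded"
    return 0
}

# Hypothetical usage, mirroring the loop in the message:
#   run_until_failure 1000 opensm -V -o
```

Capping the number of runs is a design choice for unattended testing; the
original loop runs until the first non-zero exit.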

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com]
Sent: Monday, November 07, 2005 10:07 PM
To: Eitan Zahavi
Cc: Yael Kalka; openib-general at openib.org
Subject: RE: [PATCH] Opensm - exiting issues


On Mon, 2005-11-07 at 09:42, Eitan Zahavi wrote:
> Hi Hal,
> 
> I will answer for Yael as she already left the office.
> 
> The way to reproduce the "stuck" case is to run in bash:
> % while test $? = 0; do opensm -V -o; done
> 
> The symptom we see is that OpenSM sort of exits but the process stays
> active (not even defunct). No way to kill it. It seems like one of the
> threads gets caught in the middle of ioctl or something. To be able to
> run OpenSM after this we need to reboot the machine.
> 
> We avoid it by not issuing umad_unregister and umad_close_port

This part of the patch is not needed with the fix to user_mad put in by
Roland based on the issue (and patch) from Michael on user_mad deadlock.

I've been running your test for over 30 minutes now without a hiccup.
It used to fail pretty quickly.

-- Hal

> 
> Eitan Zahavi
> Design Technology Director
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
> 
> 
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > Sent: Monday, November 07, 2005 4:21 PM
> > To: yael at mellanox.co.il
> > Cc: openib-general at openib.org; eitan at mellanox.co.il
> > Subject: Re: [PATCH] Opensm - exiting issues
> > 
> > Hi Yael,
> > 
> > On Mon, 2005-11-07 at 08:25, Yael Kalka wrote:
> > > Hi Hal,
> > >
> > > There was a problem when running opensm with the -o option that
> > > caused opensm to always exit with a segfault, due to object
> > > destruction ordering. Also, there is the known issue of exiting
> > > opensm. We've done some cleanup of the exit code. The following
> > > patch fixes most of it.
> > 
> > I applied this part of the patch with some cosmetic changes in
> > osm_vendor_ibumad.c.
> > 
> > > In the current code we saw that opensm sometimes gets "stuck" on
> > > exit and causes the machine to get stuck too, so that a reboot is
> > > needed. The following patch fixes most of it.
> > > With the patch applied we did run into rare cases where opensm
> > > exits with an error, but at least it exits without getting the
> > > machine stuck...
> > 
> > Is there a reliable way to recreate machine "stuck"? What exactly do
> > you mean by this?
> > 
> > All umad_unregister does is some validation, a table lookup, and
> > issuing the ioctl to unregister the MAD agent. Not explicitly
> > unregistering the agent(s) does not cause any harm, since this
> > happens as part of the cleanup when the fd is closed.
> > 
> > -- Hal
> 
