[openib-general] RE: [PATCH] Opensm - exiting issues

Eitan Zahavi eitan at mellanox.co.il
Mon Nov 7 08:28:04 PST 2005


We added it temporarily and removed it due to these problems.
Sorry for the misleading information regarding the close_port.

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Monday, November 07, 2005 4:55 PM
> To: Eitan Zahavi
> Cc: Yael Kalka; openib-general at openib.org
> Subject: RE: [PATCH] Opensm - exiting issues
> 
> On Mon, 2005-11-07 at 09:42, Eitan Zahavi wrote:
> > Hi Hal,
> >
> > I will answer for Yael as she already left the office.
> >
> > The way to reproduce the "stuck" case is to run in bash:
> > % while test $? = 0; do opensm -V -o; done
> >
> > The symptom we see is that OpenSM sort of exits but the process stays
> > active (not even defunct). No way to kill it. It seems like one of the
> > threads gets caught in the middle of an ioctl or something. To be able to
> > run OpenSM after this we need to reboot the machine.
> >
> > We avoid it by not issuing umad_unregister and umad_close_port
> 
> I saw the change to not call umad_unregister in the patch. Where is the
> change for umad_close_port ?
> 
> -- Hal
> 
> > Eitan Zahavi
> > Design Technology Director
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> >
> >
> > > -----Original Message-----
> > > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > > Sent: Monday, November 07, 2005 4:21 PM
> > > To: yael at mellanox.co.il
> > > Cc: openib-general at openib.org; eitan at mellanox.co.il
> > > Subject: Re: [PATCH] Opensm - exiting issues
> > >
> > > Hi Yael,
> > >
> > > On Mon, 2005-11-07 at 08:25, Yael Kalka wrote:
> > > > Hi Hal,
> > > >
> > > > There was a problem when running opensm with the -o option that
> > > > caused opensm to always exit with a segfault, due to object
> > > > destruction ordering. Also, there is the known issue of exiting
> > > > opensm. We've done some cleanup of the exiting code. The following
> > > > patch fixes most of it.
> > >
> > > I applied this part of the patch with some cosmetic changes in
> > > osm_vendor_ibumad.c.
> > >
> > > > In the current code we saw that sometimes opensm gets "stuck" on
> > > > exit, and causes the machine to get stuck too, resulting in the need
> > > > for a reboot. The following patch fixes most of it.
> > > > With the patch we did run into rare cases where opensm exits with an
> > > > error, but at least it exits without hanging the machine...
> > >
> > > Is there a reliable way to reproduce the machine getting "stuck" ?
> > > What exactly do you mean by this ?
> > >
> > > All umad_unregister does is some validation, a table lookup, and
> > > issuing the ioctl to unregister the MAD agent. Not explicitly
> > > unregistering the agent(s) does not cause any harm, as when the fd is
> > > closed this will occur as part of the cleanup.
> > >
> > > -- Hal
> >
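
The bash one-liner quoted above reruns opensm until it exits non-zero. A
bounded variant of that loop might look like the sketch below. It is only an
illustration: repro_loop is a hypothetical helper name, not part of OpenSM,
and the iteration cap is an assumption added here. The loop cannot detect the
hang itself; if the process gets stuck as described, the loop simply blocks on
that iteration.

```shell
#!/bin/sh
# repro_loop MAX CMD [ARGS...]
# Hypothetical helper (not part of OpenSM): run CMD repeatedly until it
# exits non-zero or MAX iterations have completed.
repro_loop() {
    max=$1
    shift
    i=0
    while [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        # Stop at the first non-zero exit status, like the original
        # "while test $? = 0" loop does.
        if ! "$@"; then
            echo "iteration $i: command exited non-zero"
            return 1
        fi
    done
    echo "completed $max iterations without failure"
    return 0
}
```

On a test machine this could be invoked as "repro_loop 100 opensm -V -o",
using the same flags as in the thread. According to the report above, a
reproduced hang may require a reboot, so it should not be run on a host that
matters.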


