[openib-general] RE: [PATCH] Opensm - exiting issues

Eitan Zahavi eitan at mellanox.co.il
Mon Nov 7 06:42:29 PST 2005


Hi Hal,

I will answer for Yael as she already left the office.

The way to reproduce the "stuck" case is to run in bash:
% while test $? = 0; do opensm -V -o; done

The symptom we see is that OpenSM sort of exits but the process stays
active (not even defunct). No way to kill it. It seems like one of the
threads gets caught in the middle of ioctl or something. To be able to
run OpenSM after this we need to reboot the machine.
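
Since the loop above only stops on a non-zero exit and blocks forever on
a hung run, a bounded variant may be easier to leave unattended. The
following is only a sketch: repro_loop, the run count and the deadline
are hypothetical names and values, not part of OpenSM. It polls the
child rather than blocking in wait, on the assumption that a process
which cannot be killed would otherwise hang the script too.

```shell
# Hypothetical helper: run a command repeatedly, stopping at the first
# failure or the first run that is still alive after a deadline.
# usage: repro_loop MAX_RUNS DEADLINE_SECONDS COMMAND [ARGS...]
repro_loop() {
    local max_runs deadline i pid waited status
    max_runs=$1
    deadline=$2
    shift 2
    i=1
    while [ "$i" -le "$max_runs" ]; do
        "$@" &
        pid=$!
        waited=0
        # Poll with kill -0 instead of blocking in wait, so that an
        # unkillable child does not block this loop as well.
        while kill -0 "$pid" 2>/dev/null; do
            if [ "$waited" -ge "$deadline" ]; then
                echo "run $i: pid $pid still alive after ${deadline}s"
                return 2
            fi
            sleep 1
            waited=$((waited + 1))
        done
        # The child is gone; collect its exit status.
        wait "$pid"
        status=$?
        if [ "$status" -ne 0 ]; then
            echo "run $i: exit status $status"
            return 1
        fi
        i=$((i + 1))
    done
    echo "completed $max_runs runs"
    return 0
}
```

For the case described here it would be called as, for example,
"repro_loop 100 60 opensm -V -o", where 100 and 60 are arbitrary picks.
A return value of 2 would correspond to the "stuck" symptom, 1 to an
exit with error.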

We avoid it by not calling umad_unregister and umad_close_port.

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Monday, November 07, 2005 4:21 PM
> To: yael at mellanox.co.il
> Cc: openib-general at openib.org; eitan at mellanox.co.il
> Subject: Re: [PATCH] Opensm - exiting issues
> 
> Hi Yael,
> 
> On Mon, 2005-11-07 at 08:25, Yael Kalka wrote:
> > Hi Hal,
> >
> > There was a problem when running opensm with -o option, that caused
> > the opensm to always exit with segfault, due to object destruction
> > ordering. Also - there is the known issue of exiting opensm. We've
> > done some clearing to the exiting code. The following patch fixes
> > most of it.
> 
> I applied this part of the patch with some cosmetic changes in
> osm_vendor_ibumad.c.
> 
> > In the current code we saw that sometimes opensm gets "stuck" on
> > exit, and causes the machine to get stuck too - resulting in need for
> > rebooting. The following patch fixes most of it.
> > We did run (in the patch) into rare cases where opensm exits with an
> > error, but at least it exits without stucking the machine...
> 
> Is there a reliable way to recreate machine "stuck" ? What exactly do
> you mean by this ?
> 
> All umad_unregister does is some validation, a table lookup, and issue
> the ioctl to unregister the MAD agent. Not explicitly unregistering the
> agent(s) does not cause any harm as when the fd is closed, this will
> occur as part of the cleanup.
> 
> -- Hal



