[openib-general] RE: [PATCH] Opensm - exiting issues
Yael Kalka
yael at mellanox.co.il
Tue Nov 8 04:02:17 PST 2005
Hi Hal,
The filesystem is not full, since I am using opensm with -e and with no verbosity.
swlab53:~ # df -k /var/log/
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda3 8262068 4705692 3136680 61% /
Yael
-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com]
Sent: Tuesday, November 08, 2005 1:53 PM
To: yael at mellanox.co.il
Cc: openib-general at openib.org; eitan at mellanox.co.il
Subject: RE: [PATCH] Opensm - exiting issues
Hi Yael,
On Tue, 2005-11-08 at 05:12, Yael Kalka wrote:
> Hi Hal,
>
> It seems that there is still another race somewhere.
> The situation is much better. I had to run the testing for
> ~45 minutes in order to see the problem.
Is your filesystem full ? What is the file size of the log when you hit
this ? Is this a max file size issue ?
-- Hal
> I ran on a loopback machine the following:
> a) from port #2
> % while test $? = 0; do opensm -o -e; done
> b) from port #1
> % while test 1 = 1; do osmtest -f f; done
>
> The process hangs. Listing the processes with ps -efww shows:
> root 27939 27938 0 11:40 pts/0 00:00:00 [opensm] <defunct>
> root 27938 8001 0 11:40 pts/0 00:00:00 usr/bin/opensm -o -e -g 0x2c902000017a2
>
> Machine description: SuSE Linux 9.3 (i586) 2.6.11.4-20a-smp
>
> lsmod reports the following:
> Module Size Used by
> subfs 12416 1
> nvram 13576 0
> usbserial 34024 0
> autofs4 23556 2
> speedstep_lib 8324 0
> freq_table 8832 0
> thermal 18184 0
> processor 28648 1 thermal
> ipv6 273920 20
> fan 8836 0
> button 11024 0
> battery 14084 0
> ac 9220 0
> edd 14560 0
> evdev 12928 0
> joydev 13888 0
> st 43676 0
> sr_mod 21284 0
> ib_ipoib 44804 0
> ib_sa 16652 1 ib_ipoib
> ib_uverbs 37416 0
> ib_umad 19376 2
> af_packet 26760 4
> sg 42912 0
> ib_mthca 119452 0
> ib_mad 41620 3 ib_sa,ib_umad,ib_mthca
> ib_core 48000 6 ib_ipoib,ib_sa,ib_uverbs,ib_umad,ib_mthca,ib_mad
> e1000 91316 0
> e100 43392 0
> mii 9088 1 e100
> i2c_i801 12556 0
> i2c_core 26624 1 i2c_i801
> uhci_hcd 37008 0
> usbcore 121688 3 usbserial,uhci_hcd
> parport_pc 44356 0
> lp 15396 0
> parport 40392 2 parport_pc,lp
> video1394 22860 0
> ohci1394 37508 1 video1394
> raw1394 34540 0
> ieee1394 108472 3 video1394,ohci1394,raw1394
> capability 7224 0
> nls_iso8859_1 8064 1
> nls_cp437 9728 1
> vfat 17792 1
> fat 43804 1 vfat
> dm_mod 64768 0
> ext3 145032 2
> jbd 73764 1 ext3
> ide_cd 44036 0
> cdrom 42784 2 sr_mod,ide_cd
> ide_disk 22400 0
> aic7xxx 200632 4
> piix 14468 0 [permanent]
> ide_core 131904 3 ide_cd,ide_disk,piix
> sd_mod 23168 5
> scsi_mod 136008 5 st,sr_mod,sg,aic7xxx,sd_mod
>
> Thanks,
> Yael
>
>
>
> -----Original Message-----
> From: Yael Kalka
> Sent: Tuesday, November 08, 2005 8:38 AM
> To: 'Hal Rosenstock'; Eitan Zahavi
> Cc: Yael Kalka; openib-general at openib.org
> Subject: RE: [PATCH] Opensm - exiting issues
>
>
> Hi Hal,
>
> Just another comment - when running:
> % while test $? = 0; do opensm -V -o; done
> Try to run from a different port:
> % osmtest -f f
> This causes flooding of MADs to the opensm, and that usually is
> the cause for the exiting problem.
>
> Yael
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Monday, November 07, 2005 10:07 PM
> To: Eitan Zahavi
> Cc: Yael Kalka; openib-general at openib.org
> Subject: RE: [PATCH] Opensm - exiting issues
>
>
> On Mon, 2005-11-07 at 09:42, Eitan Zahavi wrote:
> > Hi Hal,
> >
> > I will answer for Yael as she already left the office.
> >
> > The way to reproduce the "stuck" case is to run in bash:
> > % while test $? = 0; do opensm -V -o; done
> >
> > The symptom we see is that OpenSM sort of exits but the process stays
> > active (not even defunct). There is no way to kill it. It seems like one
> > of the threads gets caught in the middle of an ioctl or something. To be
> > able to run OpenSM after this we need to reboot the machine.
> >
> > We avoid it by not issuing umad_unregister and umad_close_port
>
> This part of the patch is not needed with the fix to user_mad put in by
> Roland based on the issue (and patch) from Michael on user_mad deadlock.
>
> I've been running your test for over 30 minutes now without a hiccup.
> It used to fail pretty quickly.
>
> -- Hal
>
> >
> > Eitan Zahavi
> > Design Technology Director
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> >
> >
> > > -----Original Message-----
> > > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > > Sent: Monday, November 07, 2005 4:21 PM
> > > To: yael at mellanox.co.il
> > > Cc: openib-general at openib.org; eitan at mellanox.co.il
> > > Subject: Re: [PATCH] Opensm - exiting issues
> > >
> > > Hi Yael,
> > >
> > > On Mon, 2005-11-07 at 08:25, Yael Kalka wrote:
> > > > Hi Hal,
> > > >
> > > > > There was a problem when running opensm with the -o option that
> > > > > caused opensm to always exit with a segfault, due to object
> > > > > destruction ordering. There is also the known issue of exiting
> > > > > opensm. We've done some cleanup of the exiting code. The following
> > > > > patch fixes most of it.
> > >
> > > I applied this part of the patch with some cosmetic changes in
> > > osm_vendor_ibumad.c.
> > >
> > > > > In the current code we saw that sometimes opensm gets "stuck" on
> > > > > exit and causes the machine to get stuck too, resulting in the need
> > > > > for a reboot. The following patch fixes most of it.
> > > > > We did run (with the patch) into rare cases where opensm exits with
> > > > > an error, but at least it exits without hanging the machine...
> > >
> > > > Is there a reliable way to recreate machine "stuck" ? What exactly do
> > > > you mean by this ?
> > >
> > > > All umad_unregister does is some validation, a table lookup, and issue
> > > > the ioctl to unregister the MAD agent. Not explicitly unregistering the
> > > > agent(s) does not cause any harm as when the fd is closed, this will
> > > > occur as part of the cleanup.
> > >
> > > -- Hal
> >