[openib-general] RE: [PATCH] Opensm - exiting issues

Yael Kalka yael at mellanox.co.il
Tue Nov 8 04:56:19 PST 2005


Nothing:
swlab53:~ # ls -lasg /var/log/osm.log
4 -rw-r--r--  1 root 724 Nov  8 11:40 /var/log/osm.log
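
For anyone rerunning the reproduction loop quoted below, here is a minimal sketch of a wrapper that stops on its own when a run hangs instead of waiting forever. It assumes GNU coreutils `timeout` is available; the function name `repeat_until_stuck` and the 120-second limit are made up for illustration and are not part of opensm or the patch. If the process is stuck in an uninterruptible ioctl, `timeout` may not be able to reap it either.

```shell
#!/bin/sh
# Repeat a command until it fails or exceeds a time limit.
# Usage: repeat_until_stuck LIMIT_SECONDS COMMAND [ARGS...]
# Returns 124 when the command was timed out (timeout's convention),
# otherwise the command's own non-zero exit status.
# NOTE: repeat_until_stuck is a hypothetical helper, not an opensm tool.
repeat_until_stuck() {
    limit=$1
    shift
    runs=0
    while :; do
        # timeout sends SIGTERM once LIMIT_SECONDS have elapsed
        timeout "$limit" "$@"
        status=$?
        runs=$((runs + 1))
        if [ "$status" -ne 0 ]; then
            echo "run $runs: exit status $status" >&2
            return "$status"
        fi
    done
}

# Example invocation (not executed here; the limit is an assumption):
#   repeat_until_stuck 120 opensm -o -e
```

A status of 124 then marks a hung run, and any other non-zero status is opensm's own exit code.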

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com]
Sent: Tuesday, November 08, 2005 2:49 PM
To: yael at mellanox.co.il
Cc: openib-general at openib.org; eitan at mellanox.co.il
Subject: RE: [PATCH] Opensm - exiting issues


On Tue, 2005-11-08 at 07:02, Yael Kalka wrote:
> Hi Hal,
>
> The filesystem is not full, since I am using opensm with -e and with no verbosity.
>
> swlab53:~ # df -k /var/log/
> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/sda3              8262068   4705692   3136680  61% /

How large is the osm.log file (ls -lasg) when this occurs ?

-- Hal

>
> Yael
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Tuesday, November 08, 2005 1:53 PM
> To: yael at mellanox.co.il
> Cc: openib-general at openib.org; eitan at mellanox.co.il
> Subject: RE: [PATCH] Opensm - exiting issues
>
>
> Hi Yael,
>
> On Tue, 2005-11-08 at 05:12, Yael Kalka wrote:
> > Hi Hal,
> >
> > It seems that there is still another race somewhere.
> > The situation is much better. I had to run the testing for
> > ~45 minutes in order to see the problem.
>
> Is your filesystem full ? What is the file size of the log when you hit
> this ? Is this a max file size issue ?
>
> -- Hal
>
> > I ran on a loopback machine the following:
> > a) from port #2
> > % while test $? = 0; do opensm -o -e; done
> > b) from port #1
> > % while test 1 = 1; do osmtest -f f; done
> >
> > The process hangs. Listing the processes with ps -efww shows:
> > root     27939 27938  0 11:40 pts/0    00:00:00 [opensm] <defunct>
> > root     27938  8001  0 11:40 pts/0    00:00:00 usr/bin/opensm -o -e -g 0x2c902000017a2
> >
> > Machine description: SuSE Linux 9.3 (i586) 2.6.11.4-20a-smp
> >
> > lsmod reports the following:
> > Module                  Size  Used by
> > subfs                  12416  1
> > nvram                  13576  0
> > usbserial              34024  0
> > autofs4                23556  2
> > speedstep_lib           8324  0
> > freq_table              8832  0
> > thermal                18184  0
> > processor              28648  1 thermal
> > ipv6                  273920  20
> > fan                     8836  0
> > button                 11024  0
> > battery                14084  0
> > ac                      9220  0
> > edd                    14560  0
> > evdev                  12928  0
> > joydev                 13888  0
> > st                     43676  0
> > sr_mod                 21284  0
> > ib_ipoib               44804  0
> > ib_sa                  16652  1 ib_ipoib
> > ib_uverbs              37416  0
> > ib_umad                19376  2
> > af_packet              26760  4
> > sg                     42912  0
> > ib_mthca              119452  0
> > ib_mad                 41620  3 ib_sa,ib_umad,ib_mthca
> > ib_core                48000  6 ib_ipoib,ib_sa,ib_uverbs,ib_umad,ib_mthca,ib_mad
> > e1000                  91316  0
> > e100                   43392  0
> > mii                     9088  1 e100
> > i2c_i801               12556  0
> > i2c_core               26624  1 i2c_i801
> > uhci_hcd               37008  0
> > usbcore               121688  3 usbserial,uhci_hcd
> > parport_pc             44356  0
> > lp                     15396  0
> > parport                40392  2 parport_pc,lp
> > video1394              22860  0
> > ohci1394               37508  1 video1394
> > raw1394                34540  0
> > ieee1394              108472  3 video1394,ohci1394,raw1394
> > capability              7224  0
> > nls_iso8859_1           8064  1
> > nls_cp437               9728  1
> > vfat                   17792  1
> > fat                    43804  1 vfat
> > dm_mod                 64768  0
> > ext3                  145032  2
> > jbd                    73764  1 ext3
> > ide_cd                 44036  0
> > cdrom                  42784  2 sr_mod,ide_cd
> > ide_disk               22400  0
> > aic7xxx               200632  4
> > piix                   14468  0 [permanent]
> > ide_core              131904  3 ide_cd,ide_disk,piix
> > sd_mod                 23168  5
> > scsi_mod              136008  5 st,sr_mod,sg,aic7xxx,sd_mod
> >
> > Thanks,
> > Yael
> >
> >
> >
> > -----Original Message-----
> > From: Yael Kalka
> > Sent: Tuesday, November 08, 2005 8:38 AM
> > To: 'Hal Rosenstock'; Eitan Zahavi
> > Cc: Yael Kalka; openib-general at openib.org
> > Subject: RE: [PATCH] Opensm - exiting issues
> >
> >
> > Hi Hal,
> >
> > Just another comment - when running:
> > % while test $? = 0; do opensm -V -o; done
> > Try to run from a different port:
> > % osmtest -f f
> > This causes flooding of MADs to opensm, and that is usually
> > the cause of the exiting problem.
> >
> > Yael
> >
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > Sent: Monday, November 07, 2005 10:07 PM
> > To: Eitan Zahavi
> > Cc: Yael Kalka; openib-general at openib.org
> > Subject: RE: [PATCH] Opensm - exiting issues
> >
> >
> > On Mon, 2005-11-07 at 09:42, Eitan Zahavi wrote:
> > > Hi Hal,
> > >
> > > I will answer for Yael as she already left the office.
> > >
> > > The way to reproduce the "stuck" case is to run in bash:
> > > % while test $? = 0; do opensm -V -o; done
> > >
> > > The symptom we see is that OpenSM sort of exits but the process stays
> > > active (not even defunct). No way to kill it. It seems like one of the
> > > threads gets caught in the middle of ioctl or something. To be able to
> > > run OpenSM after this we need to reboot the machine.
> > >
> > > We avoid it by not issuing umad_unregister and umad_close_port
> >
> > This part of the patch is not needed with the fix to user_mad put in by
> > Roland based on the issue (and patch) from Michael on user_mad deadlock.
> >
> > I've been running your test for over 30 minutes now without a hiccup.
> > It used to fail pretty quickly.
> >
> > -- Hal
> >
> > >
> > > Eitan Zahavi
> > > Design Technology Director
> > > Mellanox Technologies LTD
> > > Tel:+972-4-9097208
> > > Fax:+972-4-9593245
> > > P.O. Box 586 Yokneam 20692 ISRAEL
> > >
> > >
> > > > -----Original Message-----
> > > > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > > > Sent: Monday, November 07, 2005 4:21 PM
> > > > To: yael at mellanox.co.il
> > > > Cc: openib-general at openib.org; eitan at mellanox.co.il
> > > > Subject: Re: [PATCH] Opensm - exiting issues
> > > >
> > > > Hi Yael,
> > > >
> > > > On Mon, 2005-11-07 at 08:25, Yael Kalka wrote:
> > > > > Hi Hal,
> > > > >
> > > > > There was a problem when running opensm with the -o option that
> > > > > caused opensm to always exit with a segfault, due to object
> > > > > destruction ordering. There is also the known issue with exiting
> > > > > opensm. We've done some cleanup of the exit code, and the
> > > > > following patch fixes most of it.
> > > >
> > > > I applied this part of the patch with some cosmetic changes in
> > > > osm_vendor_ibumad.c.
> > > >
> > > > > In the current code we saw that opensm sometimes gets "stuck" on
> > > > > exit and causes the machine to get stuck too, so that a reboot is
> > > > > needed. The following patch fixes most of it.
> > > > > With the patch we did run into rare cases where opensm exits with
> > > > > an error, but at least it exits without hanging the machine...
> > > >
> > > > Is there a reliable way to recreate the machine "stuck" condition?
> > > > What exactly do you mean by this?
> > > >
> > > > All umad_unregister does is some validation, a table lookup, and
> > > > issuing the ioctl to unregister the MAD agent. Not explicitly
> > > > unregistering the agent(s) does not cause any harm, since this
> > > > occurs as part of the cleanup when the fd is closed.
> > > >
> > > > -- Hal
> > >
>



