[ofw] FW: Opensm or umad bug

Alex Naslednikov xalex at mellanox.co.il
Thu Apr 28 05:10:35 PDT 2011


Hello Sean,
We have the following code commented out in umad_receiver_stop:
        /* XXX hangs current thread - suspect umad_recv() ignoring wakeup.
        cl_thread_destroy(&p_ur->tid);
        */
How can one ensure that the umad_receiver thread will not run after osm_vendor_delete has been called?
Please see the e-mails below for more details.

XaleX

From: Hal Rosenstock
Sent: Wednesday, April 27, 2011 4:10 PM
To: Alex Naslednikov; OpenSM; Gilad Margalit; Uri Habusha
Subject: RE: Opensm bug

PSB [HNR]

From: Alex Naslednikov
Sent: Wednesday, April 27, 2011 9:08 AM
To: Hal Rosenstock; OpenSM; Gilad Margalit; Uri Habusha
Subject: RE: Opensm bug

Hal,
Thank you for the fast response
I see that the code of umad_receiver_stop is the same for 2.3.0 (trunk), so cl_thread_destroy will be called.
Can you please explain why umad_receiver_thread can continue running in this case?
[HNR] Huh ? Isn't that code commented out ?

From: Hal Rosenstock
Sent: Wednesday, April 27, 2011 3:56 PM
To: Alex Naslednikov; OpenSM; Gilad Margalit; Uri Habusha
Subject: RE: Opensm bug

PSB [HNR]

From: Alex Naslednikov
Sent: Wednesday, April 27, 2011 8:10 AM
To: OpenSM; Gilad Margalit; Uri Habusha
Subject: Opensm bug

Hi all,
Recently we got the following assert:

ASSERT happened:  &p_log->lock is not initialized
complibd!cl_spinlock_acquire+0x39 [s:\builds\7789\trunk\inc\user\complib\cl_spinlock_osd.h @ 107]
opensm!osm_log+0x1b8 [s:\builds\7789\trunk\ulp\opensm\user\opensm\osm_log.c @ 171]
opensm!osm_vendor_get+0x174 [s:\builds\7789\trunk\ulp\opensm\user\libvendor\osm_vendor_ibumad.c @ 1007]
opensm!osm_mad_pool_get+0xbb [s:\builds\7789\trunk\ulp\opensm\user\opensm\osm_mad_pool.c @ 95]
opensm!umad_receiver+0x3b4 [s:\builds\7789\trunk\ulp\opensm\user\libvendor\osm_vendor_ibumad.c @ 314]
complibd!cl_thread_callback+0x1a [s:\builds\7789\trunk\core\complib\user\cl_thread.c @ 49]

Has anybody seen this problem before?
[HNR] FWIW I haven't.

Could umad_receiver be trying to access opensm after the destroy process has already started?
[HNR] osm_vendor_delete is called prior to destroying the log (in osm_opensm_destroy), so I wouldn't think that should be the case. However, looking at a perhaps older version of osm_vendor_ibumad.c for Windows (MLNX OFED 2.1.3), I see:

static void umad_receiver_stop(umad_receiver_t * p_ur)
{
#ifdef HAVE_LIBPTHREADS
        pthread_cancel(p_ur->tid);
        pthread_join(p_ur->tid, NULL);
        p_ur->tid = 0;
#else
        /* XXX hangs current thread - suspect umad_recv() ignoring wakeup.
        cl_thread_destroy(&p_ur->tid);
        */
#endif

I don't know if that's still the case, but it looks to me like it could result in the umad_receiver thread still running after the log is destroyed :(
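
One common way to avoid both the hang and the stray thread is to have the receiver poll with a finite timeout and check a stop flag, so the owner can request shutdown and then join the thread before tearing down the log. The following is only a minimal sketch of that pattern, not the OpenSM code: receiver_t, fake_recv, receiver_start and receiver_stop are hypothetical names, pthreads is used purely for illustration, and it assumes the real receive call can be given a finite timeout.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for umad_receiver_t; not the OpenSM structure. */
typedef struct receiver {
    pthread_t   tid;
    atomic_bool stop;     /* set by the owner to request shutdown */
    atomic_int  timeouts; /* number of polls that timed out */
} receiver_t;

/* Hypothetical stand-in for a receive call with a timeout:
 * it sleeps for timeout_ms and reports that nothing arrived. */
static int fake_recv(int timeout_ms)
{
    struct timespec ts = {
        .tv_sec  = timeout_ms / 1000,
        .tv_nsec = (long)(timeout_ms % 1000) * 1000000L
    };
    nanosleep(&ts, NULL);
    return -1;
}

/* Receiver loop: re-checks the stop flag after every timed-out poll,
 * so it never blocks indefinitely inside the receive call. */
static void *receiver_thread(void *arg)
{
    receiver_t *r = arg;

    while (!atomic_load(&r->stop)) {
        if (fake_recv(10) < 0) {
            atomic_fetch_add(&r->timeouts, 1);
            continue;
        }
        /* A real receiver would dispatch the received MAD here. */
    }
    return NULL;
}

int receiver_start(receiver_t *r)
{
    assert(r != NULL);
    atomic_init(&r->stop, false);
    atomic_init(&r->timeouts, 0);
    return pthread_create(&r->tid, NULL, receiver_thread, r);
}

/* Request shutdown, then wait until the thread has really exited.
 * Only after this returns is it safe to destroy shared state. */
int receiver_stop(receiver_t *r)
{
    assert(r != NULL);
    atomic_store(&r->stop, true);
    return pthread_join(r->tid, NULL);
}
```

With something like this, the vendor delete path would call receiver_stop() first and destroy the log only after it returns, at the cost of up to one poll interval of shutdown latency.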

There are other unrelated problems in the Windows implementation of that file too (e.g. osm_vendor_set_sm is unimplemented, which is problematic for multi-SM operation).


-- Hal

Alexander (XaleX) Naslednikov
SW Networking Team
Mellanox Technologies
