[openib-general] another opensm crash

Hal Rosenstock halr at voltaire.com
Sun Nov 20 05:05:01 PST 2005


On Sun, 2005-11-20 at 04:52, Eitan Zahavi wrote:
> Hi Hal,
> 
> > >
> > > Try to move aside your /lib/tls directory and see if you still get these
> > > crashes.
> > > We have issues with TLS pthread and glibc
> > 
> > There are still strange crashes like this which appear to be memory
> > scribbling issues.
> [EZ] OK we need to trace those.

The problem will be recreating it now :-( Crashes of this type were
numerous, and they varied in where the scribbling occurred and in how
OpenSM crashed.

-- Hal

>  But TLS has some bugs too.
> We had cases where we could see cond wait events not being picked up.
> > 
> > Moving tls aside changes the threads into processes. Does that indicate
> > that threading issues are suspected?
> [EZ] In old Pthread the threads seem like processes and in TLS they do
> not. This is not the issue. I suspect that in gen1 we see the cond wait
> issue more frequently as the vendor uses cl_timer more often (which uses
> cond wait ...)
> > 
> > -- Hal
> > 
> > >
> > > Eitan Zahavi
> > > Design Technology Director
> > > Mellanox Technologies LTD
> > > Tel:+972-4-9097208
> > > Fax:+972-4-9593245
> > > P.O. Box 586 Yokneam 20692 ISRAEL
> > >
> > >
> > > > -----Original Message-----
> > > > From: Troy Benjegerdes [mailto:troy at scl.ameslab.gov]
> > > > Sent: Monday, November 14, 2005 8:09 PM
> > > > To: openib-general at openib.org
> > > > Subject: [openib-general] another opensm crash
> > > >
> > > > (gdb) bt
> > > > #0  0x08071ff3 in osm_si_rcv_process (p_rcv=0x8090138, p_madw=0x80a1de0)
> > > >     at osm_sw_info_rcv.c:679
> > > > #1  0xb7fb0213 in __cl_disp_worker (context=0x8090da4)
> > > >     at cl_dispatcher.c:108
> > > > #2  0xb7fb8557 in __cl_thread_pool_routine (context=0x8090de4)
> > > >     at cl_threadpool.c:78
> > > > #3  0xb7fb834d in __cl_thread_wrapper (arg=0x8091408)
> > > >     at cl_thread.c:61
> > > > #4  0x46cde341 in start_thread () from /lib/tls/libpthread.so.0
> > > > #5  0x46b6e6fe in clone () from /lib/tls/libc.so.6
> > > >
> > > > _______________________________________________
> > > > openib-general mailing list
> > > > openib-general at openib.org
> > > > http://openib.org/mailman/listinfo/openib-general
> > > >
> > > > To unsubscribe, please visit
> > > > http://openib.org/mailman/listinfo/openib-general
> 
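For anyone trying to trace the "cond wait events not being picked up" symptom mentioned in this thread, it may help to first rule out an application-level lost wakeup before blaming the threading library. The sketch below is only an illustration of the usual predicate-guarded condition wait with plain POSIX threads. It does not reproduce the suspected TLS/glibc problem, and demo_event_t and the demo_* functions are hypothetical names, not the complib cl_event or cl_timer code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical auto-reset event; NOT the complib implementation. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            signaled;   /* predicate, protected by lock */
} demo_event_t;

static void demo_event_init(demo_event_t *ev)
{
    pthread_mutex_init(&ev->lock, NULL);
    pthread_cond_init(&ev->cond, NULL);
    ev->signaled = false;
}

/* Set the predicate under the lock, then wake one waiter. */
static void demo_event_signal(demo_event_t *ev)
{
    pthread_mutex_lock(&ev->lock);
    ev->signaled = true;
    pthread_cond_signal(&ev->cond);
    pthread_mutex_unlock(&ev->lock);
}

/*
 * Wait on the predicate, not on the wakeup itself. If the signal
 * arrived before we got here, the loop body is skipped; a spurious
 * wakeup simply re-checks the flag and waits again.
 */
static void demo_event_wait(demo_event_t *ev)
{
    pthread_mutex_lock(&ev->lock);
    while (!ev->signaled)
        pthread_cond_wait(&ev->cond, &ev->lock);
    ev->signaled = false;       /* auto-reset for the next wait */
    pthread_mutex_unlock(&ev->lock);
}

static void *demo_signaler(void *arg)
{
    demo_event_signal((demo_event_t *)arg);
    return NULL;
}

/* Returns 0 once the waiter has observed the signal, -1 on setup failure. */
static int demo_run(void)
{
    demo_event_t ev;
    pthread_t tid;

    demo_event_init(&ev);
    if (pthread_create(&tid, NULL, demo_signaler, &ev) != 0)
        return -1;
    demo_event_wait(&ev);       /* works whichever thread runs first */
    pthread_join(tid, NULL);
    assert(!ev.signaled);       /* the wait consumed the event */
    pthread_cond_destroy(&ev.cond);
    pthread_mutex_destroy(&ev.lock);
    return 0;
}
```

If waits written in this style still hang, the lost wakeup is less likely to be in the caller and more likely to be below it, which would be consistent with the suspicion voiced above, though it would not prove it.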



