[openib-general] another opensm crash

Eitan Zahavi eitan at mellanox.co.il
Sun Nov 20 05:31:20 PST 2005


Hi Hal,

To reproduce the problems we see in large subnets, we have to revive the
simulator project. Yael will spend some time extending the packet dropper
test on the simulator, and I hope we will be able to reproduce these kinds
of bugs.

The limitation of the current test is that it runs only the standard sweep,
without any client doing path record queries, multicast, or traps in
parallel.

EZ

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Sunday, November 20, 2005 3:05 PM
> To: Eitan Zahavi
> Cc: Troy Benjegerdes; openib-general at openib.org
> Subject: RE: [openib-general] another opensm crash
> 
> On Sun, 2005-11-20 at 04:52, Eitan Zahavi wrote:
> > Hi Hal,
> >
> > > >
> > > > Try to move aside your /lib/tls directory and see if you still get
> > > > these crashes.
> > > > We have issues with TLS pthread and glibc
> > >
> > > There are still strange crashes like this which appear to be memory
> > > scribbling issues.
> > [EZ] OK we need to trace those.
> 
> The problem will be recreating it now :-( Crashes of this type were
> numerous, and varied in where the scribbling occurred and how OpenSM
> crashed.
> 
> -- Hal
> 
> > But TLS has some bugs too.
> > We had cases where we could see cond wait events not being picked up.
> > >
> > > Moving tls aside changes the threads into processes. Does that
> > > indicate that threading issues are suspected?
> > [EZ] In old Pthread the threads seem like processes and in TLS they do
> > not. This is not the issue. I suspect that in gen1 we see the cond wait
> > issue more frequently as the vendor uses cl_timer more often (which uses
> > cond wait ...)
> > >
> > > -- Hal
> > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Troy Benjegerdes [mailto:troy at scl.ameslab.gov]
> > > > > Sent: Monday, November 14, 2005 8:09 PM
> > > > > To: openib-general at openib.org
> > > > > Subject: [openib-general] another opensm crash
> > > > >
> > > > > (gdb) bt
> > > > > #0  0x08071ff3 in osm_si_rcv_process (p_rcv=0x8090138, p_madw=0x80a1de0)
> > > > >     at osm_sw_info_rcv.c:679
> > > > > #1  0xb7fb0213 in __cl_disp_worker (context=0x8090da4) at cl_dispatcher.c:108
> > > > > #2  0xb7fb8557 in __cl_thread_pool_routine (context=0x8090de4)
> > > > >     at cl_threadpool.c:78
> > > > > #3  0xb7fb834d in __cl_thread_wrapper (arg=0x8091408) at cl_thread.c:61
> > > > > #4  0x46cde341 in start_thread () from /lib/tls/libpthread.so.0
> > > > > #5  0x46b6e6fe in clone () from /lib/tls/libc.so.6
> >


