[openib-general] OpenSM crash
Tom Duffy
tduffy at sun.com
Fri May 27 14:31:48 PDT 2005
On Fri, 2005-05-27 at 17:15 -0400, Hal Rosenstock wrote:
> On Fri, 2005-05-27 at 14:31, Tom Duffy wrote:
> > On Fri, 2005-05-27 at 11:27 -0700, Tom Duffy wrote:
> > > I just noticed that my opensm had segv'ed and dumped core.
> >
> > BTW, here was the tail of the osm.log:
> >
> > May 27 01:44:09 [43005960] -> osm_vendor_get: [
> > May 27 01:44:09 [43806960] -> __osm_vl15_poller: Servicing p_madw = 0x5678f0 (mad 0x5f33f0 req 1)
> > May 27 01:44:09 [43005960] -> osm_vendor_get: Acquiring UMAD for p_madw = 0x567908, size = 256.
> > May 27 01:44:09 [43005960] -> osm_vendor_get: Acquired UMAD 0x5f3640, size = 256.
> > May 27 01:44:09 [43005960] -> osm_vendor_get: ]
> > May 27 01:44:09 [43005960] -> osm_mad_pool_get: Acquired p_madw = 0x5678f0, p_mad = 0x5f3670, size = 256.
> > May 27 01:44:09 [43005960] -> osm_mad_pool_get: ]
> > May 27 01:44:09 [43005960] -> osm_req_get: Getting P_KeyTable (0x16), modifier = 0x10001, TID = 0x1c149.
> > May 27 01:44:09 [43005960] -> osm_vl15_post: [
> > May 27 01:44:09 [43005960] -> osm_vl15_post: Servicing p_madw = 0x5678f0 (mad 0x5f3670 req 1)
> > May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 MADs outstanding.
>                                                ^^^^^^^^^^
> This looks weird.
>
> > May 27 01:44:09 [43005960] -> osm_vl15_poll: [
> > May 27 01:44:09 [43005960] -> osm_vl15_poll: Signalling poller thread.
> > May 27 01:44:09 [43005960] -> osm_vl15_poll: ]
> > May 27 01:44:09 [43005960] -> osm_vl15_post: ]
> > May 27 01:44:09 [43005960] -> osm_req_get: ]
> > May 27 01:44:09 [43005960] -> osm_physp_has_pkey: ]
> > May 27 01:44:09 [43005960] -> __osm_pi_rcv_get_pkey_slvl_vla_tables: ]
> > May 27 01:44:09 [43005960] -> osm_pi_rcv_process: ]
> > May 27 01:44:09 [43005960] -> __osm_sm_mad_ctrl_disp_done_callback: [
>
> Wonder why __osm_sm_mad_ctrl_disp_done_callback wasn't on the stack
> shown in the previous email as this makes it look like it should be.
>
> Could you go back a little further in the log? I'd like to see what is
> before the start of __osm_pi_rcv_get_pkey_slvl_vla_tables and
> osm_pi_rcv_process.
The log had grown to almost 1G, so I actually deleted it. Shit, sorry.
> It also seems weird to me that there is no other
> log message between these two.
>
> From the stack trace:
> #3  osm_dump_dr_smp (p_log=0x552498, p_smp=0x0, log_level=32 ' ') at osm_helper.c:1446
> #4  0x000000000042eed1 in __osm_vl15_poller (p_ptr=0x552498) at osm_madw.h:575
>
> It looks like OpenSM was in osm_vl15intf.c::__osm_vl15_poller
>
>   if( p_madw != (osm_madw_t*)cl_qlist_end( p_fifo ) )
>   {
>     if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) )
>     {
>       osm_log( p_vl->p_log, OSM_LOG_DEBUG,
>                "__osm_vl15_poller: "
>                "Servicing p_madw = %p (mad %p req %d)\n",
>                p_madw, p_madw->p_mad, p_madw->resp_expected);
>     }
>
>     if( osm_log_is_active( p_vl->p_log, OSM_LOG_FRAMES ) )
>     {
>       osm_dump_dr_smp( p_vl->p_log,
>                        osm_madw_get_smp_ptr( p_madw ), OSM_LOG_FRAMES );  <=== here
>     }
>
> when it died but I didn't see the previous log message in the code
> "osm_vl15_poller: Servicing p_madw" which I also would have expected.
> [This would have been telling as p_madw->p_mad would have been logged].
> I also didn't see the __osm_vl15_poller entry message either.
well, if it segv'ed maybe it never finished writing out to the file...
-tduffy