[openib-general] OpenSM crash

Tom Duffy tduffy at sun.com
Fri May 27 14:31:48 PDT 2005


On Fri, 2005-05-27 at 17:15 -0400, Hal Rosenstock wrote:
> On Fri, 2005-05-27 at 14:31, Tom Duffy wrote:
> > On Fri, 2005-05-27 at 11:27 -0700, Tom Duffy wrote:
> > > I just noticed that my opensm had segv'ed and dumped core.
> > 
> > BTW, here was the tail of the osm.log:
> > 
> > May 27 01:44:09 [43005960] -> osm_vendor_get: [
> > May 27 01:44:09 [43806960] -> __osm_vl15_poller: Servicing p_madw = 0x5678f0 (mad 0x5f33f0 req 1)
> > May 27 01:44:09 [43005960] -> osm_vendor_get: Acquiring UMAD for p_madw = 0x567908, size = 256.
> > May 27 01:44:09 [43005960] -> osm_vendor_get: Acquired UMAD 0x5f3640, size = 256.
> > May 27 01:44:09 [43005960] -> osm_vendor_get: ]
> > May 27 01:44:09 [43005960] -> osm_mad_pool_get: Acquired p_madw = 0x5678f0, p_mad = 0x5f3670, size = 256.
> > May 27 01:44:09 [43005960] -> osm_mad_pool_get: ]
> > May 27 01:44:09 [43005960] -> osm_req_get: Getting P_KeyTable (0x16), modifier = 0x10001, TID = 0x1c149.
> > May 27 01:44:09 [43005960] -> osm_vl15_post: [
> > May 27 01:44:09 [43005960] -> osm_vl15_post: Servicing p_madw = 0x5678f0 (mad 0x5f3670 req 1)
> > May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 MADs outstanding.
>                                                ^^^^^^^^^^
> This looks weird.
> 
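For what it's worth, 4294967295 is 2^32 - 1, which is what an unsigned 32-bit counter shows after it is decremented from 0. That is only one possible reading of the "MADs on wire" line above, and nothing in the log confirms it. A minimal sketch of the effect follows; the function names are hypothetical and this is not the OpenSM counter code:

```c
#include <stdint.h>

/* Hypothetical helper, not an OpenSM function: decrement an unsigned
 * 32-bit "MADs on wire" style counter with no underflow check. */
static uint32_t mads_on_wire_dec(uint32_t count)
{
    /* Unsigned arithmetic wraps modulo 2^32, so 0 - 1 gives 4294967295. */
    return count - 1u;
}

/* Guarded variant (also hypothetical): never go below zero, and tell
 * the caller when a decrement had no matching increment. */
static uint32_t mads_on_wire_dec_checked(uint32_t count, int *underflow)
{
    if (count == 0) {
        *underflow = 1;   /* unbalanced decrement detected */
        return 0;
    }
    *underflow = 0;
    return count - 1u;
}
```

If the counter really did wrap like this, a check of the second kind would at least log the first unbalanced decrement instead of letting the count silently go bad.
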
> > May 27 01:44:09 [43005960] -> osm_vl15_poll: [
> > May 27 01:44:09 [43005960] -> osm_vl15_poll: Signalling poller thread.
> > May 27 01:44:09 [43005960] -> osm_vl15_poll: ]
> > May 27 01:44:09 [43005960] -> osm_vl15_post: ]
> > May 27 01:44:09 [43005960] -> osm_req_get: ]
> > May 27 01:44:09 [43005960] -> osm_physp_has_pkey: ]
> > May 27 01:44:09 [43005960] -> __osm_pi_rcv_get_pkey_slvl_vla_tables: ]
> > May 27 01:44:09 [43005960] -> osm_pi_rcv_process: ]
> > May 27 01:44:09 [43005960] -> __osm_sm_mad_ctrl_disp_done_callback: [
> 
> Wonder why __osm_sm_mad_ctrl_disp_done_callback wasn't on the stack
> shown in the previous email as this makes it look like it should be.
> 
> Could you go back a little further in the log? I'd like to see what is
> before the start of __osm_pi_rcv_get_pkey_slvl_vla_tables and
> osm_pi_rcv_process.

The log had grown to almost 1G, so I actually deleted it.  Shit, sorry.

> It also seems weird to me that there is no other
> log message between these two.
> 
> From the stack trace:
> #3  osm_dump_dr_smp (p_log=0x552498, p_smp=0x0, log_level=32 ' ')
>     at osm_helper.c:1446
> #4  0x000000000042eed1 in __osm_vl15_poller (p_ptr=0x552498) at
> osm_madw.h:575
> 
> It looks like OpenSM was in osm_vl15intf.c::__osm_vl15_poller
> 
>     if( p_madw != (osm_madw_t*)cl_qlist_end( p_fifo ) )
>     {
>       if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) )
>       {
>         osm_log( p_vl->p_log, OSM_LOG_DEBUG,
>                  "__osm_vl15_poller: "
>                  "Servicing p_madw = %p (mad %p req %d)\n",
>                  p_madw, p_madw->p_mad, p_madw->resp_expected);
>       }
> 
>       if( osm_log_is_active( p_vl->p_log, OSM_LOG_FRAMES ) )
>       {
>         osm_dump_dr_smp( p_vl->p_log,
>                          osm_madw_get_smp_ptr( p_madw ), OSM_LOG_FRAMES );  <=== here
>       }
> 
> when it died but I didn't see the previous log message in the code
> "osm_vl15_poller: Servicing p_madw" which I also would have expected.
> [This would have been telling as p_madw->p_mad would have been logged].
> I didn't see the __osm_vl15_poller entry message either.
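
The trace shows osm_dump_dr_smp being entered with p_smp=0x0, so whatever the root cause is, the dump path was handed a NULL MAD pointer. A defensive check before dumping might look roughly like the sketch below. The struct and function names are hypothetical stand-ins, not the real osm_madw_t or OpenSM API, and this would only hide the symptom rather than explain why the pointer was NULL:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; not the OpenSM types. */
struct smp  { unsigned attr_id; };
struct madw { struct smp *p_mad; };

/* Dump a frame only when there is one to dump.
 * Returns 0 if the frame was dumped, -1 if it was skipped. */
static int dump_smp_checked(FILE *log, const struct madw *p_madw)
{
    if (p_madw == NULL || p_madw->p_mad == NULL) {
        /* Record the bad wrapper instead of dereferencing NULL. */
        fprintf(log, "dump skipped: NULL MAD pointer (p_madw = %p)\n",
                (const void *)p_madw);
        return -1;
    }
    fprintf(log, "attr_id = 0x%x\n", p_madw->p_mad->attr_id);
    return 0;
}
```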

well, if it segv'ed maybe it never finished writing out to the file...
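
To spell out what I mean: with buffered stdio, a message can sit in the process's buffer and vanish when the process dies on a signal. I have not checked how osm_log actually writes, so treat this as an assumption. A minimal sketch of a logger that flushes after every message (log_line is a made-up name, not an OpenSM function):

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical logger, not osm_log(): write one formatted message and
 * flush it right away, so a later crash cannot strand it in the stdio
 * buffer. Returns the number of characters written, or -1 on error. */
static int log_line(FILE *fp, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vfprintf(fp, fmt, ap);
    va_end(ap);

    if (n < 0)
        return -1;

    /* fflush() hands the data to the kernel; it survives the process
     * dying, though not necessarily a power loss. */
    return fflush(fp) == 0 ? n : -1;
}
```

Flushing on every line costs some throughput, which matters at debug verbosity, but it makes the tail of the log trustworthy after a segfault.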

-tduffy