[openib-general] OpenSM crash

Hal Rosenstock halr at voltaire.com
Fri May 27 14:15:40 PDT 2005


On Fri, 2005-05-27 at 14:31, Tom Duffy wrote:
> On Fri, 2005-05-27 at 11:27 -0700, Tom Duffy wrote:
> > I just noticed that my opensm had segv'ed and dumped core.
> 
> BTW, here was the tail of the osm.log:
> 
> May 27 01:44:09 [43005960] -> osm_vendor_get: [
> May 27 01:44:09 [43806960] -> __osm_vl15_poller: Servicing p_madw = 0x5678f0 (mad 0x5f33f0 req 1)
> May 27 01:44:09 [43005960] -> osm_vendor_get: Acquiring UMAD for p_madw = 0x567908, size = 256.
> May 27 01:44:09 [43005960] -> osm_vendor_get: Acquired UMAD 0x5f3640, size = 256.
> May 27 01:44:09 [43005960] -> osm_vendor_get: ]
> May 27 01:44:09 [43005960] -> osm_mad_pool_get: Acquired p_madw = 0x5678f0, p_mad = 0x5f3670, size = 256.
> May 27 01:44:09 [43005960] -> osm_mad_pool_get: ]
> May 27 01:44:09 [43005960] -> osm_req_get: Getting P_KeyTable (0x16), modifier = 0x10001, TID = 0x1c149.
> May 27 01:44:09 [43005960] -> osm_vl15_post: [
> May 27 01:44:09 [43005960] -> osm_vl15_post: Servicing p_madw = 0x5678f0 (mad 0x5f3670 req 1)
> May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 MADs outstanding.
                                               ^^^^^^^^^^
This looks weird: 4294967295 is 0xFFFFFFFF, which is what -1 looks like
when printed as an unsigned 32-bit value.
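
One way a counter can print like that is an unsigned decrement past
zero. A minimal sketch, assuming a 32-bit unsigned counter; the struct
and function names here are hypothetical and not taken from the OpenSM
source:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a "MADs on wire" counter (not the OpenSM type). */
typedef struct {
    uint32_t on_wire;
} mad_stats_t;

/* Unguarded decrement: wraps to UINT32_MAX (4294967295) if already 0. */
static uint32_t dec_unguarded(mad_stats_t *s)
{
    assert(s != NULL);
    return --s->on_wire;
}

/* Guarded decrement: returns -1 and leaves the counter at 0 instead of
 * wrapping, so the send/completion imbalance can be reported. */
static int dec_guarded(mad_stats_t *s)
{
    assert(s != NULL);
    if (s->on_wire == 0)
        return -1;
    s->on_wire--;
    return 0;
}
```

If something like dec_unguarded() runs once more than the matching
increment, a "%u" in the log would show 4294967295, as above.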

> May 27 01:44:09 [43005960] -> osm_vl15_poll: [
> May 27 01:44:09 [43005960] -> osm_vl15_poll: Signalling poller thread.
> May 27 01:44:09 [43005960] -> osm_vl15_poll: ]
> May 27 01:44:09 [43005960] -> osm_vl15_post: ]
> May 27 01:44:09 [43005960] -> osm_req_get: ]
> May 27 01:44:09 [43005960] -> osm_physp_has_pkey: ]
> May 27 01:44:09 [43005960] -> __osm_pi_rcv_get_pkey_slvl_vla_tables: ]
> May 27 01:44:09 [43005960] -> osm_pi_rcv_process: ]
> May 27 01:44:09 [43005960] -> __osm_sm_mad_ctrl_disp_done_callback: [

Wonder why __osm_sm_mad_ctrl_disp_done_callback wasn't on the stack
shown in the previous email as this makes it look like it should be.

Could you go back a little further in the log? I'd like to see what comes
before the start of __osm_pi_rcv_get_pkey_slvl_vla_tables and
osm_pi_rcv_process. It also seems weird to me that there is no other
log message between these two.

From the stack trace:
#3  osm_dump_dr_smp (p_log=0x552498, p_smp=0x0, log_level=32 ' ')
    at osm_helper.c:1446
#4  0x000000000042eed1 in __osm_vl15_poller (p_ptr=0x552498)
    at osm_madw.h:575

It looks like OpenSM was in osm_vl15intf.c::__osm_vl15_poller

    if( p_madw != (osm_madw_t*)cl_qlist_end( p_fifo ) )
    {
      if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) )
      {
        osm_log( p_vl->p_log, OSM_LOG_DEBUG,
                 "__osm_vl15_poller: "
                 "Servicing p_madw = %p (mad %p req %d)\n",
                 p_madw, p_madw->p_mad, p_madw->resp_expected);
      }

      if( osm_log_is_active( p_vl->p_log, OSM_LOG_FRAMES ) )
      {
        osm_dump_dr_smp( p_vl->p_log,
                         osm_madw_get_smp_ptr( p_madw ), OSM_LOG_FRAMES );  <=== here
      }

when it died, but I didn't see the preceding log message from the code,
"__osm_vl15_poller: Servicing p_madw", which I also would have expected.
[This would have been telling, as p_madw->p_mad would have been logged.]
I didn't see the __osm_vl15_poller entry message either.
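
Since frame #3 shows p_smp=0x0, the dump was handed a NULL SMP pointer.
Purely as an illustration of a defensive check at that call site (not a
proposed patch; the types and function names below are simplified,
hypothetical stand-ins rather than the real OpenSM definitions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the SMP and MAD wrapper types. */
typedef struct { unsigned char method; } smp_t;
typedef struct { smp_t *p_mad; int resp_expected; } madw_t;

/* Mirrors the idea of an accessor that returns the wrapped MAD pointer. */
static smp_t *madw_get_smp_ptr(const madw_t *p_madw)
{
    return p_madw->p_mad;
}

/* Dump only when both the wrapper and the SMP pointer are non-NULL.
 * Returns 0 if the SMP was dumped, -1 if the dump was skipped. */
static int dump_smp_checked(FILE *out, const madw_t *p_madw)
{
    const smp_t *p_smp;

    assert(out != NULL);
    if (p_madw == NULL || (p_smp = madw_get_smp_ptr(p_madw)) == NULL) {
        fprintf(out, "dump_smp_checked: NULL SMP pointer, skipping dump\n");
        return -1;
    }
    fprintf(out, "dump_smp_checked: method = 0x%02x\n", p_smp->method);
    return 0;
}
```

A check like this would only hide the symptom; how p_mad came to be NULL
for a wrapper still on the FIFO is the real question.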

-- Hal




