[openib-general] opensm crash

Hal Rosenstock halr at voltaire.com
Thu Jan 27 10:58:23 PST 2005


Hi Tom,

On Thu, 2005-01-27 at 12:53, Tom Duffy wrote:
> I hit control-c to kill osm and got:
> 
> Jan 27 18:47:09 [44808960] -> osm_mad_pool_get: [
> opensm[4627]: *** exception handler: died with signal 11
> Segmentation fault

Looks to me like the following could be the case: 

One thread was shutting down the OSM (osm_opensm_destroy was called and
got at least as far as destroying the SA; subsequent to this the MAD
pool is destroyed) and another thread attempted a get from the MAD pool.
I'm not sure what would prevent this from occurring.
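
As an illustration only, here is a minimal sketch of one way such a race
could be guarded against. It is not the actual OpenSM code: demo_pool_t
and the demo_pool_* functions are hypothetical stand-ins for the MAD pool,
and the sketch assumes POSIX threads. The idea is to split teardown into
two phases: first mark the pool as shutting down so that later gets fail
cleanly, then destroy it only after every thread that might call get has
been joined.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a MAD pool; not the real OpenSM structure. */
typedef struct demo_pool {
    pthread_mutex_t lock;
    bool shutting_down;   /* set once teardown has begun */
    size_t outstanding;   /* buffers handed out and not yet returned */
} demo_pool_t;

void demo_pool_init(demo_pool_t *p)
{
    pthread_mutex_init(&p->lock, NULL);
    p->shutting_down = false;
    p->outstanding = 0;
}

/* Returns a buffer, or NULL if shutdown has already begun. */
void *demo_pool_get(demo_pool_t *p, size_t size)
{
    void *buf = NULL;

    pthread_mutex_lock(&p->lock);
    if (!p->shutting_down) {
        buf = malloc(size);
        if (buf != NULL)
            p->outstanding++;
    }
    pthread_mutex_unlock(&p->lock);
    return buf;
}

/* Returns a buffer to the pool; a NULL buffer is ignored. */
void demo_pool_put(demo_pool_t *p, void *buf)
{
    if (buf == NULL)
        return;

    pthread_mutex_lock(&p->lock);
    assert(p->outstanding > 0);
    p->outstanding--;
    pthread_mutex_unlock(&p->lock);
    free(buf);
}

/* Phase 1 of teardown: refuse any further gets. */
void demo_pool_begin_shutdown(demo_pool_t *p)
{
    pthread_mutex_lock(&p->lock);
    p->shutting_down = true;
    pthread_mutex_unlock(&p->lock);
}

/*
 * Phase 2 of teardown: the caller must already have joined every thread
 * that could call demo_pool_get() or demo_pool_put(), because the flag
 * and the lock live inside the pool itself.
 */
void demo_pool_destroy(demo_pool_t *p)
{
    assert(p->shutting_down && p->outstanding == 0);
    pthread_mutex_destroy(&p->lock);
}
```

In this sketch the shutdown path would call demo_pool_begin_shutdown(),
join the worker threads, and only then call demo_pool_destroy(). A thread
that loses the race gets NULL back from demo_pool_get() instead of
touching a destroyed pool.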

I am looking into this crash further and am trying to reproduce it.

-- Hal


> Here is the last 100 lines of the osm.log
> 
> [root@flopteron2 bin]# tail -100 /var/log/osm.log
> Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: Retiring MAD with TID = 0x2bf9.
> Jan 27 18:47:04 [43005960] -> osm_mad_pool_put: [
> Jan 27 18:47:04 [43005960] -> osm_mad_pool_put: Releasing p_madw = 0x56d9c0, p_mad = 0x599140.
> Jan 27 18:47:04 [43005960] -> osm_vendor_put: [
> Jan 27 18:47:04 [43005960] -> osm_vendor_put: Retiring UMAD 0x599140.
> Jan 27 18:47:04 [43005960] -> osm_vendor_put: ]
> Jan 27 18:47:04 [43005960] -> osm_mad_pool_put: ]
> Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: 0 QP0 MADs outstanding.
> Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: Posting Dispatcher message OSM_MSG_NO_SMPS_OUTSTANDING.
> Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: ]
> Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_disp_done_callback: ]
> Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: [
> Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state OSM_SM_STATE_SWEEP_LIGHT.
> Jan 27 18:47:04 [43005960] -> __osm_state_mgr_light_sweep_done_msg:
> 
> 
> ******************************************************************
> ********************** LIGHT SWEEP COMPLETE **********************
> ******************************************************************
> 
> 
> Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_IDLE_TIME_PROCESS in state OSM_SM_STATE_PROCESS_REQUEST.
> Jan 27 18:47:04 [43005960] -> __process_idle_time_queue_start: [
> Jan 27 18:47:04 [43005960] -> __process_idle_time_queue_start: ]
> Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: ]
> Jan 27 18:47:09 [9597F060] -> osm_vl15_shutdown: [
> Jan 27 18:47:09 [9597F060] -> osm_vl15_shutdown: ]
> Jan 27 18:47:09 [9597F060] -> osm_vendor_set_sm: [
> Jan 27 18:47:09 [9597F060] -> osm_vendor_set_sm: ]
> Jan 27 18:47:09 [9597F060] -> osm_sm_destroy: [
> Jan 27 18:47:09 [44007960] -> __osm_sm_sweeper: Off schedule sweep signalled.
> Jan 27 18:47:09 [44007960] -> __osm_sm_sweeper: ]
> Jan 27 18:47:09 [9597F060] -> osm_trap_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> cl_event_wheel_destroy: [
> Jan 27 18:47:09 [9597F060] -> cl_event_wheel_dump: [
> Jan 27 18:47:09 [9597F060] -> cl_event_wheel_dump: event_wheel ptr:0x5575f8
> Jan 27 18:47:09 [9597F060] -> cl_event_wheel_dump: ]
> Jan 27 18:47:09 [9597F060] -> cl_event_wheel_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_trap_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_sminfo_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_sminfo_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_ni_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_ni_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_pi_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_pi_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_si_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_si_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_nd_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_nd_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_lid_mgr_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_lid_mgr_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_ucast_mgr_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_ucast_mgr_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_link_mgr_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_link_mgr_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_drop_mgr_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_drop_mgr_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_lft_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_lft_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_mft_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_mft_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_slvl_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_slvl_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_vla_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_vla_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_pkey_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_pkey_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_state_mgr_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_state_mgr_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_sm_state_mgr_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_sm_state_mgr_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_mcast_mgr_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_mcast_mgr_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_sm_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_sa_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_nr_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_nr_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_pir_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_pir_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_lr_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_lr_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_pr_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_pr_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_smir_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_smir_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_mcmr_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_mcmr_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_sr_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_sr_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_infr_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_infr_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_vlarb_rec_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_vlarb_rec_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_slvl_rec_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_slvl_rec_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_pkey_rec_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_pkey_rec_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_lftr_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_lftr_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_sa_destroy: ]