[ofw] opensm stuck upon kill

Hefty, Sean sean.hefty at intel.com
Thu Feb 2 10:04:20 PST 2012


I don't see anything that stands out as a bug in the cleanup code.  But with a ref_cnt that high, it seems unlikely that a small window in the cleanup code would result in that many MADs being missed.  I need to spend more time reviewing the code.

Have you seen this as a consistent issue, or is this the first time that it's happened? 
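
For illustration only, here is a toy C model of the situation described in the quoted message below, assuming Leonid's reading is right that each send creates an AV which takes a reference on the PD. All names (toy_pd, toy_av, toy_leaked_refs, and so on) are hypothetical and are not winmad or IBAL code. It only shows why a PD whose AVs are never destroyed can never reach a ref_cnt of zero, so a synchronous dealloc would wait forever.

```c
/* Toy model only: these types and functions are hypothetical
 * stand-ins, not the winmad or IBAL API. */
#include <assert.h>
#include <stddef.h>

typedef struct toy_pd {
    int ref_cnt;        /* references held by child objects (AVs) */
} toy_pd;

typedef struct toy_av {
    toy_pd *pd;         /* parent PD this AV holds a reference on */
} toy_av;

/* Creating an AV takes one reference on its PD. */
void toy_create_av(toy_pd *pd, toy_av *av)
{
    av->pd = pd;
    pd->ref_cnt++;
}

/* Destroying an AV drops that reference. */
void toy_destroy_av(toy_av *av)
{
    assert(av->pd != NULL);
    av->pd->ref_cnt--;
    av->pd = NULL;
}

/* Returns 1 if the PD could be freed now, 0 if a synchronous
 * dealloc would have to keep waiting for references to drop. */
int toy_pd_can_dealloc(const toy_pd *pd)
{
    return pd->ref_cnt == 0;
}

/* "Send" n MADs (each creates an AV), complete only `completed`
 * of them, and return the references still pinned on the PD. */
int toy_leaked_refs(int n, int completed)
{
    toy_pd pd = { 0 };
    toy_av avs[256];
    int i;

    assert(n >= 0 && n <= 256);
    assert(completed >= 0 && completed <= n);

    for (i = 0; i < n; i++)
        toy_create_av(&pd, &avs[i]);
    for (i = 0; i < completed; i++)
        toy_destroy_av(&avs[i]);

    return pd.ref_cnt;
}
```

In this model, the count left on the PD equals the number of sends whose AVs were never destroyed, which is why a count in the hundreds points at many missed MADs rather than one or two lost in a narrow race.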

> -----Original Message-----
> From: Leonid Keller [mailto:leonid at mellanox.com]
> Sent: Thursday, February 02, 2012 8:42 AM
> To: Hefty, Sean; Tzachi Dar; Smith, Stan
> Cc: Uri Habusha; ofw_list; Irena Gannon
> Subject: RE: opensm stuck upon kill
> 
> I no longer have the crashed machine.
> It was rebooted, and creation of the full dump failed.
> 
> I can't say about MADs, but I found only one place where an AV is created and
> attached to the PD: in the send_mad call.
> And I saw that the PD has ref_cnt = 227.
> I think these are references held by unreleased AVs, i.e. MADs.
> 
> Could you tell me where I can see unreleased MADs?
> The hang happened after WmProviderDeregister() and destroy_qp.
> WmProviderDeregister is supposed to release all the queued MADs.
> Could there be some MADs that are no longer, or not yet, in the queue?
> 
> -----Original Message-----
> From: Hefty, Sean [mailto:sean.hefty at intel.com]
> Sent: Thursday, February 02, 2012 6:28 PM
> To: Leonid Keller; Tzachi Dar; Smith, Stan
> Cc: Uri Habusha; ofw_list; Irena Gannon
> Subject: RE: opensm stuck upon kill
> 
> > winmad!WmRegRemoveHandler+0xae is standing here:
> >
> > 	WmProviderDeregister(pRegistration->pProvider, pRegistration);
> > 	pRegistration->pDevice->IbInterface.destroy_qp(pRegistration->hQp, NULL);
> > 	pRegistration->pDevice->IbInterface.dealloc_pd(pRegistration->hPd, NULL);
> > >	pRegistration->pDevice->IbInterface.close_ca(pRegistration->hCa, NULL);
> >
> > Could you suggest any ideas?
> 
> winmad does not explicitly allocate any address handles.  Can you tell if
> there are any mads which were not returned to the free pool?  You could try
> replacing the NULLs in the above code with ib_sync_destroy (unsure of exact
> name).


