[ofw] opensm stuck upon kill
Hefty, Sean
sean.hefty at intel.com
Thu Feb 2 10:04:20 PST 2012
I don't see anything that stands out as a bug in the cleanup code. But with a ref_cnt that high, it seems unlikely that a small window in the cleanup code would result in that many MADs being missed. I need to spend more time reviewing the code.
Have you seen this as a consistent issue, or is this the first time that it's happened?
> -----Original Message-----
> From: Leonid Keller [mailto:leonid at mellanox.com]
> Sent: Thursday, February 02, 2012 8:42 AM
> To: Hefty, Sean; Tzachi Dar; Smith, Stan
> Cc: Uri Habusha; ofw_list; Irena Gannon
> Subject: RE: opensm stuck upon kill
>
> I no longer have the crashed machine.
> It was rebooted, and creating the full dump failed.
>
> I can't say anything about MADs, but I found only one place where an AV is
> created and attached to the PD: in the send_mad call.
> And I saw that the PD has ref_cnt = 227.
> I think these are references held by unreleased AVs, i.e. MADs.
>
> Could you tell me where I can see the unreleased MADs?
> The hang happened after WmProviderDeregister() and destroy_qp.
> WmProviderDeregister is supposed to release all the queued MADs.
> Could there be some MADs that are no longer, or not yet, in the queue?
>
> -----Original Message-----
> From: Hefty, Sean [mailto:sean.hefty at intel.com]
> Sent: Thursday, February 02, 2012 6:28 PM
> To: Leonid Keller; Tzachi Dar; Smith, Stan
> Cc: Uri Habusha; ofw_list; Irena Gannon
> Subject: RE: opensm stuck upon kill
>
> > winmad!WmRegRemoveHandler+0xae is stuck here:
> >
> > WmProviderDeregister(pRegistration->pProvider, pRegistration);
> > pRegistration->pDevice->IbInterface.destroy_qp(pRegistration->hQp, NULL);
> > pRegistration->pDevice->IbInterface.dealloc_pd(pRegistration->hPd, NULL);
> > pRegistration->pDevice->IbInterface.close_ca(pRegistration->hCa, NULL);
> >
> > Do you have any ideas?
>
> winmad does not explicitly allocate any address handles. Can you tell if
> there are any mads which were not returned to the free pool? You could try
> replacing the NULLs in the above code with ib_sync_destroy (unsure of exact
> name).
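
For readers following the ref_cnt discussion, here is a minimal sketch of the hypothesis raised in the thread: each AV created on the send path holds one reference on the PD, so any AV that is never destroyed leaves the PD with a nonzero ref_cnt and a blocking teardown cannot finish. Everything in it is an assumption made for illustration. The toy_pd and toy_av types and all toy_* functions are hypothetical stand-ins, not the real IBAL or winmad structures, and the counts passed in are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; not the real IBAL/winmad types. */
typedef struct toy_pd { long ref_cnt; } toy_pd;
typedef struct toy_av { toy_pd *pd; } toy_av;

/* Creating an AV takes one reference on the owning PD. */
static void toy_av_create(toy_av *av, toy_pd *pd)
{
    av->pd = pd;
    pd->ref_cnt++;
}

/* Destroying an AV returns that reference. */
static void toy_av_destroy(toy_av *av)
{
    assert(av->pd != NULL && av->pd->ref_cnt > 0);
    av->pd->ref_cnt--;
    av->pd = NULL;
}

/* A blocking teardown can only finish once every reference is gone. */
static int toy_pd_can_dealloc(const toy_pd *pd)
{
    return pd->ref_cnt == 0;
}

/* Simulate `sent` sends of which only `completed` release their AV.
 * Returns the number of references left on the PD. */
static long toy_leaked_refs(size_t sent, size_t completed)
{
    enum { MAX_AVS = 1024 };
    static toy_av avs[MAX_AVS];
    toy_pd pd = { 0 };
    size_t i;

    assert(sent <= MAX_AVS && completed <= sent);
    for (i = 0; i < sent; i++)
        toy_av_create(&avs[i], &pd);
    for (i = 0; i < completed; i++)
        toy_av_destroy(&avs[i]);
    return pd.ref_cnt;
}
```

In this model, toy_leaked_refs(300, 73) leaves 227 references, the same count seen on the PD in the dump, and toy_pd_can_dealloc() stays false until each of those AVs is destroyed. Whether the real references come from AVs is exactly the open question above.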