[openib-general] Re: A Couple of CM Questions

Sean Hefty mshefty at ichips.intel.com
Wed Mar 9 16:37:44 PST 2005


Hal Rosenstock wrote:
>>>My main question has to do with an error path in cm_req_handler. If
>>>cm_init_av fails (lines 1098 or 1103), I get the following crash:
>>
> Also, this fixes the crash when this occurs but the removal of the CM
> module now hangs.
> 
> An easy way to reproduce this is to clear out the path record DGID
> before sending REP.

an update...

I've been able to reproduce this.  What's happening is that the cm_id 
that the CM created to handle the REQ is stuck waiting for its 
reference count to go to 0, but I'm not entirely sure why yet.

The REQ is received and processed in a CM-controlled work queue.  After 
seeing the error, the CM sends a REJ message to the sender.  (The code 
to set the proper reject code is not there yet, but a REJ should still 
be delivered.)  Sending the REJ increments the reference count on the 
cm_id.  The CM then waits in the CM work queue thread for the send to 
complete, which would decrement the reference count.
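
To make the sequence above concrete, here is a rough userspace sketch 
of the reference counting as I understand it.  This is not the actual 
CM code: the toy_* names and the struct are hypothetical, and the wait 
is modeled as a simple check rather than a real sleep.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the CM's per-connection state. */
struct toy_cm_id {
	int refcount;      /* references currently held on this cm_id */
	bool rej_pending;  /* a REJ send was posted and has not completed */
};

/* The creator holds the initial reference. */
static void toy_cm_id_init(struct toy_cm_id *id)
{
	id->refcount = 1;
	id->rej_pending = false;
}

/* Posting the REJ takes a reference that the send completion must drop. */
static void toy_send_rej(struct toy_cm_id *id)
{
	id->refcount++;
	id->rej_pending = true;
}

/* Send completion handler: releases the reference taken by toy_send_rej(). */
static void toy_send_done(struct toy_cm_id *id)
{
	assert(id->rej_pending);
	assert(id->refcount > 0);
	id->rej_pending = false;
	id->refcount--;
}

/*
 * Destroy drops the creator's reference and then waits for the count to
 * reach 0.  Returns true if that wait would finish right away, false if
 * it would block because someone else still holds a reference.
 */
static bool toy_destroy_can_finish(const struct toy_cm_id *id)
{
	return id->refcount == 1;
}
```

In this model, calling toy_send_rej() and then trying to destroy blocks 
until toy_send_done() runs.  If the completion never arrives, the wait 
never ends, which matches the hang described here.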

The send completion should be processed from the context of the work 
queue controlled by the MAD layer, so I'm not sure why it's not getting 
called.  My planned long-term fix is to allow the REJ to be sent 
without holding a reference on the cm_id.  But there's a similar issue 
when sending a DREQ or DREP while destroying a cm_id.  So, I'm trying 
to understand this more.

- Sean



