[openib-general] Re: A Couple of CM Questions
Sean Hefty
mshefty at ichips.intel.com
Wed Mar 9 16:37:44 PST 2005
Hal Rosenstock wrote:
>>>My main question has to do with an error path in cm_req_handler. If
>>>cm_init_av fails (lines 1098 or 1103), I get the following crash:
>>
> Also, this fixes the crash when this occurs but the removal of the CM
> module now hangs.
>
> An easy way to reproduce this is to clear out the path record DGID
> before sending REP.
An update:
I've been able to reproduce this. The cm_id that the CM created to
handle the REQ is hung waiting for its reference count to drop to 0,
but I'm not yet sure why.
The REQ is received and processed in a CM-controlled work queue. After
seeing the error, the CM sends a REJ message to the sender. (The code
to set the proper reject code is not there yet, but a REJ should still
be delivered.) Sending the REJ increments the reference count on the
cm_id. The CM then waits in the CM work queue thread for the send to
complete, which would decrement the reference count.
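To make that sequence concrete, here is a minimal userspace sketch in C. It is not the actual CM code: the names toy_cm_id, toy_send_rej, toy_send_done, and toy_destroy_would_finish are made up for illustration, and refcount here counts only the references taken by outstanding sends. The real destroy path blocks, while this sketch only reports whether it would be able to finish.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a cm_id; not the real structure used by the CM. */
struct toy_cm_id {
	int refcount;	/* references held by outstanding sends (assumption) */
};

/* Queuing the REJ takes a reference that only the send completion drops. */
static void toy_send_rej(struct toy_cm_id *id)
{
	id->refcount++;
}

/* Send completion handler; it releases the reference taken by the send. */
static void toy_send_done(struct toy_cm_id *id)
{
	assert(id->refcount > 0);
	id->refcount--;
}

/* Destroy can only finish once no send still holds a reference. */
static bool toy_destroy_would_finish(const struct toy_cm_id *id)
{
	return id->refcount == 0;
}
```

If toy_send_done is never called after toy_send_rej, toy_destroy_would_finish stays false forever, which corresponds to the hang described above.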
The send completion should be processed from the context of the
MAD-layer-controlled work queue, so I'm not sure why it's not getting
called. My planned long-term fix is to allow the REJ to be sent
without holding a reference on the cm_id. But there's a similar issue
sending a DREQ or DREP when destroying a cm_id. So, I'm trying to
understand this more.
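One way a REJ could be sent without holding a reference is for the send to carry its own copy of whatever it needs from the cm_id. The sketch below is only an illustration under that assumption, not the planned implementation: sketch_cm_id, sketch_rej_msg, sketch_build_rej, and the field names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cm_id; field names are assumptions. */
struct sketch_cm_id {
	int refcount;
	uint32_t local_comm_id;
	uint32_t remote_comm_id;
};

/* Hypothetical REJ send request that owns a copy of the data it needs. */
struct sketch_rej_msg {
	uint32_t local_comm_id;
	uint32_t remote_comm_id;
};

/*
 * Build the REJ from a snapshot of the cm_id. The message keeps no pointer
 * back to the cm_id, so no reference is taken and refcount is left untouched.
 */
static struct sketch_rej_msg sketch_build_rej(const struct sketch_cm_id *id)
{
	assert(id != NULL);

	struct sketch_rej_msg msg = {
		.local_comm_id = id->local_comm_id,
		.remote_comm_id = id->remote_comm_id,
	};
	return msg;
}
```

Because the message is independent of the cm_id once it is built, destroying the cm_id would not have to wait for the send completion in this model.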
- Sean