[ofa-general] Re: RDMA/iwarp CM question

Steve Wise swise at opengridcomputing.com
Wed Sep 12 12:33:04 PDT 2007



Kanoj Sarcar wrote:
> Response to original mail did not come to me, but I
> see it in the archives, responding back to the
> archived response. Please reply all on your responses.
> 

I did reply to all.  My outgoing folder shows that it went to both of 
your addresses...


> If the driver detaches the incoming (child) connection
> request from the listener at the point of sending the
> IW_CM_EVENT_CONNECT_REQUEST upcall, then for on-card
> connection clean up and child state cleanup in driver,
> OFA must guarantee that an accept/reject downcall will
> be made in the future.

Or you can time it out in your driver.
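
Something along these lines is what I mean by timing it out. This is only a
user-space sketch, not code from any real driver: struct pending_child, the
tick arguments, and all three function names are made up for illustration,
and tick wraparound is ignored. The driver arms a deadline when it sends the
IW_CM_EVENT_CONNECT_REQUEST upcall, and a periodic sweep gives up on children
that never saw an accept/reject downcall.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical driver-private record for a child connection that has been
 * detached from its listener and reported via IW_CM_EVENT_CONNECT_REQUEST. */
struct pending_child {
    unsigned long deadline;  /* tick after which the driver stops waiting */
    int resolved;            /* accept/reject arrived, or the timeout fired */
    int timed_out;           /* set only when the sweep expired this child */
};

/* Arm the timeout at upcall time. timeout_ticks is a driver policy choice. */
void pending_child_arm(struct pending_child *c, unsigned long now,
                       unsigned long timeout_ticks)
{
    assert(c != NULL);
    c->deadline = now + timeout_ticks;
    c->resolved = 0;
    c->timed_out = 0;
}

/* Called from the accept/reject downcall path.
 * Returns 0 on success, -1 if the child was already resolved or expired. */
int pending_child_resolve(struct pending_child *c)
{
    assert(c != NULL);
    if (c->resolved)
        return -1;
    c->resolved = 1;
    return 0;
}

/* Periodic sweep. Marks every unresolved child whose deadline has passed
 * and returns how many were expired. A real driver would abort the
 * on-card connection and free the child state at this point. */
size_t pending_child_sweep(struct pending_child *tbl, size_t n,
                           unsigned long now)
{
    size_t expired = 0;
    for (size_t i = 0; i < n; i++) {
        if (!tbl[i].resolved && now >= tbl[i].deadline) {
            tbl[i].resolved = 1;
            tbl[i].timed_out = 1;
            expired++;
        }
    }
    return expired;
}
```

With this shape, a late accept/reject downcall for an expired child simply
fails in pending_child_resolve() instead of touching freed state.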

>
> I don't believe that guarantee currently exists. There
> is exactly one failure point in the call chain
> cm_work_handler():process_event():cm_conn_req_handler()
> where the driver reject interface is invoked, but at
> multiple other failure points, this is not done.
>
> Also, looking at ucma.c, on destruction of a listener,
> I believe ucma_cleanup_events() will go around killing
> all pending IW_CM_EVENT_CONNECT_REQUEST requests, so
> the app will never get a chance to do the
> accept/reject.
>

It looks to me like ucma_cleanup_events() calls rdma_destroy_id() / 
iw_destroy_cm_id() / destroy_cm_id() which calls the provider reject 
function.  Or NOT! :)  There's a comment in the IW_CM_STATE_CONN_RECV 
case inside destroy_cm_id():

>                 /*
>                  * App called destroy before/without calling accept after
>                  * receiving connection request event notification or
>                  * returned non zero from the event callback function.
>                  * In either case, must tell the provider to reject.
>                  */

But I don't see the call to reject the connection...

Maybe you could add it and see if it clears up your issue?
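
Roughly what I have in mind, as a stand-alone mock rather than a patch
against iwcm.c. Everything prefixed mock_ or MOCK_ is hypothetical; the
reject callback signature is only modelled on the provider reject entry
point, and the real code would also need to deal with locking around the
call, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the iWARP CM states. */
enum mock_state { MOCK_IDLE, MOCK_CONN_RECV, MOCK_ESTABLISHED, MOCK_DESTROYING };

struct mock_cm_id;

/* Hypothetical provider ops table with only the reject entry point. */
struct mock_provider_ops {
    int (*reject)(struct mock_cm_id *id, const void *pdata,
                  unsigned char pdata_len);
};

struct mock_cm_id {
    enum mock_state state;
    const struct mock_provider_ops *ops;
    int reject_calls;  /* test hook: counts reject downcalls */
};

/* Mock provider reject: a real provider would release the on-card
 * connection and its child state here. */
int mock_provider_reject(struct mock_cm_id *id, const void *pdata,
                         unsigned char pdata_len)
{
    (void)pdata;
    (void)pdata_len;
    id->reject_calls++;
    return 0;
}

/* Mock of the destroy path. In the CONN_RECV case the app destroyed the
 * id without accepting, so the provider is told to reject. */
void mock_destroy_cm_id(struct mock_cm_id *id)
{
    assert(id != NULL && id->ops != NULL);
    switch (id->state) {
    case MOCK_CONN_RECV:
        id->state = MOCK_DESTROYING;
        id->ops->reject(id, NULL, 0);  /* the call that appears to be missing */
        break;
    default:
        id->state = MOCK_DESTROYING;
        break;
    }
}
```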


> Doesn't this sound like a problem (namely
> provider/card resource leak due to races with listener
> destruct)?
>

It does.

But MPA mandates a timeout so the connections will get aborted 
eventually by the provider or peer...

But I think you've found a bug...

Steve.


