[ofa-general] Question about RDMA CM

Jeff Squyres jsquyres at cisco.com
Tue Sep 16 13:05:50 PDT 2008


On Sep 16, 2008, at 3:23 PM, Davis, Arlin R wrote:

>> 2. Due to some uninteresting race conditions, we only allow
>> connections to be made "in one direction" (the lower (IP address,
>> port) tuple is the initiator).  If the "wrong" MPI process desires to
>> make a connection, it makes a bogus QP and initiates an
>> rdma_connect().  The receiver process then gets the CONNECT_REQUEST
>> event, detects that the connection is coming the "wrong" way,
>> initiates the connection in the "right" direction, and then rejects
>> the "wrong" connection.  The initiator expects the rejection, and
>> simply waits for the CONNECT_REQUEST coming in the other direction.
>
> Do you use rdma_cm to create QPs?

No; we are using ibv_create_qp, and then assigning id->qp afterwards.

> If so, you have to be careful
> about re-using a cm_id's QP after rejections or any other conn event
> error. Not sure from your note here, but if you happen to move the
> rejected initiator cm_id's QP to the new cm_id created from the
> "right" direction CR coming in as a shortcut, you may have problems.

We create a new CM ID for the new connection in the "right" direction;  
the ID used for the "wrong" direction is eventually discarded.
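
For illustration, the direction rule quoted above (the lower (IP address,
port) tuple is the initiator) could be sketched as below. This is a minimal
sketch, not the actual Open MPI code: struct endpoint, endpoint_cmp() and
is_initiator() are hypothetical names, and IPv4 addresses held in host byte
order are an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical endpoint tuple: IPv4 address and port, both in host byte order. */
struct endpoint {
    uint32_t ip;
    uint16_t port;
};

/* Order tuples by IP address first, then by port.
 * Returns -1, 0, or 1 when a sorts before, equal to, or after b. */
int endpoint_cmp(const struct endpoint *a, const struct endpoint *b)
{
    assert(a != NULL && b != NULL);
    if (a->ip != b->ip)
        return a->ip < b->ip ? -1 : 1;
    if (a->port != b->port)
        return a->port < b->port ? -1 : 1;
    return 0;
}

/* The side with the lower tuple initiates; the other side waits for
 * (or provokes) a connection request in the "right" direction. */
int is_initiator(const struct endpoint *local, const struct endpoint *remote)
{
    return endpoint_cmp(local, remote) < 0;
}
```

Because both peers evaluate the same comparison on the same pair of tuples,
they agree on who initiates without exchanging any extra messages.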

> Also, do you validate the cm_id context and remote/local addresses
> in your CM processing thread? Could you possibly be getting
> misled on the established event and be sending to a QP not
> yet preposted? I guess you would see other QP errors in that case.


As far as I can tell, I am not sending to the wrong QP.  But it is  
complex code, so there could certainly be a bug in this area.

What is weird to me is that setting rnr_retry to 7 makes it  
work.

-- 
Jeff Squyres
Cisco Systems



