[ofa-general] Question about RDMA CM
Jeff Squyres
jsquyres at cisco.com
Tue Sep 16 13:05:50 PDT 2008
On Sep 16, 2008, at 3:23 PM, Davis, Arlin R wrote:
>> 2. Due to some uninteresting race conditions, we only allow
>> connections to be made "in one direction" (the lower (IP address,
>> port) tuple is the initiator). If the "wrong" MPI process desires to
>> make a connection, it makes a bogus QP and initiates an
>> rdma_connect(). The receiver process then gets the CONNECT_REQUEST
>> event, detects that the connection is coming the "wrong" way,
>> initiates the connection in the "right" direction, and then rejects
>> the "wrong" connection. The initiator expects the rejection, and
>> simply waits for the CONNECT_REQUEST coming in the other direction.
>
> Do you use rdma_cm to create QPs?
No; we are using ibv_create_qp, and then assigning id->qp afterwards.
> If so, you have to be careful
> about re-using a cm_id's QP after rejections or any other conn event
> error. I'm not sure from your note here, but if you happen to move the
> rejected initiator cm_id's QP to the new cm_id created from the
> "right" direction CR coming in as a shortcut, you may have problems.
We create a new CM ID for the new connection in the "right" direction;
the ID used for the "wrong" direction is eventually discarded.
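To make the direction rule quoted above concrete, here is a minimal sketch in C of the tie-break. It is not the actual Open MPI code. The names (struct endpoint, endpoint_cmp, should_initiate) are hypothetical, and it assumes IPv4 addresses and ports in host byte order, compared address first and port second; the description above only says that the lower (IP address, port) tuple is the initiator.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical endpoint: IPv4 address and port, both in host byte order. */
struct endpoint {
    uint32_t ip;
    uint16_t port;
};

/* Lexicographic comparison on (ip, port).
 * Returns <0 if a is lower, 0 if equal, >0 if a is higher. */
static int endpoint_cmp(struct endpoint a, struct endpoint b)
{
    if (a.ip != b.ip)
        return a.ip < b.ip ? -1 : 1;
    if (a.port != b.port)
        return a.port < b.port ? -1 : 1;
    return 0;
}

/* True if the local side holds the lower tuple and is therefore the one
 * that makes the "right" direction connection. A side for which this is
 * false would send the bogus request and expect it to be rejected. */
static bool should_initiate(struct endpoint local, struct endpoint remote)
{
    return endpoint_cmp(local, remote) < 0;
}
```

Both peers evaluate the same comparison with the arguments swapped, so exactly one of them sees should_initiate() return true as long as the two tuples differ.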
> Also, do you validate the cm_id context and remote/local addresses
> in your CM processing thread? Could you possibly be getting
> misguided on the established event and be sending to a QP not
> yet preposted? I guess you would see other QP errors in that case.
As far as I can tell, I am not sending to the wrong QP. But it is
complex code, so there certainly could be a bug in this area.
The thing that is weird for me is that setting rnr_retry to 7 makes it
work.
--
Jeff Squyres
Cisco Systems