[ofa-general] Question about RDMA CM
Doug Ledford
dledford at redhat.com
Tue Sep 16 22:45:23 PDT 2008
On Tue, 2008-09-16 at 16:05 -0400, Jeff Squyres wrote:
> On Sep 16, 2008, at 3:23 PM, Davis, Arlin R wrote:
>
> >> 2. Due to some uninteresting race conditions, we only allow
> >> connections to be made "in one direction" (the lower (IP address,
> >> port) tuple is the initiator). If the "wrong" MPI process desires to
> >> make a connection, it makes a bogus QP and initiates an
> >> rdma_connect(). The receiver process then gets the CONNECT_REQUEST
> >> event, detects that the connection is coming the "wrong" way,
> >> initiates the connection in the "right" direction, and then rejects
> >> the "wrong" connection. The initiator expects the rejection, and
> >> simply waits for the CONNECT_REQUEST coming in the other direction.
> >
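The "lower (IP address, port) tuple is the initiator" rule quoted above can be sketched as a plain ordering test. This is only an illustration, not Open MPI's actual code: the `struct endpoint` type, the function names, and the choice of IPv4 addresses held in host byte order are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical endpoint: IPv4 address and port, both in host byte order. */
struct endpoint {
    uint32_t addr;
    uint16_t port;
};

/* Order endpoints by address first, then by port.
 * Returns <0, 0 or >0 when a sorts before, equal to, or after b. */
int endpoint_cmp(const struct endpoint *a, const struct endpoint *b)
{
    if (a->addr != b->addr)
        return a->addr < b->addr ? -1 : 1;
    if (a->port != b->port)
        return a->port < b->port ? -1 : 1;
    return 0;
}

/* True when the local side holds the lower tuple, i.e. it is the side
 * allowed to initiate the "real" connection under the quoted rule. */
bool local_is_initiator(const struct endpoint *local,
                        const struct endpoint *remote)
{
    return endpoint_cmp(local, remote) < 0;
}
```

A process for which `local_is_initiator()` returns false would be the one that makes the bogus QP and expects its rdma_connect() to be rejected.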
> > Do you use rdma_cm to create QP's?
>
> No; we are using ibv_create_qp, and then assigning id->qp afterwards.
Don't do that. If rdmacm provides an interface for doing something,
assume there is a reason for it. In this case, rdma_create_qp() does
more than just call ibv_create_qp() and ibv_modify_qp() on your behalf.
It also passes information about the QP's state changes to the kernel
rdma_cm module by writing commands in rdma_cm format to
id->channel->fd (the rdma_cm fd, not the QP fd) in places like
rdma_init_qp_attr().
> As far as I can tell, I am not sending to the wrong QP. But it is
> complex code, so there certainly can be a bug in this area.
>
> The thing that is weird for me is that setting rnr_retry to 7 makes it
> work.
I haven't looked at the kernel code, so I can't say for certain whether
using rdma_create_qp() is a hard requirement, or whether skipping it
would explain why an rnr_retry of 7 gets around the race condition, but
I think it's plausible.
--
Doug Ledford <dledford at redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband