[ofa-general] Question about RDMA CM
Jeff Squyres
jsquyres at cisco.com
Tue Sep 16 10:02:14 PDT 2008
Greetings. I'm trying to finish up RDMA CM support in Open MPI, but
am running into a problem on IB that I have been beating my head
against for about a week and can't figure out (it seems to work fine
on iWARP). I know Sean is out on sabbatical; I'm hoping that someone
will have an insight into my problem anyway.
Short version:
==============
Open MPI uses a separate thread for most of its RDMA CM actions to
ensure that they are handled in a timely manner and don't time out. All
the code seems to work fine on iWARP (tested with Chelsio T3's), but
on some older Mellanox HCAs, I sometimes get RNRs after both sides get
the ESTABLISHED event and the initiator sends the first message on the
new RC QP (not SRQ). I am *sure* that a receive buffer is posted at
the receiver, and the QPs appear to be transitioning INIT -> RTR ->
RTS properly. I cannot figure out why I am getting RNRs. These RNRs
*only* seem to happen when the initiator server, the receiver server,
or both are fully busy (i.e., all cores are 100% busy).
Longer version:
===============
All the code is currently in a development mercurial branch (not on
the main Open MPI SVN):
http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/openib-fd-progress/
As mentioned above, this all seems to work fine on Chelsio T3's. I'm
getting these RNRs on Mellanox 2 port DDR cards, MT_00A0010001 using
OFED 1.3.1. I have not tried other IB cards. All my servers are
4-core Intel machines (slightly old -- pre-Woodcrest). I can pretty
consistently get the problems to occur when I run a simple MPI "ring"
test program (send a message around in a ring) across 2 servers (4
cores/ea). OMPI uses shared memory for on-node communication and
verbs for off-node communication. The program runs fine when I do not
use RDMA CM, but gets RNRs for some connections when I use RDMA CM
over IB and all 4 cores on both servers are running MPI processes
(i.e., are 100% busy polling for message passing progress). The
connectivity looks like this:
node 1:  |--- proc A <- shmem <- proc B <- shmem <- proc C <- shmem <- proc D <-|
       verbs                                                                  verbs
node 2:  |--> proc E -> shmem -> proc F -> shmem -> proc G -> shmem -> proc H --|
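(For reference, the ring test is nothing fancy -- conceptually it's just
something like the following sketch, not the actual test program:)

#include <mpi.h>
#include <stdio.h>

/* Minimal ring sketch: each rank receives a token from (rank-1) and
 * forwards it to (rank+1); rank 0 starts the token and receives it last. */
int main(int argc, char **argv)
{
    int rank, size, token = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    if (0 == rank) {
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Token made it around the ring\n");
    } else {
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}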
Random notes:
1. Open MPI uses a progress thread in its OpenFabrics support for the
RDMA CM. rdma_connect() is initiated from the main thread, but all
other events are handled from the progress thread (a rough sketch of
this wireup appears after these notes).
2. Due to some uninteresting race conditions, we only allow
connections to be made "in one direction" (the lower (IP address,
port) tuple is the initiator). If the "wrong" MPI process desires to
make a connection, it makes a bogus QP and initiates an
rdma_connect(). The receiver process then gets the CONNECT_REQUEST
event, detects that the connection is coming the "wrong" way,
initiates the connection in the "right" direction, and then rejects
the "wrong" connection. The initiator expects the rejection, and
simply waits for the CONNECT_REQUEST coming in the other direction.
3. To accommodate iWARP's "initiator must send first" requirement, we
have the connection sequence in OMPI only post a single receive buffer
that will later be used for an OMPI-level CTS. So during the RDMA CM
wireup, there is only *one* receive buffer posted. Once the
ESTABLISHED event arrives, OMPI posts all the rest of its normal
receive buffers and then sends the CTS to the peer; that CTS consumes
the single buffer the peer posted earlier, so the peer is guaranteed
to have a receive buffer waiting for it. OMPI does not start sending
anything else until it gets the CTS from its peer.
4. OMPI normally sets non-SRQ RC QP's rnr_retry_count value to 0
because OMPI has its own flow control (read: if we ever get an RNR,
it's an OMPI bug).
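To make notes 1-3 concrete, here's a rough sketch of what the progress
thread does (this is not the actual code from the branch -- the pd /
conn_param / helper names below are placeholders, and error handling is
omitted):

#include <rdma/rdma_cma.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholders standing in for OMPI state and logic -- not real OMPI symbols. */
extern struct ibv_pd *pd;
extern struct ibv_qp_init_attr qp_attr;
extern struct rdma_conn_param conn_param;
extern struct ibv_mr *cts_mr;
extern void *cts_buf;
extern size_t cts_len;
extern int wrong_direction(struct rdma_cm_id *id);
extern void initiate_right_direction(struct rdma_cm_id *id);
extern void post_remaining_recvs_and_send_cts(struct rdma_cm_id *id);

/* Post exactly one receive buffer: the one the peer's CTS will consume. */
static void post_single_cts_recv(struct rdma_cm_id *id)
{
    struct ibv_sge sge = { .addr = (uintptr_t) cts_buf,
                           .length = (uint32_t) cts_len,
                           .lkey = cts_mr->lkey };
    struct ibv_recv_wr wr = { .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    ibv_post_recv(id->qp, &wr, &bad);
}

/* Progress thread: rdma_connect() happens on the main thread; everything
 * else is handled here. */
static void *cm_progress_thread(void *arg)
{
    struct rdma_event_channel *channel = arg;
    struct rdma_cm_event *event;

    while (0 == rdma_get_cm_event(channel, &event)) {
        switch (event->event) {
        case RDMA_CM_EVENT_CONNECT_REQUEST:
            if (wrong_direction(event->id)) {
                /* Start the connection in the "right" direction, then
                 * reject this one; the initiator expects the rejection. */
                initiate_right_direction(event->id);
                rdma_reject(event->id, NULL, 0);
            } else {
                rdma_create_qp(event->id, pd, &qp_attr);
                post_single_cts_recv(event->id);   /* before accepting */
                rdma_accept(event->id, &conn_param);
            }
            break;
        case RDMA_CM_EVENT_ESTABLISHED:
            /* Post the rest of the normal receive buffers, then send the
             * CTS that consumes the peer's single pre-posted buffer. */
            post_remaining_recvs_and_send_cts(event->id);
            break;
        case RDMA_CM_EVENT_REJECTED:
            /* Expected on the bogus "wrong direction" QP; just clean up. */
            break;
        default:
            break;
        }
        rdma_ack_cm_event(event);
    }
    return NULL;
}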
Consider a scenario where MPI process A wants to connect to MPI
process B (on different servers). Let's assume that A->B is the
"right" direction for simplicity. Here's what happens:
A: creates QP, posts the 1 CTS receive buffer, and calls rdma_connect()
B: gets CONNECT_REQUEST, creates QP, posts the 1 CTS receive buffer,
and calls rdma_accept()
--> I've verified that B's QP is transitioned to RTR properly
A and B: get ESTABLISHED
--> I've verified that A and B's QPs are transitioned to RTS
properly
A: posts its normal OMPI receive buffers
A: sends the CTS
A: sometimes gets IBV_WC_RNR_RETRY_EXC_ERR
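(The "I've verified" checks above are basically an ibv_query_qp() of the
QP state -- roughly this kind of thing, not the exact code in the branch:)

#include <infiniband/verbs.h>
#include <stdio.h>

/* Minimal sketch of a QP state check via ibv_query_qp(). */
static void print_qp_state(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    struct ibv_qp_init_attr init_attr;

    if (0 == ibv_query_qp(qp, &attr, IBV_QP_STATE, &init_attr)) {
        const char *s = "other";
        switch (attr.qp_state) {
        case IBV_QPS_INIT: s = "INIT"; break;
        case IBV_QPS_RTR:  s = "RTR";  break;
        case IBV_QPS_RTS:  s = "RTS";  break;
        case IBV_QPS_ERR:  s = "ERR";  break;
        default: break;
        }
        printf("QP 0x%x state: %s\n", qp->qp_num, s);
    }
}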
I have done the following to try to track down what is happening:
- after B calls ibv_post_recv(), call sleep(5) before calling
rdma_accept() -- just to ensure that the buffer really is posted. No
effect; A still gets RNRs.
- verified that A and B's QPs are transitioning into RTR and RTS
properly. They seem to be doing this just fine.
- increased the rnr_retry_count on the new QP. When I set it to 0-6,
the problem still occurs. When I set it to 7 (infinite), *the problem
goes away*.
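For reference, the rnr_retry_count I'm talking about is (as far as I
understand it) the one that librdmacm carries in struct rdma_conn_param
and applies to the QP during connect/accept -- roughly like this (the
other values here are just examples, not what OMPI actually uses):

#include <rdma/rdma_cma.h>
#include <stdint.h>
#include <string.h>

/* Sketch of passing the RNR retry count through the RDMA CM. */
static int connect_with_rnr_retry(struct rdma_cm_id *id, uint8_t rnr_retry)
{
    struct rdma_conn_param param;

    memset(&param, 0, sizeof(param));
    param.responder_resources = 1;
    param.initiator_depth     = 1;
    param.retry_count         = 7;
    param.rnr_retry_count     = rnr_retry;  /* 0 = fail on first RNR, 7 = retry forever */

    return rdma_connect(id, &param);
}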
This last one (setting rnr_retry_count=7) is what kills me. It seems
to imply that there is a race condition in the sequence somewhere, but
I just can't figure out where. Both sides are posting receive
buffers. Both sides are getting ESTABLISHED. Both sides are
transitioning INIT -> RTR -> RTS properly. Why is there an RNR
occurring?
As noted above, this *only* happens when all the cores on my servers
are fully busy. If I only run 1 or 2 MPI processes on each server,
the problem does not occur. This seems fishy, but I don't know
exactly what it means.
This could certainly be a bug in my code, but I just can't figure out
where. Any insights or advice would be greatly appreciated; many
thanks.
--
Jeff Squyres
Cisco Systems