[ofa-general] Question about RDMA CM

Jeff Squyres jsquyres at cisco.com
Tue Sep 16 10:02:14 PDT 2008


Greetings.  I'm trying to finish up RDMA CM support in Open MPI, but
am running into a problem on IB that I have been beating my head
against for about a week and can't figure out (everything seems to
work fine on iWARP).  I know Sean is out on sabbatical; I'm hoping
that someone will have an insight into my problem anyway.

Short version:
==============

Open MPI uses a separate thread for most of its RDMA CM actions to  
ensure that they can respond in a timely manner / not timeout.  All  
the code seems to work fine on iWARP (tested with Chelsio T3's), but  
on some older Mellanox HCAs, I sometimes get RNRs after both sides get  
the ESTABLISHED event and the initiator sends the first message on the  
new RC QP (not SRQ).  I am *sure* that a receive buffer is posted at  
the receiver, and the QPs appear to be transitioning INIT -> RTR ->  
RTS properly.  I cannot figure out why I am getting RNRs.  These RNRs
*only* seem to happen when one or both of the initiator and receiver
servers are fully busy (i.e., all cores are 100% busy).

Longer version:
===============

All the code is currently in a development mercurial branch (not on  
the main Open MPI SVN):

     http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/openib-fd-progress/

As mentioned above, this all seems to work fine on Chelsio T3's.  I'm
getting these RNRs on Mellanox 2-port DDR cards (MT_00A0010001) using
OFED 1.3.1.  I have not tried other IB cards.  All my servers are
4-core Intel machines (slightly old -- pre-Woodcrest).  I can pretty
consistently get the problem to occur when I run a simple MPI "ring"
test program (send a message around in a ring) across 2 servers (4
cores each).  OMPI uses shared memory for on-node communication and
verbs for off-node communication.  The program runs fine when I do not
use RDMA CM, but gets RNRs for some connections when I use RDMA CM
over IB and all 4 cores on both servers are running MPI processes
(i.e., are 100% busy polling for message-passing progress).  The
connectivity looks like this:

node 1
   |--- proc A <- shmem <- proc B <- shmem <- proc C <- shmem <- proc D <-|
 verbs                                                                  verbs
   |--> proc E -> shmem -> proc F -> shmem -> proc G -> shmem -> proc H --|
node 2

Random notes:

1. Open MPI uses a progress thread in its OpenFabrics support for the  
RDMA CM.  rdma_connect() is initiated from the main thread, but all  
other events are handled from the progress thread.
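
Roughly, the progress thread's loop looks like the sketch below
(simplified; the names are made up, not the actual OMPI code):

/* Simplified sketch of the RDMA CM progress-thread loop (names are
   made up, not OMPI's).  rdma_connect() is issued from the main
   thread; everything else is consumed and acked here. */
#include <rdma/rdma_cma.h>

static void *cm_progress_thread(void *arg)
{
    struct rdma_event_channel *channel = arg;
    struct rdma_cm_event *event;

    while (rdma_get_cm_event(channel, &event) == 0) {
        switch (event->event) {
        case RDMA_CM_EVENT_CONNECT_REQUEST:
            /* create QP, post the single CTS receive, rdma_accept() */
            break;
        case RDMA_CM_EVENT_ESTABLISHED:
            /* post the rest of the receive buffers, send the CTS */
            break;
        case RDMA_CM_EVENT_REJECTED:
            /* expected for "wrong direction" connects; see note 2 */
            break;
        default:
            break;
        }
        rdma_ack_cm_event(event);
    }
    return NULL;
}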

2. Due to some uninteresting race conditions, we only allow  
connections to be made "in one direction" (the lower (IP address,  
port) tuple is the initiator).  If the "wrong" MPI process desires to  
make a connection, it makes a bogus QP and initiates an  
rdma_connect().  The receiver process then gets the CONNECT_REQUEST  
event, detects that the connection is coming the "wrong" way,  
initiates the connection in the "right" direction, and then rejects  
the "wrong" connection.  The initiator expects the rejection, and  
simply waits for the CONNECT_REQUEST coming in the other direction.
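
In other words, the CONNECT_REQUEST handler does something like the
sketch below; the direction check and the "connect the right way"
helpers are hypothetical names standing in for OMPI's real logic:

/* Sketch of note 2's "one direction only" rule; the helper names are
   hypothetical, not OMPI's. */
#include <rdma/rdma_cma.h>

extern int  came_from_wrong_direction(struct rdma_cm_id *id);   /* hypothetical
                                           (IP address, port) tuple compare */
extern void connect_in_right_direction(struct rdma_cm_id *id);  /* hypothetical */

static void handle_connect_request(struct rdma_cm_event *event)
{
    struct rdma_cm_id *id = event->id;

    if (came_from_wrong_direction(id)) {
        /* Start the connection going the "right" way, then reject this
           one.  The peer expects the rejection and just waits for our
           CONNECT_REQUEST. */
        connect_in_right_direction(id);
        rdma_reject(id, NULL, 0);
        return;
    }

    /* "Right" direction: create the QP, post the single CTS receive,
       and rdma_accept() (see note 3). */
}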

3. To accommodate iWARP's "initiator must send first" requirement, we  
have the connection sequence in OMPI only post a single receive buffer  
that will later be used for an OMPI-level CTS.  So during the RDMA CM  
wireup, there is only *one* receive buffer posted.  Once the  
ESTABLISHED event arrives, OMPI posts all the rest of its normal  
receive buffers and then sends the CTS to the peer that will consume  
the 1 buffer that was previously posted (which is guaranteed to have  
its CTS buffer posted).  OMPI does not start sending anything else  
until it gets the CTS from its peer.
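
The ordering on the connecting side therefore looks roughly like this
(a sketch; the helper names are made up, not OMPI's actual functions):

/* Sketch of note 3's ordering on the active side (helper names are
   made up).  Exactly one receive -- for the peer's CTS -- is posted
   before connecting; everything else waits for ESTABLISHED. */
#include <rdma/rdma_cma.h>
#include <infiniband/verbs.h>

extern void post_normal_recv_buffers(struct ibv_qp *qp);  /* hypothetical */
extern void send_cts(struct ibv_qp *qp);                  /* hypothetical */

static void start_connect(struct rdma_cm_id *id,
                          struct ibv_recv_wr *cts_recv_wr,
                          struct rdma_conn_param *param)
{
    struct ibv_recv_wr *bad;

    /* 1. The single CTS receive, posted before rdma_connect(). */
    ibv_post_recv(id->qp, cts_recv_wr, &bad);

    /* 2. Start the connection; the ESTABLISHED event is handled by the
          progress thread (note 1). */
    rdma_connect(id, param);
}

static void on_established(struct rdma_cm_id *id)
{
    /* 3. Now post the normal set of receive buffers ... */
    post_normal_recv_buffers(id->qp);

    /* 4. ... and send the CTS, which consumes the one receive the peer
          pre-posted.  Nothing else is sent until the peer's CTS arrives. */
    send_cts(id->qp);
}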

4. OMPI normally sets non-SRQ RC QPs' rnr_retry_count value to 0
because OMPI has its own flow control (read: if we ever get an RNR,
it's an OMPI bug).
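
(With RDMA CM that value is carried in struct rdma_conn_param at
connect/accept time; the sketch below shows where it goes -- the other
values are purely illustrative, not OMPI's actual settings.)

/* Note 4 as it looks when connecting through RDMA CM.  The non-RNR
   values here are illustrative only. */
#include <string.h>
#include <rdma/rdma_cma.h>

static void fill_conn_param(struct rdma_conn_param *param)
{
    memset(param, 0, sizeof(*param));
    param->responder_resources = 1;  /* illustrative */
    param->initiator_depth     = 1;  /* illustrative */
    param->retry_count         = 7;  /* transport retries, not RNR */
    param->rnr_retry_count     = 0;  /* 0 normally: any RNR is an OMPI bug.
                                        7 (infinite) hides the problem below. */
}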

Consider a scenario where MPI process A wants to connect to MPI  
process B (on different servers).  Let's assume that A->B is the  
"right" direction for simplicity.  Here's what happens:

A: creates QP, posts the 1 CTS receive buffer, and calls rdma_connect()
B: gets CONNECT_REQUEST, creates QP, posts the 1 CTS receive buffer,  
and calls rdma_accept()
    --> I've verified that B's QP is transitioned to RTR properly
A and B: get ESTABLISHED
    --> I've verified that A and B's QPs are transitioned to RTS  
properly
A: posts its normal OMPI receive buffers
A: sends the CTS
A: sometimes gets IBV_WC_RNR_RETRY_EXC_ERR
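
(The RNR shows up as a failed send completion when A polls its send
CQ -- standard verbs, nothing OMPI-specific; a sketch:)

/* How the failure in the last step shows up when polling the send CQ. */
#include <stdio.h>
#include <infiniband/verbs.h>

static void drain_send_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status == IBV_WC_RNR_RETRY_EXC_ERR) {
            /* The peer reported no posted receive buffer for our CTS
               send -- the RNR described above. */
            fprintf(stderr, "RNR retry exceeded on QP 0x%x (wr_id %llu)\n",
                    wc.qp_num, (unsigned long long) wc.wr_id);
        }
    }
}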

I have done the following to try to track down what is happening:

- added a sleep(5) at B between ibv_post_recv() and rdma_accept() --
just to ensure that the buffer really is posted before the connection
completes.  No effect; A still gets RNRs.

- verified that A and B's QPs are transitioning into RTR and RTS
properly (see the sketch after this list).  They seem to be doing this
just fine.

- increased the rnr_retry_count on the new QP.  When I set it to 0-6,  
the problem still occurs.  When I set it to 7 (infinite), *the problem  
goes away*.
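
A minimal way to do the RTR/RTS check mentioned above is to query the
QP state directly with the standard verbs call (a sketch):

/* Query the QP state via standard verbs.  Expect IBV_QPS_RTR on the
   passive side after rdma_accept() and IBV_QPS_RTS on both sides
   after the ESTABLISHED event. */
#include <stdio.h>
#include <infiniband/verbs.h>

static void print_qp_state(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    struct ibv_qp_init_attr init_attr;

    if (ibv_query_qp(qp, &attr, IBV_QP_STATE, &init_attr) == 0) {
        printf("QP 0x%x state: %d (RTR=%d, RTS=%d)\n",
               qp->qp_num, attr.qp_state, IBV_QPS_RTR, IBV_QPS_RTS);
    }
}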

This last one (setting rnr_retry_count=7) is what kills me.  It seems  
to imply that there is a race condition in the sequence somewhere, but  
I just can't figure out where.  Both sides are posting receive  
buffers.  Both sides are getting ESTABLISHED.  Both sides are  
transitioning INIT -> RTR -> RTS properly.  Why is there an RNR  
occurring?

As noted above, this *only* happens when all the cores on my servers  
are fully busy.  If I only run 1 or 2 MPI processes on both servers,  
the problem does not occur.  This seems fishy, but I don't know  
exactly what it means.

This could certainly be a bug in my code, but I just can't figure out  
where.  Any insights or advice would be greatly appreciated; many  
thanks.

-- 
Jeff Squyres
Cisco Systems



