[ofa-general] Problem with dropped CQE's on RDMA CM channel

Mike Heffner mike.heffner at evergrid.com
Tue Mar 20 12:49:00 PDT 2007


Hi,

I'm writing a program that allows two clients to communicate over an RC 
channel that is connected using the RDMA CM. To negotiate a clean 
shutdown of the channel, both clients post an IBV_WR_SEND with the 
IBV_SEND_SIGNALED flag set. A client calls rdma_disconnect() only after 
it has received both the CQE for its own signaled send and the receive 
CQE for the peer's incoming IBV_WR_SEND (i.e., the peer's send has 
arrived). This ensures that both clients have conceptually called 
"close()" on their ends of the connection before it is torn down and 
the QP is moved into the error state by rdma_disconnect().
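
To make the shutdown rule above concrete, here is a minimal sketch of 
the per-connection bookkeeping, written without any libibverbs calls so 
it stands on its own. The struct, enum, and function names are 
hypothetical and only for illustration; in the real program the two 
events would come from polling the CQ, and rdma_disconnect() would be 
called once the function returns true.

```c
#include <stdbool.h>

/* Hypothetical per-connection shutdown state (illustrative names only). */
struct shutdown_state {
    bool local_send_completed;  /* CQE seen for our own signaled send */
    bool peer_close_received;   /* receive CQE seen for the peer's send */
};

/* The two completion events the shutdown handshake waits for. */
enum cqe_kind { CQE_LOCAL_SEND, CQE_PEER_RECV };

/*
 * Record one completion and report whether both have now been seen.
 * A true return is the point at which the connection may be
 * disconnected under the rule described above.
 */
static bool shutdown_on_cqe(struct shutdown_state *s, enum cqe_kind kind)
{
    if (kind == CQE_LOCAL_SEND)
        s->local_send_completed = true;
    else
        s->peer_close_received = true;

    return s->local_send_completed && s->peer_close_received;
}
```

The order in which the two completions arrive does not matter here; the 
connection is only considered ready to disconnect once both flags are 
set.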

The problem I'm seeing is that occasionally one peer never receives 
both CQEs, while the other peer has received both and has already 
called rdma_disconnect(). What's odd is that a client may not receive 
the local CQE for its signaled IBV_WR_SEND even though the peer has 
received that send. Since one peer never sees both CQEs, the connection 
remains open on that side and does not get cleaned up.

Can you call rdma_disconnect() immediately after posting sends on the 
QP? I don't see any CQEs come back with an error status; the missing 
ones simply "disappear" and are never delivered on one peer's side. Are 
there any potential race conditions to avoid here? It only happens in 
about one out of every 100 connections.

Any assistance would be greatly appreciated.


Thanks,

Mike

-- 

   Mike Heffner <mike.heffner at evergrid.com>
   EverGrid Software
   Blacksburg, VA USA

   Voice: (540) 443-3500 x603


