[ofa-general] Problem with dropped CQE's on RDMA CM channel

Mike Heffner mike.heffner at evergrid.com
Tue Mar 20 12:52:56 PDT 2007


Forgot to mention that this is with OFED 1.1 on a SUSE 10 box:

Linux amd13 2.6.16.21-0.8-smp #1 SMP Mon Jul 3 18:25:39 UTC 2006 x86_64 
x86_64 x86_64 GNU/Linux

with a "Mellanox Technologies MT23108 InfiniHost (rev a1)" PCI-X card 
with firmware version 3.5.0.

Mike Heffner wrote:
> Hi,
> 
> I'm writing a program that allows two clients to communicate over an RC 
> channel that is connected using the RDMA CM. To negotiate a clean 
> shutdown of the channel, both clients send IBV_WR_SENDs with the 
> IBV_SEND_SIGNALED bit set. The connection is only rdma_disconnect()'d 
> when a client receives the CQE from its signaled send and the CQE from 
> the peer's incoming IBV_WR_SEND (i.e., when the peer receives the send). 
> This ensures that both clients have conceptually called "close()" on 
> both ends of the connection before the connection is torn down and the 
> QP moved into the error state with rdma_disconnect().
> 
> The problem I'm seeing is that occasionally one peer will not receive 
> both CQEs while the other peer has successfully received both and has 
> called rdma_disconnect(). What's odd is that one client may not receive 
> the local CQE for the "signaled" IBV_WR_SEND send even though the peer 
> has received the client's send. Since one peer does not receive both CQE 
> events, the connection remains in an open state and does not get cleaned 
> up appropriately.
> 
> Can you call rdma_disconnect() immediately after posting sends on the 
> QP? I don't see any CQEs come back with errors, but they appear to 
> "disappear" and never get signaled on one peer side. Are there any 
> potential race issues to avoid here (it only happens about one out of 
> every 100 connections)?
> 
> Any assistance would be greatly appreciated.
> 
> 
> Thanks,
> 
> Mike
> 
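
For reference, the shutdown rule described in the quoted message can be 
sketched roughly as below. This is only a minimal illustration of the 
bookkeeping, not the actual program: struct shutdown_state, enum cqe_kind 
and shutdown_on_cqe() are hypothetical names made up for this sketch, and 
the code deliberately makes no verbs or RDMA CM calls. It assumes the 
caller has already polled a successful CQE and classified it as either 
the local signaled send or the peer's incoming IBV_WR_SEND.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-connection state; not part of libibverbs or librdmacm. */
struct shutdown_state {
    bool local_send_done;  /* CQE seen for our own signaled IBV_WR_SEND */
    bool peer_send_seen;   /* CQE seen for the peer's incoming IBV_WR_SEND */
};

/* Hypothetical classification of a successfully polled CQE. */
enum cqe_kind { CQE_LOCAL_SEND, CQE_PEER_SEND };

/*
 * Record one completion. Returns true only once both completions have
 * been seen, which is the point at which the scheme described above
 * would go on to call rdma_disconnect().
 */
static bool shutdown_on_cqe(struct shutdown_state *s, enum cqe_kind kind)
{
    assert(s != NULL);

    if (kind == CQE_LOCAL_SEND)
        s->local_send_done = true;
    else
        s->peer_send_seen = true;

    return s->local_send_done && s->peer_send_seen;
}
```

With this bookkeeping, a peer that never sees one of the two CQEs never 
gets a true return value, which matches the symptom of the connection 
staying open on that side.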



-- 

   Mike Heffner <mike.heffner at evergrid.com>
   EverGrid Software
   Blacksburg, VA USA

   Voice: (540) 443-3500 x603


