[ofa-general] Problem with dropped CQE's on RDMA CM channel
Mike Heffner
mike.heffner at evergrid.com
Tue Mar 20 12:49:00 PDT 2007
Hi,
I'm writing a program that allows two clients to communicate over an RC
channel that is connected using the RDMA CM. To negotiate a clean
shutdown of the channel, both clients send IBV_WR_SENDs with the
IBV_SEND_SIGNALED bit set. The connection is only rdma_disconnect()'d
once a client has received both the CQE from its own signaled send and
the CQE from the peer's incoming IBV_WR_SEND (i.e., when the peer
receives the send). This ensures that both clients have conceptually
called "close()" on both ends of the connection before the connection is
torn down and the QP is moved into the error state with rdma_disconnect().
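To make the handshake concrete, the bookkeeping on each client looks
roughly like the sketch below. This is a simplified illustration and not
the actual program: the names shutdown_event, shutdown_state and
shutdown_on_event are hypothetical, and the mapping from the ibv_wc
entries returned by ibv_poll_cq() to the two event kinds is assumed
rather than shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event kinds. In the real program these would be derived
 * from the work completions returned by ibv_poll_cq(). */
enum shutdown_event {
    LOCAL_SEND_COMPLETED, /* CQE for our own signaled IBV_WR_SEND */
    PEER_SEND_RECEIVED    /* CQE for the receive that took the peer's IBV_WR_SEND */
};

/* Per-connection shutdown bookkeeping (hypothetical structure). */
struct shutdown_state {
    bool local_send_done; /* our signaled send has completed */
    bool peer_send_seen;  /* the peer's shutdown send has arrived */
    bool disconnected;    /* disconnect has already been requested */
};

/* Record one completion. Returns true exactly once, at the point where
 * both CQEs have been seen and the caller should call rdma_disconnect().
 * The two completions may arrive in either order. */
static bool shutdown_on_event(struct shutdown_state *s, enum shutdown_event ev)
{
    assert(s != NULL);

    if (ev == LOCAL_SEND_COMPLETED)
        s->local_send_done = true;
    else if (ev == PEER_SEND_RECEIVED)
        s->peer_send_seen = true;

    if (s->local_send_done && s->peer_send_seen && !s->disconnected) {
        s->disconnected = true;
        return true;
    }
    return false;
}
```

The CQ polling loop would call shutdown_on_event() for each relevant
completion and only call rdma_disconnect() when it returns true. A
client that never sees one of the two CQEs therefore never reaches the
disconnect step.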
The problem I'm seeing is that occasionally one peer will not receive
both CQEs, while the other peer has successfully received both and has
called rdma_disconnect(). What's odd is that one client may not receive
the local CQE for its signaled IBV_WR_SEND even though the peer has
received the client's send. Since one peer does not receive both CQEs,
the connection remains in an open state and does not get cleaned up
appropriately.
Can you call rdma_disconnect() immediately after posting sends on the
QP? I don't see any CQEs come back with errors, but they appear to
"disappear" and never get signaled on one peer's side. Are there any
potential race conditions to avoid here? (It only happens on about one
out of every 100 connections.)
Any assistance would be greatly appreciated.
Thanks,
Mike
--
Mike Heffner <mike.heffner at evergrid.com>
EverGrid Software
Blacksburg, VA USA
Voice: (540) 443-3500 x603