[ofa-general] missed cq event
Philip Frey1
PHF at zurich.ibm.com
Tue Jun 10 07:52:00 PDT 2008
Steve, thanks for your advice.

Is it possible that there is a bug in OFED 1.3 with regard to
non-signaled send work requests?

I noticed that when I post send work requests onto my send queue, it
eventually fills up until I cannot post sends anymore. This happens
with the Chelsio T3 RNIC and OFED 1.3 whenever I post send WRs that
have their flags set to 0. It does not happen, though, when I post
sends with IBV_SEND_SIGNALED. The CQ is empty in the case of
non-signaled WRs (as expected), but they somehow seem to be stuck on
the send queue.

I use the following code:

    static struct ibv_send_wr tx_wr, *bad_wr;

    /* create send work request */
    tx_wr.wr_id      = tx_wr_id++;
    tx_wr.next       = NULL;
    tx_wr.sg_list    = sg_list;
    tx_wr.num_sge    = num_sge;
    tx_wr.opcode     = IBV_WR_SEND;
    tx_wr.send_flags = 0;    /* non-signaled */

    /* post send work request */
    ret = ibv_post_send(qp, &tx_wr, &bad_wr);
    if (ret) {
        /* error */
    }

I learned that it might be necessary to post a signaled send WR after
posting a number of non-signaled ones in order to clean up the SQ. Is
that the case, and is there no way to post non-signaled WRs that do
not get stuck on the SQ?

Cheers,
Philip
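
The question above can be sketched as plain bookkeeping. This is a hypothetical model, not libibverbs code. It assumes that the send queue slots of non-signaled WRs are only released once a later signaled WR on the same queue has completed and its completion has been polled from the CQ, which is how the verbs interface is commonly described; whether the T3 provider in OFED 1.3 behaves exactly this way is not confirmed here. The struct send_tracker and the tracker_* functions are invented names for illustration. Only IBV_SEND_SIGNALED, ibv_post_send(), ibv_poll_cq() and max_send_wr, which appear in the comments, are real verbs names.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical bookkeeping for a send queue that mixes signaled and
 * non-signaled work requests. Assumption: every interval-th WR is
 * posted with IBV_SEND_SIGNALED, and polling its completion releases
 * that WR plus the non-signaled WRs posted before it.
 */
struct send_tracker {
    unsigned sq_depth;      /* send queue capacity (max_send_wr) */
    unsigned interval;      /* signal every interval-th WR; 1 <= interval <= sq_depth */
    unsigned outstanding;   /* posted WRs whose slots are not yet released */
    unsigned since_signal;  /* WRs posted since the last signaled one */
};

/* True if the send queue still has a free slot. */
bool tracker_can_post(const struct send_tracker *t)
{
    return t->outstanding < t->sq_depth;
}

/*
 * Account for one WR about to be posted. Returns true if that WR
 * should be signaled, for example:
 *     tx_wr.send_flags = tracker_post(&t) ? IBV_SEND_SIGNALED : 0;
 */
bool tracker_post(struct send_tracker *t)
{
    assert(t->interval >= 1 && t->interval <= t->sq_depth);
    assert(tracker_can_post(t));
    t->outstanding++;
    if (++t->since_signal == t->interval) {
        t->since_signal = 0;
        return true;
    }
    return false;
}

/*
 * Call once for each send completion returned by ibv_poll_cq(). One
 * completion accounts for a whole batch of interval WRs.
 */
void tracker_complete(struct send_tracker *t)
{
    assert(t->outstanding >= t->interval);
    t->outstanding -= t->interval;
}
```

With sq_depth 6 and interval 4, the fourth post is signaled, the queue is full after six posts, and one polled completion frees four slots. If no WR is ever signaled, nothing in this model ever decrements outstanding, which matches the symptom described in the message.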
general-bounces at lists.openfabrics.org wrote on 09.06.2008 18:28:08:
> Philip Frey1 wrote:
>
>> You are right. Thanks!
>>
>> I have yet another issue:
>>
>> Sometimes I get the following message in /var/log/messages of the
>> local host:
>>
>> post_qp_event - AE qpid 0x4e0 opcode 3 status 0x13 type 0 wrid.hi 0x0 wrid.lo 0x65000000
>>
>> I was looking for the status and opcode in the source and found that
>> opcode 3 means T3_SEND and status 0x13 means TPT_ERR_OUT_OF_RQE.
>> At the remote host I get an opcode 7 (T3_TERMINATE) and status 0x0
>> (SUCCESS).
>>
>> Clearly someone is running out of Receive Queue Elements. The error
>> occurred when doing an ibv_post_send() at the local host. Is this a
>> coincidence, or does the local host somehow know that there are not
>> enough RQEs available at the remote host? In other words, does
>> TPT_ERR_OUT_OF_RQE refer to the local or to the remote receive queue?
>
> You have to consider the type too. Type 0 indicates ingress errors,
> and type 1 indicates egress.
>
> So the host that logged opcode 3, status 0x13, type 0 received an
> incoming SEND but there were no RECVs posted at that time. The
> result is a connection termination, which results in the TERMINATE
> event on the peer side.
>
> Steve.
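
The reading of the event line given in the thread above can be written down as a small decoder. This is a sketch only: struct ae_event and the ae_* functions are hypothetical names, the numeric mappings are just the ones mentioned in this thread (opcode 3 and 7, status 0x0 and 0x13, type 0 and 1), and it assumes the message appears on a single line in the format quoted, with opcode and type printed in decimal.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields of the post_qp_event line quoted in the thread (hypothetical struct). */
struct ae_event {
    unsigned qpid;
    unsigned opcode;
    unsigned status;
    unsigned type;
};

/* Parse one log line; returns 0 on success, -1 if the line does not match. */
int ae_parse(const char *line, struct ae_event *ev)
{
    assert(line != NULL && ev != NULL);
    /* Skip any syslog prefix (timestamp, host name) before the message. */
    const char *msg = strstr(line, "post_qp_event");
    if (msg == NULL)
        return -1;
    int n = sscanf(msg, "post_qp_event - AE qpid %x opcode %u status %x type %u",
                   &ev->qpid, &ev->opcode, &ev->status, &ev->type);
    return n == 4 ? 0 : -1;
}

/* Only the opcodes named in the thread are known here. */
const char *ae_opcode_name(unsigned opcode)
{
    switch (opcode) {
    case 3:  return "T3_SEND";
    case 7:  return "T3_TERMINATE";
    default: return "unknown";
    }
}

/* Only the status codes named in the thread are known here. */
const char *ae_status_name(unsigned status)
{
    switch (status) {
    case 0x0:  return "SUCCESS";
    case 0x13: return "TPT_ERR_OUT_OF_RQE";
    default:   return "unknown";
    }
}

/* Type 0 is an ingress error, type 1 an egress error. */
const char *ae_direction(unsigned type)
{
    switch (type) {
    case 0:  return "ingress";
    case 1:  return "egress";
    default: return "unknown";
    }
}
```

For the line quoted in the thread this gives T3_SEND, TPT_ERR_OUT_OF_RQE and ingress, i.e. the host that logged the event is the one that had no RECV posted when the SEND arrived.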