[ofa-general] missed cq event

Philip Frey1 PHF at zurich.ibm.com
Tue Jun 10 07:52:00 PDT 2008


Steve, thanks for your advice.

Is it possible that there is a bug in OFED 1.3 with regard to
non-signaled send work requests?
I noticed that when I post send work requests onto my send queue, it
eventually fills up until I cannot post sends anymore.
This happens with the Chelsio T3 RNIC and OFED 1.3 whenever I post send
WRs that have their send_flags set to 0. It does not happen when I post
sends with IBV_SEND_SIGNALED.
The CQ is empty in the case of non-signaled WRs (as expected), but they
somehow seem to be stuck on the send queue.

I use the following code:

static struct ibv_send_wr tx_wr, *bad_wr;

/* create send work request */
tx_wr.wr_id = tx_wr_id++;
tx_wr.next = NULL;
tx_wr.sg_list = sg_list;
tx_wr.num_sge = num_sge;
tx_wr.opcode = IBV_WR_SEND;
/* 0 = non-signaled: no CQE on success (QP created with sq_sig_all = 0) */
tx_wr.send_flags = 0;

/* post send work request */
ret = ibv_post_send(qp, &tx_wr, &bad_wr);
if (ret) {
        /* post failed: ret is an errno value, bad_wr points at the failed WR */
}

I learned that it might be necessary to post a signaled send WR after
posting a number of non-signaled ones in order to clean up the SQ. Is
that the case, and is there no way to post non-signaled WRs that do not
get stuck on the SQ?
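
To make the question concrete, here is how I picture that workaround.
This is only a toy model in plain C, not libibverbs code: sq_model,
sq_init, sq_post and sq_reap are names made up for this mail, and the
model assumes that polling one signaled completion also frees the slots
of the non-signaled WRs posted before it. In a real program sq_post
would correspond to setting IBV_SEND_SIGNALED on every Nth
ibv_post_send(), and sq_reap to a successful ibv_poll_cq().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a send queue, for illustration only. */
struct sq_model {
    unsigned depth;            /* total SQ slots */
    unsigned interval;         /* signal every `interval`-th WR */
    unsigned in_flight;        /* slots occupied by posted WRs */
    unsigned since_signal;     /* WRs posted since the last signaled one */
    unsigned signaled_pending; /* signaled WRs not yet reaped */
};

static void sq_init(struct sq_model *sq, unsigned depth, unsigned interval)
{
    /* The interval must fit in the queue, or no signaled WR is ever posted. */
    assert(interval > 0 && interval <= depth);
    sq->depth = depth;
    sq->interval = interval;
    sq->in_flight = 0;
    sq->since_signal = 0;
    sq->signaled_pending = 0;
}

/* Model posting one WR. Returns false if the SQ is full.
 * On success, *signaled says whether this WR should request a completion. */
static bool sq_post(struct sq_model *sq, bool *signaled)
{
    if (sq->in_flight == sq->depth)
        return false;
    sq->in_flight++;
    sq->since_signal++;
    *signaled = (sq->since_signal == sq->interval);
    if (*signaled) {
        sq->since_signal = 0;
        sq->signaled_pending++;
    }
    return true;
}

/* Model polling one signaled completion from the CQ: it retires that WR
 * and the (interval - 1) non-signaled WRs posted before it.
 * Returns false if no signaled WR is outstanding. */
static bool sq_reap(struct sq_model *sq)
{
    if (sq->signaled_pending == 0)
        return false;
    sq->signaled_pending--;
    sq->in_flight -= sq->interval;
    return true;
}
```

With depth 8 and interval 4, the model accepts eight posts, refuses the
ninth, and accepts posts again only after one call to sq_reap, which is
the behaviour I would expect from the real SQ if the workaround is
right.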

Cheers,
 Philip

general-bounces at lists.openfabrics.org wrote on 09.06.2008 18:28:08:

> Philip Frey1 wrote: 
> 
> You are right. Thanks! 
> 
> I have yet another issue:
> 
> Sometimes I get the following message in /var/log/messages of the
> local host:
> 
> post_qp_event - AE qpid 0x4e0 opcode 3 status 0x13 type 0 wrid.hi 0x0 wrid.lo 0x65000000 
> 
> I was looking for the status and opcode in the source and found that 
> opcode 3 means T3_SEND and status 0x13 means TPT_ERR_OUT_OF_RQE. 
> At the remote host I get an opcode 7 (T3_TERMINATE) and status 0x0 
> (SUCCESS).
> 
> Clearly someone is running out of Receive Queue Elements. The error 
> occurred when doing an ibv_post_send() at the local host. Is this a 
> coincidence, or does the local host somehow know that there are not 
> enough RQEs available at the remote host? In other words, does the 
> TPT_ERR_OUT_OF_RQE refer to the local or to the remote receive queue?
> 
> You have to consider the type too. type 0 indicates ingress errors, 
> and type 1 indicates egress.
> 
> So the host that logged opcode 3, status 0x13, type 0 received an 
> incoming SEND but there were no RECV's posted at that time.  The 
> result is a connection termination, which results in the TERMINATE 
> event on the peer side.
> 
> Steve.