[ofa-general] Questions about IPOIB handle last WQE event

Tue Jul 22 21:30:31 PDT 2008

Thanks Roland for your prompt reply.

On Tue, 2008-07-22 at 19:59 -0700, Roland Dreier wrote:

> I don't understand what the problem is for ehca.  Once the QP is in the
> error state, posting a WR to the send queue should complete immediately
> with a flush error status, and that completion should trigger IPoIB to
> clean up the QP.  What goes wrong with ehca?
>  > 			LAST WQE reached event for RX QP0
>  > 			post last WR for QP0
>  > 			poll_cq
>  > 			below only applies to Mellanox, ehca won't see 
>  > 			last WQ in SRQ
>  > 			----------------
>  > 			see last WR for QP0
> 
> So you're saying that the send request doesn't complete for ehca?  That
> seems like it must be a bug somewhere in the ehca
> driver/firmware/hardware.  This has nothing to do with SRQ or last WQE
> reached events-- it is the basic requirement that send requests posted
> when a QP is in the error state complete with a flush error status.

Section 11-5.2.5:
-----------------
If the HCA supports SRQ, for RC and UD service, the CI shall generate a
Last WQE Reached Affiliated Asynchronous Event on a QP that is in the
Error State and is associated with an SRQ when either:
• a CQE is generated for the last WQE, or
• the QP gets in the Error State and there are no more WQEs on the
  RQ.

I thought ehca follows the second condition, "the QP gets in the Error
State and there are no more WQEs on the RQ", so no CQE is generated for
the last WQE.
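
To make my reading of the two bullets concrete, here is a rough sketch
in C. It is only a toy model: struct model_qp, its fields, and both
function names are made up for illustration and are not part of any
driver or verbs API. It assumes "WQEs on the RQ" means receive WQEs the
QP still has outstanding.

```c
#include <stdbool.h>

/* Toy model only: all names below are hypothetical, not a driver API. */
enum model_state { MODEL_RTS, MODEL_ERROR };

struct model_qp {
	enum model_state state;
	bool uses_srq;        /* QP is associated with an SRQ */
	unsigned int rq_wqes; /* receive WQEs still outstanding on this QP */
};

/*
 * Second bullet: the QP gets in the Error State and there are no more
 * WQEs on the RQ. The event is raised without any CQE being generated.
 */
static bool last_wqe_on_error_transition(const struct model_qp *qp)
{
	return qp->uses_srq && qp->state == MODEL_ERROR && qp->rq_wqes == 0;
}

/*
 * First bullet: a CQE is generated for the last WQE. Call once per CQE;
 * returns true only when that CQE consumed the final outstanding WQE.
 */
static bool last_wqe_on_cqe(struct model_qp *qp)
{
	if (!qp->uses_srq || qp->state != MODEL_ERROR || qp->rq_wqes == 0)
		return false;
	qp->rq_wqes--;
	return qp->rq_wqes == 0;
}
```

In this model, a QP that enters the error state with rq_wqes == 0 takes
the second path, which is the case I believe applies to ehca.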

>  > Since nonSRQ doesn't handle async event, it never releases QPs, 128
>  > connections will run out soon even in a two nodes cluster by repeating
>  > above steps. ( This is another bug, I will submit a fix).
> 
> Yes, if non-SRQ doesn't free QPs, then this is another bug.
> 
>  > 2. If node-1 fails to send DREQ for any reason to remote, like node-1
>  > shutdown, then RX QP in node-2 will be put in the error list after
>  > around 21 mins 
>  > (IPOIB_CM_RX_TIMEOUT + IPOIB_CM_RX_DELAY = 5*256*HZ)
>  > #define IPOIB_CM_RX_TIMEOUT     (2 * 256 * HZ)
>  > #define IPOIB_CM_RX_DELAY       (3 * 256 * HZ)
> 
>  > The timer seems too long for releasing stale QP resources; we could run
>  > out of QPs in a large cluster even for mthca/mlx4.

In a 2K-node cluster where each node has 2 ports, the maximum number of
IPoIB-CM RX QPs is 4K. Even if only one node crashes, connections to
that node will be lost for 21 minutes after it reboots.
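
For reference, the 21 minutes comes straight from the two macros quoted
above. A small sketch of the arithmetic, with HZ factored out so the
values are in seconds (HZ jiffies is one second); the RX_*_SECS names
and the helper function are mine, only the 2 * 256 and 3 * 256 factors
come from the IPoIB defines:

```c
/* Hypothetical names; only the numeric factors come from the IPoIB macros. */
#define RX_TIMEOUT_SECS (2 * 256) /* IPOIB_CM_RX_TIMEOUT / HZ */
#define RX_DELAY_SECS   (3 * 256) /* IPOIB_CM_RX_DELAY / HZ */

/* Worst-case time before a stale RX QP is moved to the error list. */
static unsigned int stale_rx_qp_secs(void)
{
	return RX_TIMEOUT_SECS + RX_DELAY_SECS;
}
```

That is 1280 seconds, a little over 21 minutes.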

> It is a long timeout, but how often does this case happen?  When a node
> crashes?
> 
>  > 1. Whether it's a MUST to put QP in error status before posting last WR?
>  > if it's a MUST, why?
> 
> Yes, it's a must because we don't want the send executed, we want it to
> complete with an error status.

In this case, the sending side's TX QP is already gone, so no send from
the remote side should reach this RX QP, and there is no harm in
delivering any outstanding CQEs in the CQ to the consumer. So it's OK not
to put the QP in error status before posting the last WR, right? Does the
IB spec say anywhere that it's a MUST?
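
To be sure I understand the behaviour you describe, here it is as a toy
model in C. This is not the verbs API: the enums and model_post_send()
are hypothetical names for illustration, and the model ignores what
happens to an executed send when the remote QP no longer exists.

```c
/* Toy model only: hypothetical names, not the verbs API. */
enum model_qp_state { MODEL_QP_RTS, MODEL_QP_ERR };

enum model_wr_result {
	MODEL_WR_EXECUTED, /* the HCA actually tries to send */
	MODEL_WR_FLUSHED   /* completes with a flush error, nothing is sent */
};

/* What happens to a WR posted to the send queue in a given QP state. */
static enum model_wr_result model_post_send(enum model_qp_state state)
{
	return state == MODEL_QP_ERR ? MODEL_WR_FLUSHED : MODEL_WR_EXECUTED;
}
```

So the drain WR only serves as a pure marker when the QP is already in
the error state; my question is whether the executed case does any real
harm here.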

>  > 2. Last WQE event is only generated once for each QP even if IPoIB sets
>  > the QP into error status and the CI surfaces a Local Work Queue
>  > Catastrophic Error on the same QP at the same time, is that right?
> 
> Umm, a local work queue catastrophic error means something went wrong in
> the driver/firmware/hardware -- a consumer shouldn't be able to cause
> this type of event.  Finding out why this catastrophic error happens
> should help debug things.

Sorry, my question was not clear. I meant to ask whether it's possible to
get two last WQE reached events on the same QP: one from the consumer
setting the QP to error status, and one from a driver/FW/HW catastrophic
error, if any.

Thanks
Shirley