[ofa-general] Questions about IPoIB handling of the last WQE event
Roland Dreier
rdreier at cisco.com
Tue Jul 22 19:59:05 PDT 2008
[ ... IB spec stuff about last WQE reached events ... ]
> The IPoIB-CM implementation takes the approach of posting another WR
> that completes on the same CQ and waiting for this WR to return as a WC.
> IPoIB first puts the QP into the error state, then waits for the last
> WQE event; in the async event handler it posts a drain WR, and the QP
> resources are released when the last CQE is generated. However, this
> works for ConnectX but not for ehca.
>
> The ehca implementation follows Section 11-5.2.5: the event is
> generated when the QP gets into the Error state and there are no more
> WQEs on the RQ. So these QP resources are never released, which causes
> a QP resource leak; no QPs can be released at all. When the maximum
> number of QPs is reached (the default is 128 for non-SRQ and 4K for
> SRQ), no more new connections can be built. Nodes can't even be
> reached.
I don't understand what the problem is for ehca. Once the QP is in the
error state, posting a WR to the send queue should complete immediately
with a flush error status, and that completion should trigger IPoIB to
clean up the QP. What goes wrong with ehca?
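For reference, the drain idiom being discussed looks roughly like the
sketch below. This is a minimal illustration, not the actual ipoib_cm
code; IPOIB_CM_RX_DRAIN_WRID here is just an illustrative wr_id cookie:

#include <rdma/ib_verbs.h>

/* Illustrative wr_id cookie used to recognize the drain completion. */
#define IPOIB_CM_RX_DRAIN_WRID 0xffffffff

static int post_drain_wr(struct ib_qp *qp)
{
        struct ib_qp_attr qp_attr = { .qp_state = IB_QPS_ERR };
        struct ib_send_wr drain_wr = {
                .wr_id      = IPOIB_CM_RX_DRAIN_WRID,
                .opcode     = IB_WR_SEND,
                .num_sge    = 0,
                .send_flags = IB_SEND_SIGNALED,
        };
        struct ib_send_wr *bad_wr;
        int ret;

        /* Move the QP to the error state first, so the send below is
         * never executed on the wire. */
        ret = ib_modify_qp(qp, &qp_attr, IB_QP_STATE);
        if (ret)
                return ret;

        /* All the HCA may do with this WR now is flush it back as a
         * completion on the QP's send CQ. */
        return ib_post_send(qp, &drain_wr, &bad_wr);
}

Because the QP is already in the error state when the WR is posted, the
flushed completion of this WR is what signals that everything ahead of
it has already been reaped.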
> LAST WQE reached event for RX QP0
> post last WR for QP0
> poll_cq
> below only applies to Mellanox; ehca won't see the last WR in the SRQ
> ----------------
> see last WR for QP0
So you're saying that the send request doesn't complete for ehca? That
seems like it must be a bug somewhere in the ehca
driver/firmware/hardware. This has nothing to do with SRQ or last WQE
reached events -- it is the basic requirement that send requests posted
when a QP is in the error state complete with a flush error status.
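Concretely, the cleanup is expected to be driven by an ordinary CQ poll
seeing that flushed drain completion, along the lines of the sketch
below (reusing the illustrative IPOIB_CM_RX_DRAIN_WRID cookie from the
earlier sketch; the actual teardown is elided):

static void drain_cq_once(struct ib_cq *cq)
{
        struct ib_wc wc;

        while (ib_poll_cq(cq, 1, &wc) > 0) {
                if (wc.wr_id == IPOIB_CM_RX_DRAIN_WRID) {
                        /* The QP was already in the error state, so
                         * the only status we should ever see here is
                         * a flush error. */
                        WARN_ON(wc.status != IB_WC_WR_FLUSH_ERR);
                        /* ... safe to reap the drained RX QP(s) ... */
                        continue;
                }
                /* ... normal RX/TX completion handling ... */
        }
}

If that flushed completion never shows up, as reported for ehca above,
this path never runs and the QP is leaked.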
> Since non-SRQ doesn't handle the async event, it never releases QPs;
> the 128 connections will run out soon even in a two-node cluster by
> repeating the above steps. (This is another bug; I will submit a fix.)
Yes, if non-SRQ doesn't free QPs, then this is another bug.
> 2. If node-1 fails to send a DREQ to the remote node for any reason,
> e.g. node-1 shuts down, then the RX QP on node-2 will only be put on
> the error list after around 21 minutes
> (IPOIB_CM_RX_TIMEOUT + IPOIB_CM_RX_DELAY = 5 * 256 * HZ jiffies = 1280 s):
> #define IPOIB_CM_RX_TIMEOUT (2 * 256 * HZ)
> #define IPOIB_CM_RX_DELAY (3 * 256 * HZ)
> This timer seems too long for releasing stale QP resources; we could
> run out of QPs in a large cluster even with mthca/mlx4.
It is a long timeout, but how often does this case happen? When a node
crashes?
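For illustration, the ~21 minutes follows directly from the two
constants quoted above (HZ jiffies equal one second, so 5 * 256 * HZ
jiffies = 1280 s), and a stale-connection reaper along these lines is
what would enforce it. This is a rough sketch with assumed names
(cm_rx_conn, last_seen, rx_conn_list); list locking and work
rescheduling are omitted:

#include <linux/jiffies.h>
#include <linux/list.h>
#include <rdma/ib_verbs.h>

#define IPOIB_CM_RX_TIMEOUT (2 * 256 * HZ)

struct cm_rx_conn {                     /* stand-in for the per-RX-QP state */
        struct list_head list;
        struct ib_qp *qp;
        unsigned long last_seen;        /* jiffies of last activity */
};

static LIST_HEAD(rx_conn_list);

/* Move anything idle longer than IPOIB_CM_RX_TIMEOUT to the error
 * state so the last-WQE/drain machinery reclaims it; the additional
 * IPOIB_CM_RX_DELAY in the quoted figure is scheduling slack before
 * this check gets to run. */
static void reap_stale_rx(void)
{
        struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
        struct cm_rx_conn *rx;

        list_for_each_entry(rx, &rx_conn_list, list)
                if (time_after_eq(jiffies, rx->last_seen + IPOIB_CM_RX_TIMEOUT))
                        ib_modify_qp(rx->qp, &attr, IB_QP_STATE);
}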
> 1. Is it a MUST to put the QP into the error state before posting the
> last WR? If it's a MUST, why?
Yes, it's a must because we don't want the send executed; we want it to
complete with an error status.
> 2. The last WQE event is only generated once for each QP, even if
> IPoIB sets the QP into the error state and the CI surfaces a Local
> Work Queue Catastrophic Error on the same QP at the same time. Is
> that right?
Umm, a local work queue catastrophic error means something went wrong in
the driver/firmware/hardware -- a consumer shouldn't be able to cause
this type of event. Finding out why this catastrophic error happens
should help debug things.
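For context, both events arrive through the same per-QP async event
callback, i.e. the handler registered via ib_qp_init_attr.event_handler
at QP creation time. A sketch with assumed handler names:

#include <linux/kernel.h>
#include <rdma/ib_verbs.h>

static void rx_qp_event_handler(struct ib_event *event, void *ctx)
{
        switch (event->event) {
        case IB_EVENT_QP_LAST_WQE_REACHED:
                /* Expected teardown path for an SRQ-attached QP that
                 * was moved to the error state: kick off the drain WR. */
                break;
        case IB_EVENT_QP_FATAL:
                /* "Local work queue catastrophic error": indicates a
                 * driver/firmware/hardware problem, not something a
                 * well-behaved consumer should be able to trigger. */
                pr_err("QP 0x%x: fatal async event\n",
                       event->element.qp->qp_num);
                break;
        default:
                break;
        }
}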
- R.