[ofa-general] Questions about IPoIB handling of the last WQE event
Roland Dreier
rdreier at cisco.com
Tue Jul 22 19:59:05 PDT 2008
[ ... IB spec stuff about last WQE reached events ... ]
> The IPoIB-CM implementation takes the approach of posting another WR
> that completes on the same CQ and waiting for this WR to return as a WC.
> IPoIB first puts the QP into the error state, then waits for the last
> WQE event; in the async event handler it posts a drain WR, and the QP
> resources are released when the last CQE is generated. However, this
> works for ConnectX but not for ehca.
>
> The ehca implementation follows Section 11-5.2.5: the event is
> generated when the QP gets into the Error state and there are no more
> WQEs on the RQ. So these QP resources are never released, which causes
> a QP resource leak; no QPs can be released at all. When the maximum
> number of QPs is reached (the default is 128 for non-SRQ and 4K for
> SRQ), no more new connections can be built. Nodes can't even be
> reached.
I don't understand what the problem is for ehca. Once the QP is in the
error state, posting a WR to the send queue should complete immediately
with a flush error status, and that completion should trigger IPoIB to
clean up the QP. What goes wrong with ehca?
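For reference, the drain idiom being discussed looks roughly like the
sketch below. This is a minimal illustration, not the actual ipoib_cm
code; IPOIB_CM_RX_DRAIN_WRID here is just an illustrative wr_id cookie:

#include <rdma/ib_verbs.h>

/* Illustrative wr_id cookie used to recognize the drain completion. */
#define IPOIB_CM_RX_DRAIN_WRID 0xffffffff

static int post_drain_wr(struct ib_qp *qp)
{
        struct ib_qp_attr qp_attr = { .qp_state = IB_QPS_ERR };
        struct ib_send_wr drain_wr = {
                .wr_id      = IPOIB_CM_RX_DRAIN_WRID,
                .opcode     = IB_WR_SEND,
                .num_sge    = 0,
                .send_flags = IB_SEND_SIGNALED,
        };
        struct ib_send_wr *bad_wr;
        int ret;

        /* Move the QP to the error state first, so the send below is
         * never executed on the wire. */
        ret = ib_modify_qp(qp, &qp_attr, IB_QP_STATE);
        if (ret)
                return ret;

        /* All the HCA may do with this WR now is flush it back as a
         * completion on the QP's send CQ. */
        return ib_post_send(qp, &drain_wr, &bad_wr);
}

Because the QP is already in the error state when the WR is posted, the
flushed completion of this WR is what signals that everything ahead of
it has already been reaped.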
> LAST WQE reached event for RX QP0
> post last WR for QP0
> poll_cq
> below only applies to Mellanox; ehca won't see the last WR in the SRQ
> ----------------
> see last WR for QP0
So you're saying that the send request doesn't complete for ehca? That
seems like it must be a bug somewhere in the ehca
driver/firmware/hardware. This has nothing to do with SRQ or last WQE
reached events -- it is the basic requirement that send requests posted
when a QP is in the error state complete with a flush error status.
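Concretely, the cleanup is expected to be driven by an ordinary CQ poll
seeing that flushed drain completion, along the lines of the sketch
below (reusing the illustrative IPOIB_CM_RX_DRAIN_WRID cookie from the
earlier sketch; the actual teardown is elided):

static void drain_cq_once(struct ib_cq *cq)
{
        struct ib_wc wc;

        while (ib_poll_cq(cq, 1, &wc) > 0) {
                if (wc.wr_id == IPOIB_CM_RX_DRAIN_WRID) {
                        /* The QP was already in the error state, so
                         * the only status we should ever see here is
                         * a flush error. */
                        WARN_ON(wc.status != IB_WC_WR_FLUSH_ERR);
                        /* ... safe to reap the drained RX QP(s) ... */
                        continue;
                }
                /* ... normal RX/TX completion handling ... */
        }
}

If that flushed completion never shows up, as reported for ehca above,
this path never runs and the QP is leaked.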
> Since non-SRQ doesn't handle the async event, it never releases QPs;
> the 128 connections will run out soon even in a two-node cluster by
> repeating the above steps. (This is another bug; I will submit a fix.)
Yes, if non-SRQ doesn't free QPs, then this is another bug.
> 2. If node-1 fails to send a DREQ to the remote node for any reason,
> e.g. node-1 shuts down, then the RX QP on node-2 will only be put on
> the error list after around 21 minutes
> (IPOIB_CM_RX_TIMEOUT + IPOIB_CM_RX_DELAY = 5 * 256 * HZ jiffies = 1280 s):
> #define IPOIB_CM_RX_TIMEOUT (2 * 256 * HZ)
> #define IPOIB_CM_RX_DELAY (3 * 256 * HZ)
> This timer seems too long for releasing stale QP resources; we could
> run out of QPs in a large cluster even with mthca/mlx4.
It is a long timeout, but how often does this case happen? When a node
crashes?
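For illustration, the ~21 minutes follows directly from the two
constants quoted above (HZ jiffies equal one second, so 5 * 256 * HZ
jiffies = 1280 s), and a stale-connection reaper along these lines is
what would enforce it. This is a rough sketch with assumed names
(cm_rx_conn, last_seen, rx_conn_list); list locking and work
rescheduling are omitted:

#include <linux/jiffies.h>
#include <linux/list.h>
#include <rdma/ib_verbs.h>

#define IPOIB_CM_RX_TIMEOUT (2 * 256 * HZ)

struct cm_rx_conn {                     /* stand-in for the per-RX-QP state */
        struct list_head list;
        struct ib_qp *qp;
        unsigned long last_seen;        /* jiffies of last activity */
};

static LIST_HEAD(rx_conn_list);

/* Move anything idle longer than IPOIB_CM_RX_TIMEOUT to the error
 * state so the last-WQE/drain machinery reclaims it; the additional
 * IPOIB_CM_RX_DELAY in the quoted figure is scheduling slack before
 * this check gets to run. */
static void reap_stale_rx(void)
{
        struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
        struct cm_rx_conn *rx;

        list_for_each_entry(rx, &rx_conn_list, list)
                if (time_after_eq(jiffies, rx->last_seen + IPOIB_CM_RX_TIMEOUT))
                        ib_modify_qp(rx->qp, &attr, IB_QP_STATE);
}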
> 1. Is it a MUST to put the QP into the error state before posting the
> last WR? If it's a MUST, why?
Yes, it's a must because we don't want the send executed; we want it to
complete with an error status.
> 2. The last WQE event is only generated once for each QP, even if
> IPoIB sets the QP into the error state and the CI surfaces a Local
> Work Queue Catastrophic Error on the same QP at the same time. Is
> that right?
Umm, a local work queue catastrophic error means something went wrong in
the driver/firmware/hardware -- a consumer shouldn't be able to cause
this type of event. Finding out why this catastrophic error happens
should help debug things.
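For context, both events arrive through the same per-QP async event
callback, i.e. the handler registered via ib_qp_init_attr.event_handler
at QP creation time. A sketch with assumed handler names:

#include <linux/kernel.h>
#include <rdma/ib_verbs.h>

static void rx_qp_event_handler(struct ib_event *event, void *ctx)
{
        switch (event->event) {
        case IB_EVENT_QP_LAST_WQE_REACHED:
                /* Expected teardown path for an SRQ-attached QP that
                 * was moved to the error state: kick off the drain WR. */
                break;
        case IB_EVENT_QP_FATAL:
                /* "Local work queue catastrophic error": indicates a
                 * driver/firmware/hardware problem, not something a
                 * well-behaved consumer should be able to trigger. */
                pr_err("QP 0x%x: fatal async event\n",
                       event->element.qp->qp_num);
                break;
        default:
                break;
        }
}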
- R.