[ofa-general] Questions about IPoIB handling of the last WQE event

Shirley Ma mashirle at us.ibm.com
Tue Jul 22 17:21:22 PDT 2008


Hello Roland, Eli,

We have seen heavy QP resource leakage in ehca, for both nonSRQ and SRQ,
in IPoIB-CM mode. I have several patches to fix this issue. Before I
submit these patches for review, I would like to first discuss the
current IPoIB resource release based on the last WQE event.

In IB spec Section 10.3.1             
--------------------------
Note, for QPs that are associated with an SRQ, the Consumer should take
the QP through the Error State before invoking a Destroy QP or a Modify
QP to the Reset State. The Consumer may invoke the Destroy QP without
first performing a Modify QP to the Error State and waiting for the
Affiliated Asynchronous Last WQE Reached Event. However, if the Consumer
does not wait for the Affiliated Asynchronous Last WQE Reached Event,
then WQE and Data Segment leakage may occur. Therefore, it is good
programming practice to tear down a QP that is associated with an SRQ by
using the following process:

• Put the QP in the Error State; 
• wait for the Affiliated Asynchronous Last WQE Reached Event;
• either:
    • drain the CQ by invoking the Poll CQ verb and either wait for CQ
to be empty or the number of Poll CQ operations has exceeded CQ capacity
size; or
    • post another WR that completes on the same CQ and wait for this
       WR to return as a WC;
• and then invoke a Destroy QP or Reset QP.
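
For reference, a minimal sketch of that teardown sequence using the
kernel ib_verbs API, taking the drain-the-CQ option; the function name
is made up for illustration and error handling is omitted, so treat it
as a sketch rather than the actual IPoIB code:

#include <rdma/ib_verbs.h>

/* Illustrative teardown of a QP associated with an SRQ, following the
 * sequence recommended by the spec. */
static void example_teardown_srq_qp(struct ib_qp *qp, struct ib_cq *cq)
{
	struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
	struct ib_wc wc;

	/* 1. Put the QP in the Error state. */
	ib_modify_qp(qp, &attr, IB_QP_STATE);

	/* 2. Wait for the Last WQE Reached event; in the kernel it is
	 *    delivered to the QP's async event handler as
	 *    IB_EVENT_QP_LAST_WQE_REACHED (e.g. block here on a
	 *    completion that the handler signals). */

	/* 3. Drain the CQ (the first of the two options in the spec). */
	while (ib_poll_cq(cq, 1, &wc) > 0)
		; /* discard flushed completions */

	/* 4. Now the QP can be destroyed. */
	ib_destroy_qp(qp);
}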

Section: 11-5.2.5: 
------------------
If the HCA supports SRQ, for RC and UD service, the CI shall generate a
Last WQE Reached Affiliated Asynchronous Event on a QP that is in the
Error State and is associated with an SRQ when either:
• a CQE is generated for the last WQE, or
• the QP gets in the Error State and there are no more WQEs on the
  RQ.
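
In the kernel, this event reaches the consumer through the QP's async
event handler as IB_EVENT_QP_LAST_WQE_REACHED. A minimal sketch of such
a handler (the name is mine, not the actual ipoib_cm handler):

/* The handler is supplied in struct ib_qp_init_attr.event_handler at
 * QP creation time; qp_context is passed back as ctx. */
static void example_qp_event_handler(struct ib_event *event, void *ctx)
{
	if (event->event != IB_EVENT_QP_LAST_WQE_REACHED)
		return;

	/* event->element.qp is in the Error state and its last WQE has
	 * been reached; queue the drain/reap work for it here. */
}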

The IPoIB-CM implementation takes the approach of posting another WR
that completes on the same CQ and waiting for this WR to return as a
WC. IPoIB first puts the QP in error status, then waits for the last
WQE event in the async event handler and posts a drain WR; the QP
resources are released when the last CQEs have been generated. However,
this works for Mellanox HCAs (mthca/mlx4) but not for ehca.
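
The drain trick looks roughly like the sketch below (simplified; the
names are mine and the real driver does more bookkeeping). A send WR
with a sentinel wr_id is posted to the errored QP; since the QP is in
the Error state the WR is immediately flushed, and its completion on
the shared CQ marks the point after which no more completions for that
QP can appear:

#define EXAMPLE_DRAIN_WRID	0xffffffff	/* sentinel, no real buffer */

static struct ib_send_wr example_drain_wr = {
	.wr_id	= EXAMPLE_DRAIN_WRID,
	.opcode	= IB_WR_SEND,
};

/* Post the drain WR on a QP that is already in IB_QPS_ERR; the WR is
 * flushed and completes on the CQ shared with the SRQ receives. */
static void example_start_drain(struct ib_qp *qp)
{
	struct ib_send_wr *bad_wr;

	if (ib_post_send(qp, &example_drain_wr, &bad_wr))
		printk(KERN_WARNING "failed to post drain WR\n");
}

/* In the CQ polling loop, a completion with wr_id == EXAMPLE_DRAIN_WRID
 * means all earlier completions for the drained QP have been reaped, so
 * the QP can be queued for destruction. */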

The ehca implementation follows the second case of Section 11-5.2.5:
the Last WQE Reached event is generated when the QP gets into the Error
state and there are no more WQEs on the RQ. On ehca the drain WR never
shows up as a completion (see the timeline below), so these QP
resources are never released, which leaks QPs. When the maximum number
of QPs is reached (by default 128 for nonSRQ, 4K for SRQ), no new
connections can be built and nodes become unreachable.

We can see this problem even in an idle cluster with only occasional
pings, say a ping every 10 minutes. The ARP entry lifetime is around 6
minutes by default when there is no further traffic (route cache
timeout 300s + random(1/2 * reachable_time(15s), 3/2 * reachable_time)
+ gc clean 60s).
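With those defaults that works out to roughly 300s + (7.5s to 22.5s) +
60s ≈ 368-383s, i.e. a bit over 6 minutes.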

The ARP entries then expire, neigh destroy is called from the neighbour
garbage collection, and the IPoIB neigh cleanup gets called. It
destroys the TX QP, and destroying the cm_id sends a DREQ to the remote
connection. After the remote side receives the DREQ, it puts the QP
into error status, and the Last WQE Reached event is then generated.

node-1			node-2
---------------------------------------
ping -c 1 node-2
TX QP0 create
arp entry		RX QP0 create
wait for 10 mins
arp entry is released
ipoib_neigh_cleanup
destroy TX QP
destroy cm_id
send DREQ		received DREQ
			put QP0 in error status
			wait for async event
			LAST WQE reached event for RX QP0
			post last WR for QP0
			poll_cq
			below only applies to Mellanox; ehca won't see the
			last WR on the SRQ's CQ
			----------------
			see last WR for QP0
			put QP0 in reap_list for clean up
			queue reap work
			reap work: clean QP0
			-----------------
			ehca still keeps QP0

Repeating the above steps across a large cluster, the RX QPs for SRQ
will eventually run out.

Since the nonSRQ path doesn't handle the async event at all, it never
releases QPs; its 128 connections will run out quickly even in a
two-node cluster by repeating the above steps. (This is another bug; I
will submit a fix for it.)

The above approach has a couple of issues:

1. It works only for mthca/mlx4, not for ehca.

2. If node-1 fails to send the DREQ to the remote side for any reason
(e.g. node-1 shuts down), then the RX QP on node-2 is only put in the
error list after around 21 minutes
(IPOIB_CM_RX_TIMEOUT + IPOIB_CM_RX_DELAY = 5 * 256 * HZ):
#define IPOIB_CM_RX_TIMEOUT     (2 * 256 * HZ)
#define IPOIB_CM_RX_DELAY       (3 * 256 * HZ)
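(With HZ being timer ticks per second, that delay is (2 + 3) * 256
seconds = 1280 seconds, i.e. roughly 21 minutes.)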

This timer seems too long for releasing stale QP resources; we could
still run out of QPs in a large cluster even with mthca/mlx4.

My questions here are:

1. Is it a MUST to put the QP in error status before posting the last
(drain) WR? If it is a MUST, why?

2. The Last WQE event is generated only once for each QP, even if IPoIB
sets the QP into error status and the CI surfaces a Local Work Queue
Catastrophic Error on the same QP at the same time. Is that right?

I will post my patchsets for review based on the outcome of this
discussion.

Thanks
Shirley



