[openib-general] [PATCH] IB/cma: add rdma_establish

Tue Sep 12 01:33:22 PDT 2006

Sean Hefty wrote:
> Michael S. Tsirkin wrote:
>> Sean, did we decide what to do for upstream yet?
>> I would say we need something like the below for 2.6.19 too
>> (probably just need to update node type check).
>> And, I like it that this approach leaves all matters of policy
>> to users (such as whether move QP to RTS after asynchronous event
>> or after completion event).

> I will go with a patch similar to this one.  It seems the most flexible.

Just to make sure: are you saying that you would merge this patch
instead of the one that had the CM track local QP numbers and install
a callback on the consumer's QP to catch the async event, etc.?

Also, I'd like to make sure I follow what would happen:

T1) the consumer gets an RX completion on a QP associated with a CMA ID
that is not yet established

[also, at some point, the async handler is called with a COMM_EST
async event for this QP]

T2) the consumer calls rdma_establish()

T3) the consumer's CMA callback is called with an ESTABLISHED event,
and the consumer can now post sends to the QP
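To check my understanding, the T1-T3 sequence can be modelled roughly
as follows. This is a minimal user-space sketch under stated
assumptions, not the patch itself: struct consumer_conn,
model_rdma_establish() and post_send() are hypothetical names, and the
model assumes rdma_establish() leads straight to the ESTABLISHED
callback, which the real call may not do synchronously.

```c
#include <assert.h>

/* Hypothetical consumer-side state; not the real CMA structures. */
enum conn_state { CONN_REP_SENT, CONN_ESTABLISHED };

struct consumer_conn {
    enum conn_state state;
    int sends_posted;
};

/* T3: the CMA callback delivers ESTABLISHED; only now may sends be posted. */
void on_established_event(struct consumer_conn *c)
{
    c->state = CONN_ESTABLISHED;
}

/* Hypothetical stand-in for rdma_establish(): in this model it triggers
 * the ESTABLISHED callback immediately (an assumption, see above). */
void model_rdma_establish(struct consumer_conn *c)
{
    on_established_event(c);
}

/* T1 + T2: an RX completion on a not-yet-established ID makes the
 * consumer call (the model of) rdma_establish(). */
void on_rx_completion(struct consumer_conn *c)
{
    if (c->state != CONN_ESTABLISHED)
        model_rdma_establish(c);
}

/* Posting a send is refused until ESTABLISHED has been seen. */
int post_send(struct consumer_conn *c)
{
    if (c->state != CONN_ESTABLISHED)
        return -1;
    c->sends_posted++;
    return 0;
}
```

The only point of the model is that post_send() fails until the
ESTABLISHED callback has run.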

Indeed, the patch itself is somewhat simpler, but the consumer must get
the ESTABLISHED event before posting sends to the QP, so consumers need
to either queue RXs or move the QP to RTS before sending the REP.

As I said before, this is fine for our iSER target, since we queue the
only possible RX (the login request) until we get the ESTABLISHED event.
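For reference, the queuing our target does can be sketched like this.
It is a simplified, hypothetical model (struct deferred_conn,
handle_rx() and handle_established() are made-up names, and the fixed
queue depth is arbitrary); a real target would hold on to the posted
receive buffers rather than integer IDs.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEFERRED_RX 4   /* arbitrary queue depth for this sketch */

/* Hypothetical per-connection state: RX completions that arrive before
 * ESTABLISHED are parked here instead of being processed. */
struct deferred_conn {
    int established;
    int pending[MAX_DEFERRED_RX];  /* IDs of received messages, in order */
    size_t npending;
    size_t nprocessed;
    int last_processed;
};

/* Real code would parse the message (e.g. an iSER login request) and may
 * post a send in reply, which is only allowed once ESTABLISHED was seen. */
static void process_rx(struct deferred_conn *c, int rx_id)
{
    assert(c->established);
    c->nprocessed++;
    c->last_processed = rx_id;
}

/* Called on every RX completion; returns -1 if the queue is full. */
int handle_rx(struct deferred_conn *c, int rx_id)
{
    if (c->established) {
        process_rx(c, rx_id);
        return 0;
    }
    if (c->npending == MAX_DEFERRED_RX)
        return -1;
    c->pending[c->npending++] = rx_id;
    return 0;
}

/* Called from the ESTABLISHED event: drain queued RXs in arrival order. */
void handle_established(struct deferred_conn *c)
{
    c->established = 1;
    for (size_t i = 0; i < c->npending; i++)
        process_rx(c, c->pending[i]);
    c->npending = 0;
}
```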

Is rdma_establish() --> cm_establish() callable from non-interruptible
context? Our target defers to thread context once the CQ handler is
called, so it does the actual processing at thread level, but other
consumers may attempt to call rdma_establish() from the hard-IRQ CQ
callback context.
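If rdma_establish() turns out not to be callable from atomic context,
consumers would need to defer the call the way our target already does.
A minimal single-threaded model of that pattern follows; in_atomic,
cq_callback(), run_deferred() and model_establish() are hypothetical
stand-ins, and in the kernel the deferred part would typically run from
a workqueue.

```c
#include <assert.h>

/* Hypothetical flag mimicking the kernel's notion of atomic context;
 * none of these names are real kernel or CMA APIs. */
static int in_atomic;

struct conn {
    int work_pending;    /* set from the CQ callback, consumed later */
    int establish_calls; /* how many times the (model) establish ran */
};

/* Stand-in for a call that might sleep: refuse to run in "IRQ" context. */
static void model_establish(struct conn *c)
{
    assert(!in_atomic);
    c->establish_calls++;
}

/* Runs in "hard-IRQ" context: only record that work is needed. */
void cq_callback(struct conn *c)
{
    in_atomic = 1;
    c->work_pending = 1;   /* in the kernel: queue a work item */
    in_atomic = 0;
}

/* Runs later in thread context (a workqueue handler in the kernel). */
void run_deferred(struct conn *c)
{
    if (c->work_pending) {
        c->work_pending = 0;
        model_establish(c);
    }
}
```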

Also, does the patch ensure that only one ESTABLISHED event is delivered
for the ID, even if rdma_establish() and reception of the RTU happen in
parallel?
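One way to guarantee a single ESTABLISHED event is to let whichever
path wins a CONNECT -> ESTABLISHED state transition deliver it. The
sketch below uses a C11 compare-and-swap purely for illustration;
struct cm_id_model and try_establish() are hypothetical, and the real
CM may well serialize this differently, for instance under a per-ID
lock.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical ID states; not the real rdma_cm/ib_cm state machine. */
enum id_state { ID_CONNECT, ID_ESTABLISHED };

struct cm_id_model {
    atomic_int state;
    atomic_int events_delivered;
};

/* Both rdma_establish() and RTU reception would funnel through here.
 * Only the caller that wins the CONNECT -> ESTABLISHED transition
 * delivers the event; the other sees the state already changed. */
int try_establish(struct cm_id_model *id)
{
    int expected = ID_CONNECT;

    if (!atomic_compare_exchange_strong(&id->state, &expected, ID_ESTABLISHED))
        return 0;   /* the other path already established the ID */
    atomic_fetch_add(&id->events_delivered, 1);   /* deliver ESTABLISHED once */
    return 1;
}
```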

>> As a side note, reasons for frequent loss of RTU must be investigated.

> A lost RTU shouldn't be any more likely than a lost REQ or REP.  Is the RTU 
> never showing up?  I will look into the ib_cm and see if there's an issue that 
> would cause an RTU not to be retried.

Indeed, my initial suspicion was that heavy CPU load on the server node
prevents the MAD/CM threads from being scheduled, but since REQ
messages do arrive, I also thought we should check whether a retried
REP causes the RTU to be resent.
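The behaviour I would expect to verify is that a duplicate REP on the
active side triggers a fresh RTU rather than being dropped. A
hypothetical model of that rule follows (struct active_side and
on_rep() are made-up names; this is not the ib_cm code), assuming the
passive side retries the REP because the RTU was lost or late.

```c
#include <assert.h>

/* Hypothetical active-side states; not the ib_cm implementation. */
enum active_state { ACT_REQ_SENT, ACT_ESTABLISHED };

struct active_side {
    enum active_state state;
    int rtus_sent;
};

/* Handle a REP. The first REP moves us to ESTABLISHED and sends an RTU.
 * A retried (duplicate) REP suggests the passive side has not seen our
 * RTU, so the RTU is sent again instead of the REP being ignored. */
void on_rep(struct active_side *a)
{
    switch (a->state) {
    case ACT_REQ_SENT:
        a->state = ACT_ESTABLISHED;
        a->rtus_sent++;
        break;
    case ACT_ESTABLISHED:
        a->rtus_sent++;   /* resend the RTU for the duplicate REP */
        break;
    }
}
```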