[ofa-general] Questions about IPOIB handle last WQE event

Shirley Ma mashirle at us.ibm.com
Tue Jul 22 23:01:06 PDT 2008


On Tue, 2008-07-22 at 21:57 -0700, Roland Dreier wrote:
> If ehca does not complete
> send requests posted to a QP in the error state with a flush error, then
> there is a bug somewhere in the ehca driver/firmware/hardware.
Got it, I will discuss this with the driver team.

>  > In this case, the send side TX QP is already gone. So there shouldn't be
>  > any send executed from the remote to this RX QP. And there is no harm in
>  > delivering any outstanding CQEs in the CQ to the consumer. So it's OK not
>  > to put the QP in the error state before posting the last WR, right? Does
>  > the IB spec say anywhere that it's a MUST?
> 
> I guess it might work, but how do we avoid leaking receive work requests
> if we don't transition the QP to error at some point?  There are cases
> where we might be garbage collecting an unused connection and race with
> an incoming message.  Why wouldn't we want to make everything simple and
> transition to the error state?

I agree that the current approach is pretty simple: it uses
ipoib_cm_rx_event_handler() to handle the release of QP resources in all
cases: connections being set up, established, or stale, for SRQ.

What I had in mind was a common approach for both non-SRQ and SRQ.
Non-SRQ doesn't have a last WQE event, but in both cases we could
post_send a last WR when receiving a DREQ and use it to handle QP
resource destruction, without setting the QP to the error state.

For stale connections, i.e. ones that have not been active for a while
because the remote QP node has died for whatever reason (shutdown or
crash), we can put the QP in the error state and then destroy the QP
after a certain timeout, by which time the CQEs have already been
processed.

For a QP that hasn't reached RTU yet, e.g. when a REJ is received, we
can release the QP resources immediately, since the QP is not ready to
accept any incoming messages yet.
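
To make the three cases above concrete, here is a rough sketch in plain
user-space C. It is not IPoIB code and does not call any verbs API; every
name in it (rx_conn_state, rx_teardown_action, rx_teardown_action_for,
rx_teardown_sets_qp_error) is made up for illustration. It only encodes
which teardown path each case would take under this proposal.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the RX connection cases discussed above. */
enum rx_conn_state {
    RX_CONN_NOT_READY,  /* QP never reached RTU, e.g. REJ received */
    RX_CONN_DREQ,       /* DREQ received from the remote side */
    RX_CONN_STALE       /* idle for a while, remote may have died */
};

/* Hypothetical teardown paths, one per case in the proposal. */
enum rx_teardown_action {
    RX_DESTROY_NOW,       /* release QP resources immediately */
    RX_POST_LAST_WR,      /* post a last send WR, destroy on its completion */
    RX_ERROR_THEN_TIMER   /* move QP to error, destroy after a timeout */
};

/* Map each connection case to the teardown path proposed for it. */
enum rx_teardown_action rx_teardown_action_for(enum rx_conn_state state)
{
    switch (state) {
    case RX_CONN_NOT_READY:
        return RX_DESTROY_NOW;
    case RX_CONN_DREQ:
        return RX_POST_LAST_WR;
    case RX_CONN_STALE:
        return RX_ERROR_THEN_TIMER;
    }
    assert(!"unknown rx_conn_state");
    return RX_ERROR_THEN_TIMER;
}

/* Only the stale-connection path transitions the QP to the error state. */
bool rx_teardown_sets_qp_error(enum rx_teardown_action action)
{
    return action == RX_ERROR_THEN_TIMER;
}
```

The point of the sketch is that the DREQ path would be the same for
non-SRQ and SRQ, and only the stale path would touch the QP state.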

What do you think?

thanks
Shirley



