[openib-general] What is proper recovery from ib_poll_cq failure(gen1) ?

Mon Jan 30 22:42:51 PST 2006

Hi Jeff.

One small point first: it doesn't matter whether you are working over gen1 or gen2, Windows or Linux; the behavior will be the same.

> I'm on Windows running gen1 code and after successfully sending a number
> of packets my call to ib_poll_cq is failing with a status of
> IB_WCS_RNR_RETRY_ERR.
>
> First, am I correct that this means the receiver's resources are not
> ready?

Yes. A completion with the status RNR_RETRY means "Receiver Not Ready": the requestor sent a message that should consume a WR in the responder's RQ, but the responder's RQ was empty, and the requestor ran out of RNR retries.

> If so, what does *that* mean?  My receiver (Linux, gen2) has posted a
> receive WR and is waiting for a CQ event which never comes.

Maybe all of the WRs in this RQ were already consumed? Maybe the two sides are not synchronized, so the send arrives before the receive is posted?
(You could also try increasing the RNR retry count or the RNR timer.)
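
To illustrate why the RNR retry count matters, here is a minimal toy model in C. It is only a sketch: every name in it (toy_send, TOY_RNR_RETRY_INFINITE, and so on) is hypothetical and belongs to no verbs API, and the responder is reduced to "the RQ is empty for the first N attempts". It relies on one fact from the IB spec: an RNR retry count of 7 means "retry forever".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for this sketch; not part of any verbs API. */
enum toy_status { TOY_SUCCESS, TOY_RNR_RETRY_ERR };

/* In the IB spec, an RNR retry count of 7 means "retry forever". */
#define TOY_RNR_RETRY_INFINITE 7

/*
 * Model a single send.
 * naks_before_ready: number of attempts that find the responder's RQ
 *                    empty before a receive WR is finally posted.
 * rnr_retry:         the requestor's RNR retry count (0..7).
 */
enum toy_status toy_send(int naks_before_ready, int rnr_retry)
{
    assert(naks_before_ready >= 0);
    assert(rnr_retry >= 0 && rnr_retry <= 7);

    int retries_left = rnr_retry;
    for (int attempt = 0; ; attempt++) {
        bool rq_has_wr = attempt >= naks_before_ready;
        if (rq_has_wr)
            return TOY_SUCCESS;          /* responder consumed a WR */

        /* RNR NAK received: retry only if the budget allows it. */
        if (rnr_retry != TOY_RNR_RETRY_INFINITE) {
            if (retries_left == 0)
                return TOY_RNR_RETRY_ERR; /* budget exhausted */
            retries_left--;
        }
    }
}
```

With a retry count of 2, a responder that is late by two attempts still succeeds, while one that is late by three produces the RNR retry error. A larger retry count (or a longer RNR timer) only buys time; the receiver still has to post its WR before the budget runs out.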

> Second, what is the correct recovery logic for this?  I've tried
> re-posting the send and re-polling the CQ, but that gives me
> IB_WCS_WR_FLUSHED_ERR over and over again.  So it seems to me that I
> have a problem on my receive side, but I don't have the foggiest idea
> what it could be.

You cannot recover after this status, and this is why every completion after it comes back flushed with an error.

In RC QPs, a completion with error on either the RQ or the SQ moves the QP to the ERR state (cannot be recovered).
In UC/UD QPs, a completion with error on the SQ moves the QP to the SQE state (can be recovered: the QP state can be changed back to RTS).
In UC/UD QPs, a completion with error on the RQ moves the QP to the ERR state (cannot be recovered).
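
The three rules above can be written down as a small lookup function. This is only a sketch of the rules as stated here; the enum and function names are hypothetical and are not taken from the gen1 or gen2 headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for this sketch; not taken from any verbs header. */
enum toy_qp_type  { TOY_RC, TOY_UC, TOY_UD };
enum toy_queue    { TOY_SQ, TOY_RQ };
enum toy_qp_state { TOY_RTS, TOY_SQE, TOY_ERR };

/* State a QP ends up in after a completion with error on the given queue. */
enum toy_qp_state toy_state_after_error(enum toy_qp_type type,
                                        enum toy_queue queue)
{
    if (type == TOY_RC)
        return TOY_ERR;                 /* RC: any error is fatal */
    /* UC/UD: only a send-queue error is recoverable. */
    return queue == TOY_SQ ? TOY_SQE : TOY_ERR;
}

/* True when the connection has to be established again. */
bool toy_needs_reconnect(enum toy_qp_state state)
{
    return state == TOY_ERR;
}
```

In your case the QP is RC, so the RNR retry error on the SQ lands in the first branch and the QP ends up in ERR.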

The only way to "recover" once at least one of the QPs is in the error state is to establish the connection again.
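
As a rough picture of what you are seeing, here is a toy model of an RC connection in C. It is a sketch under loud assumptions: all names (struct toy_conn, toy_post_send, toy_reconnect) are hypothetical, the responder is reduced to a flag saying "RQ stayed empty until the RNR retries ran out", and re-establishing the connection is collapsed into a single call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for this sketch; not part of any verbs API. */
enum conn_qp_state { CONN_RTS, CONN_ERR };
enum conn_wc_status {
    CONN_WC_SUCCESS,
    CONN_WC_RNR_RETRY_ERR,
    CONN_WC_WR_FLUSHED_ERR
};

struct toy_conn {
    enum conn_qp_state state;
    int generation;   /* counts how many times we (re)connected */
};

/*
 * Post one send and return the status of its completion.
 * rq_stayed_empty: the responder had no receive WR posted for the
 *                  whole RNR retry budget.
 */
enum conn_wc_status toy_post_send(struct toy_conn *c, bool rq_stayed_empty)
{
    if (c->state == CONN_ERR)
        return CONN_WC_WR_FLUSHED_ERR;  /* QP in error: WR is flushed */
    if (rq_stayed_empty) {
        c->state = CONN_ERR;            /* RC QP: error is fatal */
        return CONN_WC_RNR_RETRY_ERR;
    }
    return CONN_WC_SUCCESS;
}

/* Stand-in for tearing down and establishing the connection again. */
void toy_reconnect(struct toy_conn *c)
{
    c->state = CONN_RTS;
    c->generation++;
}
```

Re-posting the send after the first error hits the first branch every time, which matches the repeated flushed completions you describe; only the reconnect step makes sends succeed again.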

Hope this info helps.

Dotan