[ewg] RDS - Recovering from RDMA errors

Sun Jan 20 10:35:09 PST 2008

On Friday 18 January 2008 21:41, Roland Dreier wrote:
> I don't follow this.  All work requests should generate a completion
> eventually, unless you do something like destroy the work queue or
> overrun a CQ.  So what part of the spec are you talking about here?

The part on affiliated asynchronous errors says WQ processing is stopped.
This also happens if we're signalling a remote access error to the
other host.

When an RDMA operation errors out because the remote side destroyed the MR,
the RDMA WQE completes with error 10 (remote access error), which is expected.
The other end sees an affiliated asynchronous error code 3 (remote access
error), which is also expected.

Now, on the sending system, I'm seeing send queue entries that do not get
completed. The RDMA itself is completed in error; the subsequent SEND
is completed (error 5, flushed) as well. But one or more entries seem to
remain on the queue - at least my book-keeping says so. I double checked
the book-keeping, and it seems accurate...
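For illustration, the kind of book-keeping described above can be sketched as below. This is a minimal standalone sketch, not the RDS code: the names send_ring, ring_post, ring_complete and ring_stuck and the ring size are hypothetical, and the wr_id is assumed to double as the ring slot index. The status values are the ones quoted in this thread (5 for flushed, 10 for remote access error).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8  /* hypothetical ring size, not taken from RDS */

/* Status values as quoted in this thread. */
enum wc_status {
    WC_SUCCESS        = 0,
    WC_WR_FLUSH_ERR   = 5,   /* "error 5, flushed" */
    WC_REM_ACCESS_ERR = 10,  /* "error 10, remote access error" */
};

struct send_ring {
    uint8_t  posted[RING_SIZE];      /* 1 while the slot awaits a completion */
    int      last_status[RING_SIZE]; /* status of the last completion, -1 if none */
    unsigned outstanding;            /* number of slots awaiting a completion */
};

void ring_init(struct send_ring *r)
{
    for (size_t i = 0; i < RING_SIZE; i++) {
        r->posted[i] = 0;
        r->last_status[i] = -1;
    }
    r->outstanding = 0;
}

/* Record that a work request was posted; wr_id is the ring slot. */
void ring_post(struct send_ring *r, uint64_t wr_id)
{
    assert(wr_id < RING_SIZE && !r->posted[wr_id]);
    r->posted[wr_id] = 1;
    r->last_status[wr_id] = -1;
    r->outstanding++;
}

/* Record a completion. Returns -1 if the slot was not posted, which
 * would indicate a book-keeping error or a duplicate completion. */
int ring_complete(struct send_ring *r, uint64_t wr_id, int status)
{
    if (wr_id >= RING_SIZE || !r->posted[wr_id])
        return -1;
    r->posted[wr_id] = 0;
    r->last_status[wr_id] = status;
    r->outstanding--;
    return 0;
}

/* Copy the slots still awaiting a completion into ids; return the count. */
size_t ring_stuck(const struct send_ring *r, uint64_t *ids, size_t max)
{
    size_t n = 0;
    for (uint64_t i = 0; i < RING_SIZE && n < max; i++)
        if (r->posted[i])
            ids[n++] = i;
    return n;
}
```

Posting three entries and then recording one completion with status 10 and one with status 5 leaves ring_stuck() reporting the third slot, which is the symptom described here.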

All very strange.

>  > I tried destroying the QP first, then we know we can pick off
>  > any remaining WRs still allocated. That didn't work, as the card
>  > seems to generate interrupts even after the QP is gone. This results
>  > in lots of errors on the console complaining about "Completion to
>  > bogus CQ".
> 
> Destroying a QP should immediately stop work processing, so no
> completions should be generated once the destroy QP operation
> returns.  I don't see how you get the bogus CQ message in this case --
> it certainly seems like a driver bug.  Unless you mean you are
> destroying the CQ with a QP still attached?  But that shouldn't be
> possible because the CQ's usecnt should be non-zero until all attached
> QPs are freed.  Not sure what could be going on but it sounds bad...

This may be a driver bug, yes.

>  > I then tried to move the QP to error state instead - this didn't
>  > elicit a storm of kernel messages anymore, but still I seem to get
>  > incoming completions.
> 
> The cleanest way to destroy a QP is to move the QP to the error state,
> wait until you have seen a completion for every posted work request
> (the completions generated after the transition to the error state
> should have a "flush" status), and then destroy the QP.

Okay, that's what the RDS code does currently, but I get stuck waiting
for the queue to drain - it simply never does.
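One way to avoid blocking forever on a queue that never drains is to bound the wait and report what is left. The sketch below is only an illustration under stated assumptions: poll_fn is a hypothetical stand-in for whatever reaps completions (in a kernel consumer that would be a loop around ib_poll_cq() after the QP has been moved to the error state), and fake_cq simulates a CQ that delivers fewer flush completions than were posted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical poll callback: returns the number of completions
 * reaped on this call, or 0 if none are available right now. */
typedef int (*poll_fn)(void *ctx);

enum drain_result { DRAIN_OK = 0, DRAIN_TIMEOUT = 1 };

/* Poll until *outstanding reaches zero or max_polls attempts have
 * been made. On timeout, *outstanding holds the number of work
 * requests that never produced a completion. */
enum drain_result drain_send_queue(poll_fn poll, void *ctx,
                                   unsigned *outstanding,
                                   unsigned max_polls)
{
    assert(poll != NULL && outstanding != NULL);

    for (unsigned i = 0; i < max_polls && *outstanding > 0; i++) {
        int n = poll(ctx);
        if (n > 0) {
            unsigned got = (unsigned)n;
            /* Clamp so an unexpected extra completion cannot wrap the count. */
            *outstanding -= (got > *outstanding) ? *outstanding : got;
        }
    }
    return *outstanding == 0 ? DRAIN_OK : DRAIN_TIMEOUT;
}

/* Simulated CQ holding a fixed number of flush completions. */
struct fake_cq {
    unsigned pending;
};

int fake_poll(void *ctx)
{
    struct fake_cq *cq = ctx;
    if (cq->pending == 0)
        return 0;
    cq->pending--;
    return 1;
}
```

With three work requests outstanding and only two completions available from fake_cq, drain_send_queue() returns DRAIN_TIMEOUT and leaves one entry outstanding instead of waiting indefinitely. A real implementation would sleep or wait for a CQ event between polls rather than spin.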

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir at lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax