[ewg] RDS - Recovering from RDMA errors

Fri Jan 18 12:41:45 PST 2008

 > When I hit an RDMA error (which happens quite frequently now at rds-stress
 > exit, thanks to the fixed mr pool flushing :) I often see the RDS shutdown_worker
 > getting stuck (and rmmod hangs). It's waiting for allocated WRs to disappear.
 > This usually works, as all WQ entries are flushed out. This doesn't happen
 > when an RDMA transfer generates a remote access error, and that seems to be
 > intended according to the spec.

I don't follow this.  All work requests should generate a completion
eventually, unless you do something like destroy the work queue or
overrun a CQ.  So what part of the spec are you talking about here?

 > I tried destroying the QP first, then we know we can pick off
 > any remaining WRs still allocated. That didn't work, as the card
 > seems to generate interrupts even after the QP is gone. This results
 > in lots of errors on the console complaining about "Completion to
 > bogus CQ".

Destroying a QP should immediately stop work processing, so no
completions should be generated once the destroy QP operation
returns.  I don't see how you get the bogus CQ message in this case --
it certainly seems like a driver bug.  Unless you mean you are
destroying the CQ with a QP still attached?  But that shouldn't be
possible because the CQ's usecnt should be non-zero until all attached
QPs are freed.  Not sure what could be going on but it sounds bad...

 > I then tried to move the QP to error state instead - this didn't
 > elicit a storm of kernel messages anymore, but still I seem to get
 > incoming completions.

The cleanest way to destroy a QP is to move the QP to the error state,
wait until you have seen a completion for every posted work request
(the completions generated after the transition to the error state
should have a "flush" status), and then destroy the QP.
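
As a rough illustration of that ordering, here is a toy C model. It is
not the verbs API: every toy_* name is made up for this sketch, and the
"QP" is just a counter of queued work requests. In real kernel code the
corresponding steps would be ib_modify_qp() to IB_QPS_ERR, ib_poll_cq()
until every posted WR has been accounted for (expecting a status of
IB_WC_WR_FLUSH_ERR), and only then ib_destroy_qp().

```c
#include <assert.h>

/* Toy stand-ins for QP state and completion status (hypothetical names). */
enum toy_qp_state { TOY_QPS_RTS, TOY_QPS_ERR, TOY_QPS_DESTROYED };
enum toy_wc_status { TOY_WC_SUCCESS, TOY_WC_WR_FLUSH_ERR };

struct toy_qp {
	enum toy_qp_state state;
	unsigned int queued;	/* WRs posted that have no completion yet */
};

struct toy_wc {
	enum toy_wc_status status;
};

/* Post one work request to the toy QP. */
static void toy_post(struct toy_qp *qp)
{
	qp->queued++;
}

/* Step 1: move the QP to the error state. */
static void toy_modify_to_error(struct toy_qp *qp)
{
	qp->state = TOY_QPS_ERR;
}

/*
 * Poll for one completion.  In this model a QP in the error state
 * flushes its queued WRs one at a time; any other state yields nothing.
 * Returns the number of completions written to *wc (0 or 1).
 */
static int toy_poll_cq(struct toy_qp *qp, struct toy_wc *wc)
{
	if (qp->state != TOY_QPS_ERR || qp->queued == 0)
		return 0;
	qp->queued--;
	wc->status = TOY_WC_WR_FLUSH_ERR;
	return 1;
}

/* Step 3: destroy the QP; only legal here once nothing is queued. */
static void toy_destroy(struct toy_qp *qp)
{
	assert(qp->queued == 0);
	qp->state = TOY_QPS_DESTROYED;
}

/*
 * Tear down a QP on which the caller has 'outstanding' WRs posted.
 * Returns the number of flush completions reaped, or -1 (QP left in
 * the error state, not destroyed) if the completions ran dry early.
 */
static int toy_teardown(struct toy_qp *qp, unsigned int outstanding)
{
	struct toy_wc wc;
	int flushed = 0;

	toy_modify_to_error(qp);

	/* Step 2: reap one completion per posted WR. */
	while (outstanding > 0) {
		if (toy_poll_cq(qp, &wc) == 0)
			return -1;	/* real code would wait for a CQ event */
		if (wc.status == TOY_WC_WR_FLUSH_ERR)
			flushed++;
		outstanding--;
	}

	toy_destroy(qp);
	return flushed;
}
```

The point of the sketch is only the order of operations: the consumer
keeps its own count of posted WRs, and the destroy is the last step
rather than the first.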

 - R.