[ewg] RDS - Recovering from RDMA errors
rdreier at cisco.com
Fri Jan 18 12:41:45 PST 2008
> When I hit an RDMA error (which happens quite frequently now at rds-stress
> exit, thanks to the fixed mr pool flushing :) I often see the RDS shutdown_worker
> getting stuck (and rmmod hangs). It's waiting for allocated WRs to disappear.
> This usually works, as all WQ entries are flushed out. This doesn't happen
> when an RDMA transfer generates a remote access error, and that seems to be
> intended according to the spec.
I don't follow this. All work requests should generate a completion
eventually, unless you do something like destroy the work queue or
overrun a CQ. So what part of the spec are you talking about here?
> I tried destroying the QP first, then we know we can pick off
> any remaining WRs still allocated. That didn't work, as the card
> seems to generate interrupts even after the QP is gone. This results
> in lots of errors on the console complaining about "Completion to
> bogus CQ".
Destroying a QP should immediately stop work processing, so no
completions should be generated once the destroy QP operation
returns. I don't see how you get the bogus CQ message in this case --
it certainly seems like a driver bug. Unless you mean you are
destroying the CQ with a QP still attached? But that shouldn't be
possible because the CQ's usecnt should be non-zero until all attached
QPs are freed. Not sure what could be going on but it sounds bad...
> I then tried to move the QP to error state instead - this didn't
> elicit a storm of kernel messages anymore, but still I seem to get
> incoming completions.
The cleanest way to destroy a QP is to move the QP to the error state,
wait until you have seen a completion for every posted work request
(the completions generated after the transition to the error state
should have a "flush" status), and then destroy the QP.
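
To make the ordering concrete, here is a minimal sketch of that teardown
sequence as a toy userspace C model. It is not the kernel verbs API: every
name in it (sim_qp, sim_post, sim_move_to_error, sim_poll, sim_teardown) is
hypothetical and exists only for illustration. In real kernel code the three
steps would roughly correspond to ib_modify_qp() with IB_QPS_ERR, repeated
ib_poll_cq() until every posted work request has been accounted for, and then
ib_destroy_qp(). The model assumes a single queue with in-order completions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model for illustration; not the kernel ib_verbs API. */
enum sim_status { SIM_PENDING, SIM_SUCCESS, SIM_REM_ACCESS_ERR, SIM_FLUSH_ERR };

#define SIM_MAX_WR 16

struct sim_qp {
    enum sim_status wr[SIM_MAX_WR]; /* status each posted WR completes with */
    size_t posted;                  /* work requests posted so far */
    size_t reaped;                  /* completions already consumed */
    bool in_error;                  /* QP has been moved to the error state */
    bool destroyed;                 /* QP has been destroyed */
};

/* Post one work request; fails if the queue is full or the QP is in error. */
bool sim_post(struct sim_qp *qp)
{
    if (qp->posted == SIM_MAX_WR || qp->in_error)
        return false;
    qp->wr[qp->posted++] = SIM_PENDING;
    return true;
}

/* Step 1: move to the error state. WRs not yet processed become flushes. */
void sim_move_to_error(struct sim_qp *qp)
{
    for (size_t i = qp->reaped; i < qp->posted; i++)
        if (qp->wr[i] == SIM_PENDING)
            qp->wr[i] = SIM_FLUSH_ERR;
    qp->in_error = true;
}

/* Reap one completion in order; returns false if none is available yet. */
bool sim_poll(struct sim_qp *qp, enum sim_status *status)
{
    if (qp->reaped == qp->posted || qp->wr[qp->reaped] == SIM_PENDING)
        return false;
    *status = qp->wr[qp->reaped++];
    return true;
}

/*
 * Error state, drain, destroy, in that order.
 * Returns how many completions carried the flush status.
 */
size_t sim_teardown(struct sim_qp *qp)
{
    enum sim_status st;
    size_t flushed = 0;

    sim_move_to_error(qp);

    /* Step 2: wait until every posted WR has produced a completion. */
    while (qp->reaped < qp->posted) {
        if (!sim_poll(qp, &st))
            continue; /* real code would sleep or wait for a CQ event */
        if (st == SIM_FLUSH_ERR)
            flushed++;
    }

    /* Step 3: only now is it safe to destroy the QP. */
    assert(qp->reaped == qp->posted);
    qp->destroyed = true;
    return flushed;
}
```

The point of the model is the bookkeeping: the caller counts posted work
requests against reaped completions and destroys the QP only when the two
match, so no allocated WR is left waiting for a completion that will never
arrive.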