[ewg] RDS - Recovering from RDMA errors
Dotan Barak
dotanb at dev.mellanox.co.il
Thu Jan 17 07:51:28 PST 2008
Olaf Kirch wrote:
> When I hit a RDMA error (which happens quite frequently now at rds-stress
> exit, thanks to the fixed mr pool flushing :) I often see the RDS shutdown_worker
> getting stuck (and rmmod hangs). It's waiting for allocated WRs to disappear.
> This usually works, as all WQ entries are flushed out. This doesn't happen
> when a RDMA transfer generates a remote access error, and that seems to be
> intended according to the spec.
>
> I tried destroying the QP first, then we know we can pick off
> any remaining WRs still allocated. That didn't work, as the card
> seems to generate interrupts even after the QP is gone. This results
> in lots of errors on the console complaining about "Completion to
> bogus CQ".
>
> I then tried to move the QP to error state instead - this didn't
> elicit a storm of kernel messages anymore, but still I seem to get
> incoming completions.
>
> Any other suggestions?
>
Moving the QP to the error state flushes all of the outstanding WRs and
generates a completion (with flush error status) for each WR.
If you want to delete all of the outstanding WRs without getting
completions for them, you should move the QP state to reset.
(Is this what you asked?)
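
The difference between the two transitions can be sketched with a small
toy model in C. This is only an illustration of the semantics described
above, not driver code: the toy_* names and the struct are hypothetical.
In the kernel the transition itself would presumably go through
ib_modify_qp() with IB_QP_STATE in the attribute mask and qp_state set
to IB_QPS_ERR or IB_QPS_RESET.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: every toy_* name is hypothetical, not the verbs API. */
enum toy_qp_state { TOY_QPS_RTS, TOY_QPS_ERR, TOY_QPS_RESET };

struct toy_qp {
    enum toy_qp_state state;
    int outstanding_wrs;  /* WRs posted but not yet completed */
    int flushed_cqes;     /* completions generated with flush error status */
};

/*
 * Apply a state transition to the toy QP and return the number of
 * completions that the transition generates.
 */
static int toy_modify_qp(struct toy_qp *qp, enum toy_qp_state next)
{
    int generated = 0;

    assert(qp != NULL);

    if (next == TOY_QPS_ERR) {
        /* Error state: every outstanding WR is flushed and each one
         * produces a completion that the consumer still has to reap. */
        generated = qp->outstanding_wrs;
        qp->flushed_cqes += generated;
        qp->outstanding_wrs = 0;
    } else if (next == TOY_QPS_RESET) {
        /* Reset state: outstanding WRs are dropped and no completions
         * are generated for them. */
        qp->outstanding_wrs = 0;
    }

    qp->state = next;
    return generated;
}
```

With three WRs outstanding, toy_modify_qp(&qp, TOY_QPS_ERR) returns 3
and leaves three flush completions to poll, while
toy_modify_qp(&qp, TOY_QPS_RESET) returns 0 and leaves none. The model
ignores completions that were already sitting in the CQ before the
transition.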
Dotan