[ewg] RDS - Recovering from RDMA errors

Olaf Kirch olaf.kirch at oracle.com
Thu Jan 17 07:47:57 PST 2008


When I hit an RDMA error (which happens quite frequently now at rds-stress
exit, thanks to the fixed MR pool flushing :) I often see the RDS shutdown_worker
get stuck (and rmmod hangs). It is waiting for the allocated WRs to disappear.
This usually works, as all WQ entries are flushed out. The flush doesn't happen
when an RDMA transfer generates a remote access error, and according to the
spec that seems to be intended.
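
To illustrate the hang, here is a toy userspace model of the wait. It is
only a sketch, not the RDS code: struct toy_conn, the WR_* states and the
toy_* functions are all made-up names, and the bounded round count is an
assumption added so the example terminates instead of hanging. WR_LOST
stands for a WR that never gets a completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not RDS or verbs APIs. */
#define TOY_NR_WRS 8

enum wr_state { WR_FREE, WR_POSTED, WR_LOST };

struct toy_conn {
    enum wr_state wr[TOY_NR_WRS];
};

/* Count the WRs the shutdown path still considers allocated. */
static size_t toy_outstanding(const struct toy_conn *c)
{
    size_t n = 0;

    assert(c != NULL);
    for (size_t i = 0; i < TOY_NR_WRS; i++)
        if (c->wr[i] != WR_FREE)
            n++;
    return n;
}

/*
 * One poll round: every WR_POSTED entry receives a flush completion
 * and is freed. WR_LOST entries model WRs for which no completion
 * ever arrives, so they are left untouched.
 */
static size_t toy_poll_flush(struct toy_conn *c)
{
    size_t reaped = 0;

    for (size_t i = 0; i < TOY_NR_WRS; i++) {
        if (c->wr[i] == WR_POSTED) {
            c->wr[i] = WR_FREE;
            reaped++;
        }
    }
    return reaped;
}

/*
 * Bounded stand-in for the shutdown wait: returns true if all WRs
 * drained within max_rounds poll rounds, false if some are still
 * allocated (the case where an unbounded wait would hang).
 */
static bool toy_shutdown_wait(struct toy_conn *c, unsigned max_rounds)
{
    for (unsigned r = 0; r < max_rounds; r++) {
        if (toy_outstanding(c) == 0)
            return true;
        toy_poll_flush(c);
    }
    return toy_outstanding(c) == 0;
}
```

With only WR_POSTED entries, toy_shutdown_wait() returns true after the
first flush round. With a single WR_LOST entry it returns false no matter
how many rounds are allowed, which is the situation an unbounded wait
cannot get out of.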

I tried destroying the QP first, since after that we know we can pick
off any remaining WRs that are still allocated. That didn't work, as the
card seems to generate interrupts even after the QP is gone. This results
in lots of errors on the console complaining about "Completion to
bogus CQ".

I then tried moving the QP to the error state instead. This no longer
elicits a storm of kernel messages, but I still seem to get incoming
completions.

Any other suggestions?

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir at lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax


