[ewg] RDS - Recovering from RDMA errors
Olaf Kirch
olaf.kirch at oracle.com
Tue Jan 22 10:24:58 PST 2008
On Sunday 20 January 2008 20:57, Roland Dreier wrote:
> If you could send me some code and a recipe to get the "bogus CQ"
> message, that might be helpful. Because as far as I can see, there
> shouldn't be any way for a consumer to get that message without a bug
> in the low-level driver. It's fine if it's a whole big RDS test case,
> I just want to be able to run the test and instrument the low-level
> driver to get a better handle on what's happening.
Okay, I put my current patch queue into a git tree. It's in
the "testing" branch of

	git://www.openfabrics.org/~okir/ofed_1_3/linux-2.6.git
	git://www.openfabrics.org/~okir/ofed_1_3/rds-tools.git

In order to reproduce the problem, I usually run

	while sleep 1; do
		rds-stress -R -r <locip> -s <remip> -p 4000 -c -d2 -t8 -T5 -D1m
	done

Within minutes, I get syslog messages saying

	Timed out waiting for CQs to be drained - recv: 0 entries, send: 4 entries left

This message originates from net/rds_ib_cm.c - as a workaround, I added
a timeout of 1 second when waiting for the WQs to be drained. I usually
get those stalls after a WQE completes with status 10 (or sometimes 4).
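
To illustrate the kind of workaround described above, here is a minimal
user-space sketch of a drain wait bounded by a timeout. It is not the code
from the patch queue: struct demo_ring, demo_reap_one() and
demo_wait_drained() are hypothetical names made up for this example, and the
real code runs in the kernel against the RDS rings. The status names assume
the numbers quoted above follow the ordering of enum ib_wc_status, in which
case 4 would be IB_WC_LOC_PROT_ERR and 10 would be IB_WC_REM_ACCESS_ERR.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() and nanosleep() */
#endif
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a ring of posted work requests. */
struct demo_ring {
	unsigned int outstanding;	/* entries that have not completed yet */
};

/* Names for the statuses mentioned in the mail, assuming the numbering
 * of enum ib_wc_status. Anything else is reported as "other". */
static const char *demo_wc_status_name(int status)
{
	switch (status) {
	case 0:  return "IB_WC_SUCCESS";
	case 4:  return "IB_WC_LOC_PROT_ERR";
	case 10: return "IB_WC_REM_ACCESS_ERR";
	default: return "other";
	}
}

/* Hypothetical completion handler: retires one entry per call. */
static void demo_reap_one(struct demo_ring *ring)
{
	if (ring->outstanding > 0)
		ring->outstanding--;
}

/* Milliseconds elapsed since *start on the monotonic clock. */
static long long demo_elapsed_ms(const struct timespec *start)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return ((long long)(now.tv_sec - start->tv_sec) * 1000000000LL +
		(now.tv_nsec - start->tv_nsec)) / 1000000LL;
}

/* Poll until the ring is empty or timeout_ms has passed.
 * poll_fn may be NULL, which models a ring that never makes progress.
 * Returns true if the ring drained, false on timeout. */
static bool demo_wait_drained(struct demo_ring *ring,
			      void (*poll_fn)(struct demo_ring *),
			      long long timeout_ms)
{
	const struct timespec nap = { 0, 1000000L };	/* 1 ms */
	struct timespec start;

	clock_gettime(CLOCK_MONOTONIC, &start);
	while (ring->outstanding > 0) {
		if (demo_elapsed_ms(&start) >= timeout_ms)
			return false;	/* caller logs and carries on */
		if (poll_fn)
			poll_fn(ring);
		nanosleep(&nap, NULL);
	}
	return true;
}
```

Calling demo_wait_drained(&ring, demo_reap_one, 1000) returns true once the
ring is empty. With a NULL poll function and entries still outstanding, it
gives up after the timeout and returns false, which is the point where a
caller would log something like the "Timed out waiting" message and continue
with the teardown instead of hanging.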
> BTW, what kind of HCA are you using for this testing?
A pair of fairly new Mellanox cards.
Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax