[ewg] RDS - Recovering from RDMA errors

Olaf Kirch olaf.kirch at oracle.com
Tue Jan 22 10:24:58 PST 2008


On Sunday 20 January 2008 20:57, Roland Dreier wrote:
> If you could send me some code and a recipe to get the "bogus CQ"
> message, that might be helpful.  Because as far as I can see, there
> shouldn't be any way for a consumer to get that message without a bug
> in the low-level driver.  It's fine if it's a whole big RDS test case,
> I just want to be able to run the test and instrument the low-level
> driver to get a better handle on what's happening.

Okay, I put my current patch queue into a git tree. It's in
the "testing" branch of

git://www.openfabrics.org/~okir/ofed_1_3/linux-2.6.git
git://www.openfabrics.org/~okir/ofed_1_3/rds-tools.git

In order to reproduce the problem, I usually run

while sleep 1; do
	rds-stress -R -r <locip> -s <remip> -p 4000 -c -d2 -t8 -T5 -D1m
done

Within minutes, I get syslog messages saying

Timed out waiting for CQs to be drained - recv: 0 entries, send: 4 entries left

This message originates from net/rds_ib_cm.c. As a workaround, I added
a timeout of 1 second when waiting for the WQs to be drained. I usually
get those stalls after a WQE completes with status 10 (or sometimes 4).
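
For reference, and assuming the status numbers above are ib_wc_status
values as enumerated in the kernel's include/rdma/ib_verbs.h, 10 would
be IB_WC_REM_ACCESS_ERR and 4 would be IB_WC_LOC_PROT_ERR. A minimal
shell sketch for decoding such numbers when reading syslog output could
look like this (wc_status_name is a made-up helper name, not part of
rds-tools, and only a few codes are listed):

```shell
# Hypothetical helper: map a numeric work completion status to its
# symbolic name. Assumes the numbering of enum ib_wc_status in
# include/rdma/ib_verbs.h; only a subset of the codes is covered.
wc_status_name() {
    case "$1" in
        0)  echo IB_WC_SUCCESS ;;
        4)  echo IB_WC_LOC_PROT_ERR ;;
        5)  echo IB_WC_WR_FLUSH_ERR ;;
        10) echo IB_WC_REM_ACCESS_ERR ;;
        12) echo IB_WC_RETRY_EXC_ERR ;;
        *)  echo "unknown($1)" ;;   # anything not listed above
    esac
}
```

Calling "wc_status_name 10" prints the symbolic name for the status
seen most often before the stalls.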

> BTW, what kind of HCA are you using for this testing?

A pair of fairly new Mellanox cards.

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir at lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax


