[ewg] RDS - Recovering from RDMA errors
    Olaf Kirch 
    olaf.kirch at oracle.com
       
    Tue Jan 22 10:24:58 PST 2008
    
    
  
On Sunday 20 January 2008 20:57, Roland Dreier wrote:
> If you could send me some code and a recipe to get the "bogus CQ"
> message, that might be helpful.  Because as far as I can see, there
> shouldn't be any way for a consumer to get that message without a bug
> in the low-level driver.  It's fine if it's a whole big RDS test case,
> I just want to be able to run the test and instrument the low-level
> driver to get a better handle on what's happening.

Okay, I put my current patch queue into a git tree. It's in
the "testing" branch of

	git://www.openfabrics.org/~okir/ofed_1_3/linux-2.6.git
	git://www.openfabrics.org/~okir/ofed_1_3/rds-tools.git

In order to reproduce the problem, I usually run

	while sleep 1; do
		rds-stress -R -r <locip> -s <remip> -p 4000 -c -d2 -t8 -T5 -D1m
	done

Within minutes, I get syslog messages saying

	Timed out waiting for CQs to be drained - recv: 0 entries, send: 4 entries left

This message originates from net/rds_ib_cm.c - as a workaround, I added
a timeout of 1 second when waiting for the WQs to be drained. I usually
get those stalls after a WQE completes with status 10 (or sometimes 4).

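The status numbers above are raw values. A minimal sketch for decoding them follows, assuming they index the kernel's enum ib_wc_status (the message does not say this explicitly). The table is transcribed from the usual ordering in include/rdma/ib_verbs.h and should be checked against the tree in use. wc_status_name() is a hypothetical helper for illustration, not part of RDS or the verbs API.

```c
#include <stddef.h>

/*
 * Names in the order of enum ib_wc_status (assumed; verify against
 * include/rdma/ib_verbs.h in your tree). Only the first 13 values
 * are listed here, anything beyond that is reported as "unknown".
 */
static const char *const wc_status_names[] = {
    "IB_WC_SUCCESS",         /*  0 */
    "IB_WC_LOC_LEN_ERR",     /*  1 */
    "IB_WC_LOC_QP_OP_ERR",   /*  2 */
    "IB_WC_LOC_EEC_OP_ERR",  /*  3 */
    "IB_WC_LOC_PROT_ERR",    /*  4 */
    "IB_WC_WR_FLUSH_ERR",    /*  5 */
    "IB_WC_MW_BIND_ERR",     /*  6 */
    "IB_WC_BAD_RESP_ERR",    /*  7 */
    "IB_WC_LOC_ACCESS_ERR",  /*  8 */
    "IB_WC_REM_INV_REQ_ERR", /*  9 */
    "IB_WC_REM_ACCESS_ERR",  /* 10 */
    "IB_WC_REM_OP_ERR",      /* 11 */
    "IB_WC_RETRY_EXC_ERR",   /* 12 */
};

/* Hypothetical helper: map a raw completion status to its enum name. */
const char *wc_status_name(int status)
{
    size_t n = sizeof(wc_status_names) / sizeof(wc_status_names[0]);

    if (status < 0 || (size_t)status >= n)
        return "unknown";
    return wc_status_names[status];
}
```

Under that assumption, status 10 would be a remote access error and status 4 a local protection error, which is at least consistent with the failures being tied to RDMA operations.
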
> BTW, what kind of HCA are you using for this testing?

A pair of fairly new Mellanox cards.

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir at lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax
    
    