[ewg] RDS - Recovering from RDMA errors

Roland Dreier rdreier at cisco.com
Sun Jan 20 11:57:58 PST 2008


 > The part on affiliated asynchronous errors says WQ processing is stopped.
 > This also happens if we're signalling a remote access error to the
 > other host.
 > 
 > When an RDMA operation errors out because the remote side destroyed the MR,
 > the RDMA WQE completes with error 10 (remote access error), which is expected.
 > The other end sees an affiliated asynchronous error code 3 (remote access
 > error), which is also expected.

I don't see anything about stopping processing -- in the spec I see:

    For RC Service, the CI shall generate a Local Access Violation
    Work Queue Error when the transport layer detects a Request access
    violation at the Responder. The Responder's affiliated QP shall be
    placed in the error state.

so the QP on the target side should just transition into the error
state and flush all requests as usual.
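
To make the expected sequence concrete, here is a rough sketch of how a
consumer might fold completion statuses into its own view of the QP state.
It is only an illustration: every sketch_* name is made up, and the numeric
values are simply the ones quoted in this thread (10 for remote access
error, 5 for flushed), redefined locally rather than taken from a verbs
header.

```c
#include <assert.h>
#include <stdbool.h>

/* Numeric values as quoted in this thread; redefined locally so the
 * sketch does not depend on any verbs header. */
enum sketch_wc_status {
    SKETCH_WC_SUCCESS        = 0,
    SKETCH_WC_WR_FLUSH_ERR   = 5,   /* "error 5, flushed" */
    SKETCH_WC_REM_ACCESS_ERR = 10   /* "error 10 (remote access error)" */
};

/* Hypothetical, simplified view of the QP: usable or in error. */
enum sketch_qp_state { SKETCH_QP_RTS, SKETCH_QP_ERROR };

/* Any non-success completion means the QP is (now) in the error state;
 * a successful completion leaves the state unchanged. */
static enum sketch_qp_state
sketch_next_state(enum sketch_qp_state cur, int status)
{
    assert(cur == SKETCH_QP_RTS || cur == SKETCH_QP_ERROR);
    if (status == SKETCH_WC_SUCCESS)
        return cur;
    return SKETCH_QP_ERROR;
}

/* True for requests that were flushed rather than executed. */
static bool sketch_is_flush(int status)
{
    return status == SKETCH_WC_WR_FLUSH_ERR;
}
```

With this model, the failing RDMA (status 10) moves the QP to error, and
every request queued behind it should then come back with status 5.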

 > Now, on the sending system, I'm seeing send queue entries that do not get
 > completed. The RDMA itself is completed in error; the subsequent SEND
 > is completed (error 5, flushed) as well. But one or more entries seem to
 > remain on the queue - at least my book-keeping says so. I double checked
 > the book-keeping, and it seems accurate...

So it seems that asynchronous events aren't an issue anyway, since the
problem is on the other end?  In any case it shouldn't happen that
send requests don't get flushed, so something is wrong somewhere.
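
For comparing notes, the kind of book-keeping I have in mind looks roughly
like the sketch below. It is not RDS code: the tracker, the queue depth and
all sketch_* names are hypothetical. Each posted send marks a slot, each
completion (successful or not) clears it, and whatever is still marked
after the CQ has been drained is a request that never completed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_SQ_DEPTH 8   /* hypothetical send queue depth */

/* One flag per send queue slot, plus a running count. */
struct sketch_sq_tracker {
    bool   posted[SKETCH_SQ_DEPTH];
    size_t outstanding;
};

/* Record that a work request was posted in the given slot. */
static void sketch_post(struct sketch_sq_tracker *t, size_t slot)
{
    assert(slot < SKETCH_SQ_DEPTH && !t->posted[slot]);
    t->posted[slot] = true;
    t->outstanding++;
}

/* Record a completion for the slot, whatever its status was. */
static void sketch_complete(struct sketch_sq_tracker *t, size_t slot)
{
    assert(slot < SKETCH_SQ_DEPTH && t->posted[slot]);
    t->posted[slot] = false;
    t->outstanding--;
}

/* Collect up to max slots that were posted but never completed.
 * Returns how many were written to slots[]. */
static size_t sketch_leftover(const struct sketch_sq_tracker *t,
                              size_t *slots, size_t max)
{
    size_t n = 0;
    for (size_t i = 0; i < SKETCH_SQ_DEPTH; i++) {
        if (t->posted[i] && n < max)
            slots[n++] = i;
    }
    return n;
}
```

If sketch_leftover() reports anything once the QP is in the error state and
the CQ is empty, that is the symptom described above.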

 > >  > I tried destroying the QP first, then we know we can pick off
 > >  > any remaining WRs still allocated. That didn't work, as the card
 > >  > seems to generate interrupts even after the QP is gone. This results
 > >  > in lots of errors on the console complaining about "Completion to
 > >  > bogus CQ".
 > > 
 > > Destroying a QP should immediately stop work processing, so no
 > > completions should be generated once the destroy QP operation
 > > returns.  I don't see how you get the bogus CQ message in this case --
 > > it certainly seems like a driver bug.  Unless you mean you are
 > > destroying the CQ with a QP still attached?  But that shouldn't be
 > > possible because the CQ's usecnt should be non-zero until all attached
 > > QPs are freed.  Not sure what could be going on but it sounds bad...
 > 
 > This may be a driver bug, yes.
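
The usecnt argument above can be pictured with a toy model. This is a
simplified sketch, not the actual core or driver code: the structures and
sketch_* functions are invented for illustration, and returning -EBUSY is
an assumption about how a refused destroy would be reported.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy CQ: usecnt counts the QPs currently attached to it. */
struct sketch_cq {
    int usecnt;
    int destroyed;
};

struct sketch_qp {
    struct sketch_cq *send_cq;
};

/* Attaching a QP takes a reference on its CQ. */
static void sketch_create_qp(struct sketch_qp *qp, struct sketch_cq *cq)
{
    qp->send_cq = cq;
    cq->usecnt++;
}

/* Destroying the QP drops that reference. */
static void sketch_destroy_qp(struct sketch_qp *qp)
{
    assert(qp->send_cq != NULL && qp->send_cq->usecnt > 0);
    qp->send_cq->usecnt--;
    qp->send_cq = NULL;
}

/* The CQ can only go away once no QP refers to it any more. */
static int sketch_destroy_cq(struct sketch_cq *cq)
{
    if (cq->usecnt > 0)
        return -EBUSY;
    cq->destroyed = 1;
    return 0;
}
```

In this model a consumer cannot end up with a live QP pointing at a dead
CQ, which is why a "bogus CQ" completion points at the low-level driver.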

If you could send me some code and a recipe to get the "bogus CQ"
message, that might be helpful.  Because as far as I can see, there
shouldn't be any way for a consumer to get that message without a bug
in the low-level driver.  It's fine if it's a whole big RDS test case;
I just want to be able to run the test and instrument the low-level
driver to get a better handle on what's happening.

BTW, what kind of HCA are you using for this testing?

 - R.


