[openib-general] problem with SDP/AIO on mem-free HCA
Roland Dreier
roland at topspin.com
Thu Mar 31 12:53:48 PST 2005
[err, resending with a correct openib to: line]
I'm hitting a strange problem with SDP/AIO on a mem-free Arbel. My
test is the following: I run Libor's ttcp.aio program with default
parameters (which I think just leaves one AIO in flight at a time) as
follows:
ttcp.aio.x -r -s &
ttcp.aio.x -t -s 127.0.0.1
This always fails with a remote access error exactly 256K into the
test. I see the following in my log (with some extra tracing added to
SDP to get info on the RDMAs being posted):
WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <5d> at <1d94e000>/<1000>
WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <5e> at <1d94f000>/<1000>
WARN: <2> <050e:11b1> Posting SEND, wrid <5f>
WARN: <1> <050e:11b1> Posting SEND, wrid <20>
CRTL: <2> <050e:11b1> GETNAME: src <0d000002:1389> dst <0d000002:8001>
CRTL: <2> <050e:11b1> GETNAME: src <0d000002:1389> dst <0d000002:8001>
WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <60> at <1d94e000>/<0>
WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <61> at <1d94e000>/<1000>
WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <62> at <1d94f000>/<1000>
ib_mthca 0000:07:00.0: 86/66: error CQE -> QPN 000407, WQE @ 00001803
[ 0] 00000407
[ 4] b3000000
[ 8] fd000003
[ c] 110000c0
[10] 13880000
[14] 00000010
[18] 00001803
[1c] ff100000
WARN: : Unhandled status <10> unknown event <-1> wrid <60>
As you can see, the failed work request (wrid 60) is an RDMA read with
length 0. The earlier work request with wrid 5d, which has the same
R_Key and remote address but a length of 0x1000, appears to complete
successfully, so the FMR seems to be OK.
So I guess there are two questions:
- why is SDP doing a zero-length RDMA read?
- is it correct for this to fail with a remote access error?
I have not had a chance to test zero-length RDMA without involving
FMRs but I don't think the FMR code is to blame.
Also BTW, the code in sdp_cq_event_locked() is somewhat bogus: it
switches on comp->opcode even when comp->status is not success.
However, if the comp->status is not success, then per the IB spec,
mthca does not set the comp->opcode field.
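
A minimal sketch of the status-before-opcode ordering described above. The types and names here (struct wc, enum wc_status, enum wc_opcode, handle_completion) are hypothetical stand-ins for illustration, not the real verbs definitions and not the actual sdp_cq_event_locked() code. The status value 10 is chosen only to mirror the <10> in the log.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the completion types (assumptions, not kernel code). */
enum wc_status { WC_SUCCESS = 0, WC_REM_ACCESS_ERR = 10 };
enum wc_opcode { WC_SEND, WC_RDMA_READ };

struct wc {
    unsigned long long wr_id;
    enum wc_status status;
    enum wc_opcode opcode;   /* only meaningful when status == WC_SUCCESS */
};

enum action { ACT_SEND_DONE, ACT_READ_DONE, ACT_ERROR };

/* Decide what to do with a completion; never look at opcode on failure. */
static enum action handle_completion(const struct wc *wc)
{
    /* Check status first: on an error completion the opcode may be unset. */
    if (wc->status != WC_SUCCESS) {
        fprintf(stderr, "wrid <%llx> failed with status <%d>\n",
                wc->wr_id, (int)wc->status);
        return ACT_ERROR;
    }

    /* Only a successful completion carries a valid opcode. */
    switch (wc->opcode) {
    case WC_SEND:
        return ACT_SEND_DONE;
    case WC_RDMA_READ:
        return ACT_READ_DONE;
    }
    return ACT_ERROR;        /* unknown opcode */
}
```

With this ordering, an error completion is identified by its wr_id alone, so the handler has to keep its own record of which operation each wrid was posted for rather than relying on the opcode field.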
- R.