[openib-general] Re: problem with SDP/AIO on mem-free HCA
Michael S. Tsirkin
mst at mellanox.co.il
Thu Mar 31 15:10:23 PST 2005
Quoting r. Roland Dreier <roland at topspin.com>:
> Subject: problem with SDP/AIO on mem-free HCA
>
> [err, resending with a correct openib to: line]
>
> I'm hitting a strange problem with SDP/AIO on a mem-free Arbel. My
> test is the following: I run Libor's ttcp.aio program with default
> parameters (which I think just leaves one AIO in flight at a time) as
> follows:
>
> ttcp.aio.x -r -s &
> ttcp.aio.x -t -s 127.0.0.1
>
> This always fails with a remote access error exactly 256K into the
> test. I see the following in my log (with some extra tracing added to
> SDP to get info on the RDMAs being posted):
>
> WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <5d> at <1d94e000>/<1000>
> WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <5e> at <1d94f000>/<1000>
> WARN: <2> <050e:11b1> Posting SEND, wrid <5f>
> WARN: <1> <050e:11b1> Posting SEND, wrid <20>
> CRTL: <2> <050e:11b1> GETNAME: src <0d000002:1389> dst <0d000002:8001>
> CRTL: <2> <050e:11b1> GETNAME: src <0d000002:1389> dst <0d000002:8001>
> WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <60> at <1d94e000>/<0>
> WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <61> at <1d94e000>/<1000>
> WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <62> at <1d94f000>/<1000>
> ib_mthca 0000:07:00.0: 86/66: error CQE -> QPN 000407, WQE @ 00001803
> [ 0] 00000407
> [ 4] b3000000
> [ 8] fd000003
> [ c] 110000c0
> [10] 13880000
> [14] 00000010
> [18] 00001803
> [1c] ff100000
> WARN: : Unhandled status <10> unknown event <-1> wrid <60>
>
> As you can see, the failed work request is an RDMA with length 0. The
> previous work request with wrid 5d with the same R_Key and remote
> address but a length of 0x1000 appears to complete successfully so the
> FMR seems to be OK.
>
> So I guess there are two questions:
> - why is SDP doing a zero-length RDMA read?
> - is it correct for this to fail with a remote access error?
> I have not had a chance to test zero-length RDMA without involving
> FMRs but I don't think the FMR code is to blame.
I don't think so.
I found this:
C9-88: For an HCA responder using Reliable Connection service, for
each zero-length RDMA READ or WRITE request, the R_Key shall not be
validated, even if the request includes Immediate data.
Could it be that you generate a non-zero-length RDMA in mthca?
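To make the C9-88 reading concrete, here is a rough sketch of how a responder-side check could treat a zero-length request. This is not mthca or HCA firmware code: struct mr_entry, rdma_read_allowed() and the region layout are hypothetical names made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory-region record; not an mthca or IB core structure. */
struct mr_entry {
    uint32_t rkey;        /* R_Key the region was registered with */
    uint64_t start;       /* first remote address covered by the region */
    uint64_t length;      /* size of the region in bytes */
    bool     remote_read; /* remote read access granted */
};

/* Hypothetical responder-side check for an incoming RDMA READ request. */
static bool rdma_read_allowed(const struct mr_entry *mr, uint32_t rkey,
                              uint64_t addr, uint32_t len)
{
    /* C9-88: for a zero-length request the R_Key is not validated. */
    if (len == 0)
        return true;

    /* Non-zero length: the R_Key and access rights must match. */
    if (mr == NULL || mr->rkey != rkey || !mr->remote_read)
        return false;

    /* The requested range must lie entirely inside the region. */
    if (addr < mr->start || addr - mr->start > mr->length)
        return false;
    return len <= mr->length - (addr - mr->start);
}
```

Under this reading, a request like wrid 60 (length 0) would pass no matter what R_Key it carries, so a remote access error suggests the request on the wire did not actually have length zero.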
> Also BTW, the code in sdp_cq_event_locked() is somewhat bogus: it
> switches on comp->opcode even when comp->status is not success.
> However, if the comp->status is not success, then per the IB spec,
> mthca does not set the comp->opcode field.
>
> - R.
>
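On the sdp_cq_event_locked() point, a minimal sketch of the ordering Roland describes: look at the status first, and switch on the opcode only for successful completions. The types below are simplified stand-ins invented for this example, not the real ib_wc definitions or the SDP code.

```c
#include <stdint.h>

/* Simplified stand-ins for the completion fields discussed above. */
enum wc_status { WC_SUCCESS = 0, WC_REM_ACCESS_ERR };
enum wc_opcode { WC_SEND, WC_RDMA_READ, WC_RECV };

struct completion {
    uint64_t       wr_id;
    enum wc_status status;
    enum wc_opcode opcode; /* meaningful only when status == WC_SUCCESS */
};

enum action { HANDLE_SEND, HANDLE_RDMA_READ, HANDLE_RECV, HANDLE_ERROR };

/* Hypothetical dispatcher: never reads opcode on a failed completion. */
static enum action classify_completion(const struct completion *comp)
{
    if (comp->status != WC_SUCCESS)
        return HANDLE_ERROR; /* identify the failed request by wr_id instead */

    switch (comp->opcode) {
    case WC_SEND:      return HANDLE_SEND;
    case WC_RDMA_READ: return HANDLE_RDMA_READ;
    case WC_RECV:      return HANDLE_RECV;
    }
    return HANDLE_ERROR; /* unknown opcode */
}
```

With this ordering the error path has to recover the request type from the wr_id (for example by looking it up in the table of posted work requests), since the opcode cannot be trusted.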
--
MST - Michael S. Tsirkin