[openib-general] Re: problem with SDP/AIO on mem-free HCA

Michael S. Tsirkin mst at mellanox.co.il
Thu Mar 31 15:10:23 PST 2005


Quoting r. Roland Dreier <roland at topspin.com>:
> Subject: problem with SDP/AIO on mem-free HCA
> 
> [err, resending with a correct openib to: line]
> 
> I'm hitting a strange problem with SDP/AIO on a mem-free Arbel.  My
> test is the following: I run Libor's ttcp.aio program with default
> parameters (which I think just leaves one AIO in flight at a time) as
> follows:
> 
>  ttcp.aio.x -r -s &
>  ttcp.aio.x -t -s 127.0.0.1
> 
> This always fails with a remote access error exactly 256K into the
> test.  I see the following in my log (with some extra tracing added to
> SDP to get info on the RDMAs being posted):
> 
>     WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <5d> at <1d94e000>/<1000>
>     WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <5e> at <1d94f000>/<1000>
>     WARN: <2> <050e:11b1> Posting SEND, wrid <5f>
>     WARN: <1> <050e:11b1> Posting SEND, wrid <20>
>     CRTL: <2> <050e:11b1> GETNAME: src <0d000002:1389> dst <0d000002:8001>
>     CRTL: <2> <050e:11b1> GETNAME: src <0d000002:1389> dst <0d000002:8001>
>     WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <60> at <1d94e000>/<0>
>     WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <61> at <1d94e000>/<1000>
>     WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <62> at <1d94f000>/<1000>
>     ib_mthca 0000:07:00.0: 86/66: error CQE -> QPN 000407, WQE @ 00001803
>       [ 0] 00000407
>       [ 4] b3000000
>       [ 8] fd000003
>       [ c] 110000c0
>       [10] 13880000
>       [14] 00000010
>       [18] 00001803
>       [1c] ff100000
>     WARN: : Unhandled status <10> unknown event <-1> wrid <60>
> 
> As you can see, the failed work request is an RDMA with length 0.  The
> previous work request with wrid 5d with the same R_Key and remote
> address but a length of 0x1000 appears to complete successfully so the
> FMR seems to be OK.
> 
> So I guess there are two questions:
>  - why is SDP doing a zero-length RDMA read?
>  - is it correct for this to fail with a remote access error?
>    I have not had a chance to test zero-length RDMA without involving
>    FMRs but I don't think the FMR code is to blame.

I don't think so.
I found this:

C9-88: For an HCA responder using Reliable Connection service, for
each zero-length RDMA READ or WRITE request, the R_Key shall not be
validated, even if the request includes Immediate data.

Could it be that you generate a non-zero-length RDMA in mthca?
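
To make the C9-88 rule concrete, here is a rough sketch of the check a
responder would apply to an incoming RDMA READ. This is illustration
only, not mthca or firmware code: struct mr_sketch and
rdma_read_request_ok() are made-up names, and the region layout is an
assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory region descriptor, for illustration only. */
struct mr_sketch {
    uint32_t rkey;        /* R_Key the region was registered with */
    uint64_t start;       /* first remote address covered */
    uint64_t length;      /* size of the region in bytes */
    bool     remote_read; /* remote read access granted? */
};

/* Returns true if the responder should accept the RDMA READ request. */
static bool rdma_read_request_ok(const struct mr_sketch *mr, uint32_t rkey,
                                 uint64_t addr, uint32_t len)
{
    /* C9-88: for a zero-length request the R_Key is not validated. */
    if (len == 0)
        return true;

    /* Non-zero length: the R_Key and access rights must match. */
    if (mr == NULL || mr->rkey != rkey || !mr->remote_read)
        return false;

    /* Bounds check, written so that addr + len cannot overflow. */
    if (addr < mr->start)
        return false;
    if (len > mr->length || addr - mr->start > mr->length - len)
        return false;

    return true;
}
```

With a check like this, a request such as wrid 60 in the log (length 0)
would be accepted no matter what R_Key it carries, so a remote access
error would only be expected if the length on the wire is not really
zero.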


> Also BTW, the code in sdp_cq_event_locked() is somewhat bogus: it
> switches on comp->opcode even when comp->status is not success.
> However, if the comp->status is not success, then per the IB spec,
> mthca does not set the comp->opcode field.
> 
>  - R.
> 
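
On the sdp_cq_event_locked() point, the usual pattern is to test the
status first and only look at the opcode for successful completions.
A rough sketch follows; it is not the actual SDP code, and the types
and names here (struct sketch_wc, handle_completion() and the enums)
are stand-ins I made up for the real verbs structures.

```c
/* Stand-ins for the verbs status/opcode enums, for illustration only. */
enum sketch_status { SKETCH_SUCCESS, SKETCH_ERROR };
enum sketch_opcode { SKETCH_SEND, SKETCH_RDMA_READ, SKETCH_RECV };

struct sketch_wc {
    unsigned long long wr_id;
    enum sketch_status status;
    enum sketch_opcode opcode; /* only meaningful when status is success */
};

enum sketch_action {
    ACT_SEND_DONE,
    ACT_READ_DONE,
    ACT_RECV_DONE,
    ACT_FAIL_WR /* fail the request identified by wr_id */
};

static enum sketch_action handle_completion(const struct sketch_wc *wc)
{
    /* Check status first: on error the opcode field is not valid,
     * so the failed request has to be identified by wr_id alone. */
    if (wc->status != SKETCH_SUCCESS)
        return ACT_FAIL_WR;

    switch (wc->opcode) {
    case SKETCH_SEND:
        return ACT_SEND_DONE;
    case SKETCH_RDMA_READ:
        return ACT_READ_DONE;
    case SKETCH_RECV:
        return ACT_RECV_DONE;
    }

    /* Unknown opcode: treat it as a failure rather than guessing. */
    return ACT_FAIL_WR;
}
```

The caller would then map wr_id back to the posted request to decide
what kind of operation failed.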

-- 
MST - Michael S. Tsirkin


