[openib-general] problem with SDP/AIO on mem-free HCA

Roland Dreier roland at topspin.com
Thu Mar 31 12:53:48 PST 2005


[err, resending with a correct openib to: line]

I'm hitting a strange problem with SDP/AIO on a mem-free Arbel.  My
test is the following: I run Libor's ttcp.aio program with default
parameters (which I think just leaves one AIO in flight at a time) as
follows:

 ttcp.aio.x -r -s &
 ttcp.aio.x -t -s 127.0.0.1

This always fails with a remote access error exactly 256K into the
test.  I see the following in my log (with some extra tracing added to
SDP to get info on the RDMAs being posted):

    WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <5d> at <1d94e000>/<1000>
    WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <5e> at <1d94f000>/<1000>
    WARN: <2> <050e:11b1> Posting SEND, wrid <5f>
    WARN: <1> <050e:11b1> Posting SEND, wrid <20>
    CRTL: <2> <050e:11b1> GETNAME: src <0d000002:1389> dst <0d000002:8001>
    CRTL: <2> <050e:11b1> GETNAME: src <0d000002:1389> dst <0d000002:8001>
    WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <60> at <1d94e000>/<0>
    WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <61> at <1d94e000>/<1000>
    WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <62> at <1d94f000>/<1000>
    ib_mthca 0000:07:00.0: 86/66: error CQE -> QPN 000407, WQE @ 00001803
      [ 0] 00000407
      [ 4] b3000000
      [ 8] fd000003
      [ c] 110000c0
      [10] 13880000
      [14] 00000010
      [18] 00001803
      [1c] ff100000
    WARN: : Unhandled status <10> unknown event <-1> wrid <60>

As you can see, the failed work request (wrid 60) is an RDMA read with
length 0.  The earlier work request with wrid 5d, which has the same
R_Key and remote address but a length of 0x1000, appears to complete
successfully, so the FMR seems to be OK.

So I guess there are two questions:
 - why is SDP doing a zero-length RDMA read?
 - is it correct for this to fail with a remote access error?
   I have not had a chance to test zero-length RDMA without involving
   FMRs but I don't think the FMR code is to blame.

Also, BTW, the code in sdp_cq_event_locked() is somewhat bogus: it
switches on comp->opcode even when comp->status is not success.
However, per the IB spec, mthca does not set the comp->opcode field
when comp->status is not success.
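
A minimal sketch of the ordering suggested above: check the status first
and only look at the opcode for successful completions.  This is not the
actual SDP code.  The struct, enums, and the classify_completion() name
are hypothetical stand-ins so the example compiles on its own; only the
comp->status / comp->opcode field names follow the text, and the value 10
for the remote access error is chosen to match the <10> in the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the verbs completion types (not the real headers). */
enum wc_status { WC_SUCCESS = 0, WC_REM_ACCESS_ERR = 10 };
enum wc_opcode { WC_SEND, WC_RDMA_READ, WC_RECV };

struct wc {
    uint64_t       wr_id;
    enum wc_status status;
    enum wc_opcode opcode;  /* only meaningful when status == WC_SUCCESS */
};

/* Hypothetical result type describing what the handler should do next. */
enum cq_action {
    ACT_SEND_DONE,
    ACT_READ_DONE,
    ACT_RECV_DONE,
    ACT_ERROR,
    ACT_UNKNOWN
};

/*
 * Decide how to handle a completion.  The status is tested before the
 * opcode, so a failed completion never reaches the switch on an opcode
 * field the driver may have left unset.
 */
enum cq_action classify_completion(const struct wc *comp)
{
    assert(comp != NULL);

    if (comp->status != WC_SUCCESS)
        return ACT_ERROR;  /* opcode is not trusted on this path */

    switch (comp->opcode) {
    case WC_SEND:
        return ACT_SEND_DONE;
    case WC_RDMA_READ:
        return ACT_READ_DONE;
    case WC_RECV:
        return ACT_RECV_DONE;
    default:
        return ACT_UNKNOWN;
    }
}
```

With this ordering, a completion like the failed wrid 60 would be routed to
the error path by its status and wr_id alone, whatever happens to be in the
opcode field.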

 - R.





