[ofa-general] SDP and iWARP
Steve Wise
swise at opengridcomputing.com
Thu Jan 10 07:19:48 PST 2008
Craig Prescott wrote:
> Steve Wise wrote:
>>
>> First make sure the sdp kernel module uses the rdma cma. Then I'd
>> add printk hooks in cma.c, addr.c, and iwcm.c to see what's going on
>> and where things are failing. Also a wire trace is good if we're
>> getting that far (like at least doing arp resolution).
>>
>
> Small update - a little progress. printk's sprinkled liberally and
> ib_sdp debug options turned on. The initial problem was on the
> listener during an IW_CM_EVENT_CONNECT_REQUEST event; the SDP hello
> header was rejected in sdp_cma.c:sdp_connect_handler() because its
> max_adverts field was zero, which is not permissible. In fact, all
> of the sdp_hh fields were zero.
>
> Comparing with the RDMA_TRANSPORT_IB case, I saw that
> cma.c:cma_connect_ib() does some work to create the SDP header
> via cma_format_hdr(). But cma_connect_iw() did not.
>
Why is this SDP protocol stuff done in the CMA?? That seems like a
layer violation...
> I patched cma_connect_iw() to create the SDP header as
> cma_connect_ib() does. This gets us farther - examining the
> SDP header on the listener side looks right now, and the
> listener at least enters rdma_accept(), but iw_cm_accept()
> fails due to cm_id->device->iwcm->accept(cm_id, iw_param)
> returning -104.
104 == ECONNRESET, so the client side must have reset the connection.
Did this happen after 10 seconds? (There's a 10 second MPA negotiation
timeout in the Chelsio CM.) Also, a wire trace might be useful. If
this reset happens immediately, then you should look on the client side
and see why it reset the connection.
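
For double-checking errno values like the -104 above, here is a minimal
userspace sketch in Python (not kernel code). The helper name
decode_kernel_rc is made up for illustration, and the mapping is taken from
the machine the script runs on, so it assumes a Linux host where ECONNRESET
is 104.

```python
import errno
import os


def decode_kernel_rc(rc):
    """Map a kernel-style return code (e.g. -104) to (name, message).

    Hypothetical helper: kernel functions return -errno on failure, so the
    sign is dropped before the lookup.
    """
    code = -rc if rc < 0 else rc
    # errno.errorcode maps numeric errno values to their symbolic names
    # on the platform the script is running on.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    name, message = decode_kernel_rc(-104)
    print(name, "-", message)
```

Run it on the same kind of box that logged the error, since errno numbering
differs between operating systems.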
> The above call also emits a couple of messages
> into the listener's syslog now :
>
> Jan 9 21:53:54 tebow2 kernel: iwch_ev_dispatch - CQE Err qpid 0x20
> opcode 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
> Jan 9 21:53:54 tebow2 kernel: post_qp_event - AE qpid 0x20 opcode 14
> status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
>
This is an async event generated due to a failure processing an SQ WR, I
think.
Opcodes and status codes for iw_cxgb3 are in cxio_wr.h:
type 1 means it was an egress (SQ) failure,
status 0x6 is a base/bounds violation,
but opcode 14 seems incorrect. That's not a valid T3 opcode. ????
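
If more of these show up in syslog, a rough Python sketch like the one below
can pull the fields out of the message and label the ones decoded above. The
function name decode_cqe_line and both lookup tables are illustrative only:
the tables contain just the two values mentioned in this mail (the real
definitions are in cxio_wr.h), and the sketch assumes the opcode and type
fields are printed in decimal and that the log line has been unwrapped onto a
single line.

```python
import re

# Only the values discussed in this thread are listed here. The
# authoritative tables live in cxio_wr.h; anything else is reported
# as unknown rather than guessed.
CQE_TYPE = {1: "egress (SQ)"}
CQE_STATUS = {0x6: "base/bounds violation"}
UNKNOWN = "unknown (see cxio_wr.h)"

# Matches the field layout seen in the iwch_ev_dispatch / post_qp_event
# messages quoted above. Assumes opcode and type are decimal.
LINE_RE = re.compile(
    r"qpid (?P<qpid>0x[0-9a-fA-F]+)\s+opcode (?P<opcode>\d+)\s+"
    r"status (?P<status>0x[0-9a-fA-F]+)\s+type (?P<type>\d+)"
)


def decode_cqe_line(line):
    """Return a dict of decoded fields, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    status = int(m.group("status"), 16)
    cqe_type = int(m.group("type"))
    return {
        "qpid": int(m.group("qpid"), 16),
        "opcode": int(m.group("opcode")),
        "status": CQE_STATUS.get(status, UNKNOWN),
        "type": CQE_TYPE.get(cqe_type, UNKNOWN),
    }


if __name__ == "__main__":
    sample = (
        "iwch_ev_dispatch - CQE Err qpid 0x20 opcode 14 status 0x6 "
        "type 1 wrid.hi 0x0 wrid.lo 0x80000000"
    )
    print(decode_cqe_line(sample))
```

The opcode is left as a raw number on purpose, since 14 does not correspond
to anything I recognize.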
> In the end, we still end up in rdma_reject(). Will keep digging.
>
> Cheers,
> Craig