[ofa-general] SDP and iWARP

Steve Wise swise at opengridcomputing.com
Thu Jan 10 07:19:48 PST 2008



Craig Prescott wrote:
> Steve Wise wrote:
>>
>> First make sure the sdp kernel module uses the rdma cma.  Then I'd 
>> add printk hooks in cma.c, addr.c, and iwcm.c to see what's going on 
>> and where things are failing.  Also a wire trace is good if we're 
>> getting that far (like at least doing arp resolution).
>>
>
> Small update - a little progress.  printk's sprinkled liberally and
> ib_sdp debug options turned on.  The initial problem was on the
> listener during an IW_CM_EVENT_CONNECT_REQUEST event; the SDP hello 
> header was rejected in sdp_cma.c:sdp_connect_handler() because its
> max_adverts field was zero, which is not permissible.  In fact, all
> of the sdp_hh fields were zero.
>
> Comparing with the RDMA_TRANSPORT_IB case, I saw that 
> cma.c:cma_connect_ib() does some work to create the SDP header
> via cma_format_hdr().  But cma_connect_iw() did not.
>
Why is this SDP protocol stuff done in the CMA??  That seems like a 
layer violation...
> I patched cma_connect_iw() to create the SDP header as
> cma_connect_ib() does.  This gets us farther - examining the
> SDP header on the listener side looks right now, and the
> listener at least enters rdma_accept(), but iw_cm_accept()
> fails due to cm_id->device->iwcm->accept(cm_id, iw_param)
> returning -104.  
104 == ECONNRESET, so the client side must have reset the connection.  
Did this happen after 10 seconds?  (There's a 10 second MPA negotiation 
timeout in the Chelsio cm.)  Also, a wire trace might be useful.  If 
this reset happens immediately, then you should look on the client side 
and see why it reset the connection.
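
As an aside, a quick way to double-check what a negative return code like
-104 means is to look it up in the errno table of the machine you are
debugging on.  The snippet below is only a minimal sketch; the helper name
decode_kernel_errno is made up for illustration, and errno numbering is
platform specific (the 104 == ECONNRESET mapping above is the Linux one).

```python
import errno
import os


def decode_kernel_errno(ret):
    """Return the symbolic errno name for a kernel-style return code.

    Hypothetical helper: kernel functions return 0 or a positive value on
    success and -errno on failure, so we negate before looking it up.
    """
    if ret >= 0:
        return "OK"
    # errno.errorcode maps numeric values to names for the local platform.
    return errno.errorcode.get(-ret, "unknown errno %d" % -ret)


if __name__ == "__main__":
    ret = -104  # value returned by the iwcm accept() call in this thread
    print(ret, decode_kernel_errno(ret), os.strerror(-ret))
```

Run on the Linux box in question, this should agree with the ECONNRESET
reading above; on other platforms the number may map to something else.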

> The above call also emits a couple of messages
> into the listener's syslog now :
>
> Jan  9 21:53:54 tebow2 kernel: iwch_ev_dispatch - CQE Err qpid 0x20 
> opcode 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
> Jan  9 21:53:54 tebow2 kernel: post_qp_event - AE qpid 0x20 opcode 14 
> status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
>
This is an async event generated due to a failure processing an SQ WR, I 
think.  Opcodes and status codes for iw_cxgb3 are in cxio_wr.h.

type 1 means it was an egress (SQ) failure,
status 0x6 is a base/bounds violation,
but opcode 14 seems incorrect.  That's not a valid T3 opcode, which is odd.
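
If more of these show up, it may be handy to pull the fields out of the
syslog line mechanically.  The following is a rough sketch, not anything
from the driver: parse_cqe_err and annotate are hypothetical helper names,
and the two lookup tables contain only the interpretations given in this
mail (they are not copied from cxio_wr.h).  Values without a 0x prefix,
such as the opcode, are read as decimal, which is an assumption about how
the driver prints them.

```python
import re

# Interpretations taken from this thread only, not from cxio_wr.h.
TYPE_NOTES = {1: "egress (SQ) failure"}
STATUS_NOTES = {0x6: "base/bounds violation"}

# A field is a name followed by a hex (0x...) or decimal number.
FIELD_RE = re.compile(r"([A-Za-z_.]+)\s+(0x[0-9a-fA-F]+|\d+)\b")


def parse_cqe_err(line):
    """Return the name/value pairs that follow 'CQE Err' in a log line."""
    _, marker, tail = line.partition("CQE Err")
    if not marker:
        return {}
    # int(text, 0) accepts "0x6" as hex and "14" as decimal.
    return {name: int(value, 0) for name, value in FIELD_RE.findall(tail)}


def annotate(fields):
    """Attach the notes from this thread to the type and status fields."""
    notes = []
    if fields.get("type") in TYPE_NOTES:
        notes.append("type %d: %s" % (fields["type"], TYPE_NOTES[fields["type"]]))
    if fields.get("status") in STATUS_NOTES:
        notes.append("status %#x: %s" % (fields["status"], STATUS_NOTES[fields["status"]]))
    return notes


if __name__ == "__main__":
    line = ("Jan  9 21:53:54 tebow2 kernel: iwch_ev_dispatch - CQE Err "
            "qpid 0x20 opcode 14 status 0x6 type 1 "
            "wrid.hi 0x0 wrid.lo 0x80000000")
    fields = parse_cqe_err(line)
    print(fields)
    for note in annotate(fields):
        print(note)
```

The opcode is deliberately left unannotated, since 14 does not look like a
valid T3 opcode and needs to be checked against cxio_wr.h by hand.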


 
> In the end, we still end up in rdma_reject().  Will keep digging.
>
> Cheers,
> Craig


