[ofa-general] SDP and iWARP

Steve Wise swise at opengridcomputing.com
Thu Jan 10 10:05:15 PST 2008


Craig Prescott wrote:
> Steve Wise wrote:
>>
>> Craig Prescott wrote:
>>>
>>> I patched cma_connect_iw() to create the SDP header as
>>> cma_connect_ib() does.  This gets us farther - examining the
>>> SDP header on the listener side looks right now, and the
>>> listener at least enters rdma_accept(), but iw_cm_accept()
>>> fails due to cm_id->device->iwcm->accept(cm_id, iw_param)
>>> returning -104.  
>> 104 == ECONNRESET, so the client side must have reset the connection.  
>> Did this happen after 10 seconds?  (There's a 10 second MPA negotiation 
>> timeout in the Chelsio cm.)  Also, a wire trace might be useful.  If 
>> this reset happens immediately, then you should look on the client 
>> side and see why it reset the connection.
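
As a side note, here is a minimal sketch of how a negative kernel-style
return code such as the -104 above can be turned into a readable errno
description from user space.  The helper name describe_kernel_rc() is
hypothetical and not part of any OFED or kernel API; the numeric value
104 for ECONNRESET is what Linux uses on common architectures and may
differ elsewhere.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: kernel code conventionally returns -errno on
 * failure, so negate the code before looking it up.  Returns NULL when
 * rc is not negative (i.e. not an error by that convention).
 */
const char *describe_kernel_rc(int rc)
{
    if (rc >= 0)
        return NULL;
    return strerror(-rc);
}
```

Calling describe_kernel_rc(-ECONNRESET) should yield the platform's
"connection reset" message, which matches the reading of -104 given above.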
> 
> The reset happens after 10 seconds.
> 
> Here is tcpdump output from the netperf client host (tebow1):
> 
> 12:00:17.156120 arp who-has tebow2.hpc.ufl.edu tell tebow1.hpc.ufl.edu
> 12:00:17.156178 arp reply tebow2.hpc.ufl.edu is-at 00:07:43:05:11:8a 
> (oui Unknown)
> 12:00:27.180401 IP tebow1.hpc.ufl.edu.41353 > tebow2.hpc.ufl.edu.12865: 
> S 697245480:697245480(0) win 17920 <mss 8960,nop,wscale 9>
> 12:00:30.180571 IP tebow1.hpc.ufl.edu.41353 > tebow2.hpc.ufl.edu.12865: 
> S 697245480:697245480(0) win 17920 <mss 8960,nop,wscale 9>
> 12:00:30.180616 IP tebow2.hpc.ufl.edu.12865 > tebow1.hpc.ufl.edu.41353: 
> S 1878582380:1878582380(0) ack 697245481 win 65535 <mss 8960,nop,wscale 3>
> 12:00:30.180630 IP tebow1.hpc.ufl.edu.41353 > tebow2.hpc.ufl.edu.12865: 
> . ack 1 win 35
> 12:00:30.255717 IP tebow1.hpc.ufl.edu.41353 > tebow2.hpc.ufl.edu.12865: 
> P 1:257(256) ack 1 win 35

I think the above packet is the MPA start request, carrying the SDP 
hello as private data.
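
To check that guess against a capture, one could decode the TCP payload
as an MPA request frame.  The sketch below assumes the RFC 5044 layout
(16-byte key "MPA ID Req Frame", one flags byte, one revision byte, and a
16-bit big-endian private data length); the struct and function names are
made up for illustration and are not taken from iw_cxgb3 or the iwcm.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MPA_KEY_LEN 16
#define MPA_HDR_LEN 20                  /* key + flags + rev + pd_len */
#define MPA_REQ_KEY "MPA ID Req Frame"  /* request key per RFC 5044 */

/* Hypothetical container for the decoded header fields. */
struct mpa_req_info {
    uint8_t flags;                /* M/C/R bits live in the top 3 bits */
    uint8_t revision;
    uint16_t pd_len;              /* private data length in bytes */
    const uint8_t *private_data;  /* points into the caller's buffer */
};

/* Returns 0 on success, -1 if buf does not hold a full MPA request. */
int parse_mpa_request(const uint8_t *buf, size_t len,
                      struct mpa_req_info *out)
{
    if (len < MPA_HDR_LEN)
        return -1;
    if (memcmp(buf, MPA_REQ_KEY, MPA_KEY_LEN) != 0)
        return -1;

    out->flags = buf[16];
    out->revision = buf[17];
    out->pd_len = (uint16_t)((buf[18] << 8) | buf[19]);

    /* The private data must fit in what was actually captured. */
    if ((size_t)out->pd_len > len - MPA_HDR_LEN)
        return -1;

    out->private_data = buf + MPA_HDR_LEN;
    return 0;
}
```

If the 256-byte payload in the trace is a single complete MPA request,
pd_len would come out as 236, and private_data would point at the start
of what should be the SDP hello.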

> 12:00:30.255753 IP tebow2.hpc.ufl.edu.12865 > tebow1.hpc.ufl.edu.41353: 
> . ack 257 win 32736
> 12:00:30.255763 IP tebow2.hpc.ufl.edu.12865 > tebow1.hpc.ufl.edu.41353: 
> R 1:1(0) ack 257 win 0

And then nothing happens from the listening side, so the mpa-start reply 
never comes out.

> 
> On the netserver host (tebow2), we see only the initial arp.
> 
>>> The above call also emits a couple of messages
>>> into the listener's syslog now :
>>>
>>> Jan  9 21:53:54 tebow2 kernel: iwch_ev_dispatch - CQE Err qpid 0x20 
>>> opcode 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
>>> Jan  9 21:53:54 tebow2 kernel: post_qp_event - AE qpid 0x20 opcode 14 
>>> status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
>>>
>> This is an async event generated due to a failure processing an SQ WR, 
>> I think.  Opcodes and status codes for iw_cxgb3 are in cxio_wr.h:
>> type 1 means it was an egress (SQ) failure,
>> status 0x6 is a base/bounds violation,
>> but opcode 14 seems incorrect.  That's not a valid T3 opcode. ????
>>
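
For comparing several of these log lines against the definitions in
cxio_wr.h, a small user-space sketch like the following could pull the
numeric fields out of one message.  It assumes the field order shown in
the syslog excerpt above, with the message on a single line; the struct
and function names are hypothetical, and the code makes no attempt to
name the opcode or status values.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the fields printed in the CQE error message. */
struct cqe_err_fields {
    unsigned int qpid;
    unsigned int opcode;
    unsigned int status;
    unsigned int type;
    unsigned int wrid_hi;
    unsigned int wrid_lo;
};

/* Returns 0 if all six fields were found, -1 otherwise. */
int parse_cqe_err_line(const char *line, struct cqe_err_fields *f)
{
    /* Skip the timestamp, host, and function-name prefix. */
    const char *p = strstr(line, "qpid ");
    if (p == NULL)
        return -1;

    /* %x accepts an optional 0x prefix; opcode and type are decimal. */
    int n = sscanf(p,
                   "qpid %x opcode %u status %x type %u "
                   "wrid.hi %x wrid.lo %x",
                   &f->qpid, &f->opcode, &f->status, &f->type,
                   &f->wrid_hi, &f->wrid_lo);
    return n == 6 ? 0 : -1;
}
```

The decoded opcode, status, and type can then be looked up by hand in
cxio_wr.h.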
> 
> Ok, thanks!  I guess I'm not sure what to make of that yet, though.
> 

See where in iwch_accept_cr() the failure is happening.  It doesn't look 
like send_mpa_reply() is being called.



