[ofa-general] SDP and iWARP

Craig Prescott prescott at hpc.ufl.edu
Thu Jan 10 09:47:54 PST 2008


Steve Wise wrote:
> 
> Craig Prescott wrote:
>> 
>> I patched cma_connect_iw() to create the SDP header as
>> cma_connect_ib() does.  This gets us farther - examining the
>> SDP header on the listener side looks right now, and the
>> listener at least enters rdma_accept(), but iw_cm_accept()
>> fails due to cm_id->device->iwcm->accept(cm_id, iw_param)
>> returning -104.  
> 104 == ECONNRESET, so the client side must have reset the connection.
> Did this happen after 10 seconds?  (There's a 10 second MPA negotiation
> timeout in the Chelsio CM.)  Also, a wire trace might be useful.  If
> this reset happens immediately, then you should look on the client side
> and see why it reset the connection.

The reset happens after 10 seconds.
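
As a quick cross-check of the error code, here is a minimal Python sketch
that turns a negative kernel-style return value into its symbolic name.
The helper name decode_errno is made up for illustration, and the
numeric value 104 for ECONNRESET is the Linux numbering; other platforms
may number it differently.

```python
import errno
import os


def decode_errno(ret):
    """Return the symbolic errno name for a negative kernel-style return value."""
    if ret >= 0:
        raise ValueError("expected a negative errno-style return value")
    # errno.errorcode maps this platform's numeric codes to their names.
    return errno.errorcode.get(-ret, "UNKNOWN")


if __name__ == "__main__":
    ret = -104  # value returned by the iwcm accept() call quoted above
    print(ret, decode_errno(ret), os.strerror(-ret))
```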

Here is tcpdump output from the netperf client host (tebow1):

12:00:17.156120 arp who-has tebow2.hpc.ufl.edu tell tebow1.hpc.ufl.edu
12:00:17.156178 arp reply tebow2.hpc.ufl.edu is-at 00:07:43:05:11:8a (oui Unknown)
12:00:27.180401 IP tebow1.hpc.ufl.edu.41353 > tebow2.hpc.ufl.edu.12865: S 697245480:697245480(0) win 17920 <mss 8960,nop,wscale 9>
12:00:30.180571 IP tebow1.hpc.ufl.edu.41353 > tebow2.hpc.ufl.edu.12865: S 697245480:697245480(0) win 17920 <mss 8960,nop,wscale 9>
12:00:30.180616 IP tebow2.hpc.ufl.edu.12865 > tebow1.hpc.ufl.edu.41353: S 1878582380:1878582380(0) ack 697245481 win 65535 <mss 8960,nop,wscale 3>
12:00:30.180630 IP tebow1.hpc.ufl.edu.41353 > tebow2.hpc.ufl.edu.12865: . ack 1 win 35
12:00:30.255717 IP tebow1.hpc.ufl.edu.41353 > tebow2.hpc.ufl.edu.12865: P 1:257(256) ack 1 win 35
12:00:30.255753 IP tebow2.hpc.ufl.edu.12865 > tebow1.hpc.ufl.edu.41353: . ack 257 win 32736
12:00:30.255763 IP tebow2.hpc.ufl.edu.12865 > tebow1.hpc.ufl.edu.41353: R 1:1(0) ack 257 win 0

On the netserver host (tebow2), we see only the initial arp.
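
To put numbers on the timing question, the gaps between packets can be
computed straight from the tcpdump timestamps.  This is only a sketch:
the helper name tcpdump_gap_us is made up, and it assumes both stamps
come from the same day in the default HH:MM:SS.ffffff format.

```python
from datetime import datetime


def tcpdump_gap_us(earlier, later):
    """Microseconds between two tcpdump HH:MM:SS.ffffff stamps on the same day."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    # Integer arithmetic avoids float rounding in the microsecond count.
    return delta.days * 86_400_000_000 + delta.seconds * 1_000_000 + delta.microseconds


if __name__ == "__main__":
    first_syn = "12:00:27.180401"   # first SYN captured on tebow1
    data_push = "12:00:30.255717"   # 256-byte P segment from tebow1
    reset = "12:00:30.255763"       # RST from tebow2
    print("first SYN -> RST:", tcpdump_gap_us(first_syn, reset), "us")
    print("data push -> RST:", tcpdump_gap_us(data_push, reset), "us")
```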

>> The above call also emits a couple of messages
>> into the listener's syslog now:
>>
>> Jan  9 21:53:54 tebow2 kernel: iwch_ev_dispatch - CQE Err qpid 0x20 opcode 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
>> Jan  9 21:53:54 tebow2 kernel: post_qp_event - AE qpid 0x20 opcode 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
>>
> This is an async event generated due to a failure processing an SQ WR, I
> think.  Opcodes and status codes for iw_cxgb3 are in cxio_wr.h.
> Type 1 means it was an egress (SQ) failure,
> status 0x6 is a base/bounds violation,
> but opcode 14 seems incorrect.  That's not a valid T3 opcode. ????
> 

Ok, thanks!  I guess I'm not sure what to make of that yet, though.
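
For anyone decoding similar messages, the fields can be pulled out of
the syslog lines with a small Python sketch like the one below.  The
names parse_event and NOTES are made up for illustration, the field
layout is taken only from the two lines quoted above, and the notes
just restate Steve's reading of type 1 and status 0x6; the authoritative
values are in cxio_wr.h.

```python
import re

# Field layout as it appears in the two quoted syslog lines (an assumption
# about the driver's message format, not taken from the driver source).
_EVENT_RE = re.compile(
    r"qpid\s+(?P<qpid>0x[0-9a-fA-F]+)\s+opcode\s+(?P<opcode>\d+)\s+"
    r"status\s+(?P<status>0x[0-9a-fA-F]+)\s+type\s+(?P<type>\d+)\s+"
    r"wrid\.hi\s+(?P<wrid_hi>0x[0-9a-fA-F]+)\s+wrid\.lo\s+(?P<wrid_lo>0x[0-9a-fA-F]+)"
)

# Interpretations given in the reply above; not a complete table.
NOTES = {
    ("type", 1): "egress (SQ) failure",
    ("status", 0x6): "base/bounds violation",
}


def parse_event(line):
    """Return the numeric fields of a CQE Err / AE log line, or None if absent."""
    m = _EVENT_RE.search(line)
    if m is None:
        return None
    # int(text, 0) honours the 0x prefix; opcode and type are treated as decimal.
    return {name: int(text, 0) for name, text in m.groupdict().items()}


if __name__ == "__main__":
    line = ("Jan  9 21:53:54 tebow2 kernel: iwch_ev_dispatch - CQE Err qpid 0x20 "
            "opcode 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000")
    for name, value in parse_event(line).items():
        print(name, hex(value), NOTES.get((name, value), ""))
```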

Thanks,
Craig


