[ofa-general] SDP and iWARP
Craig Prescott
prescott at hpc.ufl.edu
Thu Jan 10 09:47:54 PST 2008
Steve Wise wrote:
>
> Craig Prescott wrote:
>>
>> I patched cma_connect_iw() to create the SDP header as
>> cma_connect_ib() does. This gets us farther - examining the
>> SDP header on the listener side looks right now, and the
>> listener at least enters rdma_accept(), but iw_cm_accept()
>> fails due to cm_id->device->iwcm->accept(cm_id, iw_param)
>> returning -104.
> 104 == ECONNRESET, so the client side must have reset the connection.
> Did this happen after 10 seconds? (there's a 10 second MPA negotiation
> timeout in the chelsio cm). Also, a wire trace might be useful. If
> this reset happens immediately, then you should look on the client side
> and see why it reset the connection.
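
For reference, a negative kernel-style return value such as -104 can be mapped back to its symbolic name from userspace. This is only a minimal sketch, assuming a Linux host (where ECONNRESET is 104) and Python's standard errno module; the helper name describe_kernel_ret is hypothetical and not part of any driver or library discussed here.

```python
import errno
import os


def describe_kernel_ret(ret):
    """Map a negative kernel-style return value (e.g. -104) to (name, message).

    Hypothetical helper for illustration only. errno numbering is
    platform specific; 104 == ECONNRESET holds on Linux.
    """
    code = -ret
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # The value returned by the iwcm accept() call in the report above.
    print(describe_kernel_ret(-104))
```
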
The reset happens after 10 seconds.
Here is tcpdump output from the netperf client host (tebow1):
12:00:17.156120 arp who-has tebow2.hpc.ufl.edu tell tebow1.hpc.ufl.edu
12:00:17.156178 arp reply tebow2.hpc.ufl.edu is-at 00:07:43:05:11:8a
(oui Unknown)
12:00:27.180401 IP tebow1.hpc.ufl.edu.41353 > tebow2.hpc.ufl.edu.12865:
S 697245480:697245480(0) win 17920 <mss 8960,nop,wscale 9>
12:00:30.180571 IP tebow1.hpc.ufl.edu.41353 > tebow2.hpc.ufl.edu.12865:
S 697245480:697245480(0) win 17920 <mss 8960,nop,wscale 9>
12:00:30.180616 IP tebow2.hpc.ufl.edu.12865 > tebow1.hpc.ufl.edu.41353:
S 1878582380:1878582380(0) ack 697245481 win 65535 <mss 8960,nop,wscale 3>
12:00:30.180630 IP tebow1.hpc.ufl.edu.41353 > tebow2.hpc.ufl.edu.12865:
. ack 1 win 35
12:00:30.255717 IP tebow1.hpc.ufl.edu.41353 > tebow2.hpc.ufl.edu.12865:
P 1:257(256) ack 1 win 35
12:00:30.255753 IP tebow2.hpc.ufl.edu.12865 > tebow1.hpc.ufl.edu.41353:
. ack 257 win 32736
12:00:30.255763 IP tebow2.hpc.ufl.edu.12865 > tebow1.hpc.ufl.edu.41353:
R 1:1(0) ack 257 win 0
On the netserver host (tebow2), we see only the initial arp.
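
To make the timing in the trace easier to read, the gaps between consecutive packets can be computed from the tcpdump timestamps. A rough sketch, assuming the timestamps are copied by hand from the output above; the event labels and function names are my own shorthand, not tcpdump terminology.

```python
from datetime import datetime

# Timestamps copied from the tcpdump output above (client host tebow1).
# The labels are informal descriptions, not tcpdump output.
EVENTS = [
    ("arp who-has", "12:00:17.156120"),
    ("first SYN", "12:00:27.180401"),
    ("retransmitted SYN", "12:00:30.180571"),
    ("SYN-ACK", "12:00:30.180616"),
    ("256-byte push", "12:00:30.255717"),
    ("RST from tebow2", "12:00:30.255763"),
]


def parse_ts(text):
    """Parse a tcpdump HH:MM:SS.ffffff timestamp (date is irrelevant here)."""
    return datetime.strptime(text, "%H:%M:%S.%f")


def deltas(events):
    """Return (label, seconds since the previous event) for each event after the first."""
    out = []
    for (_, prev), (label, cur) in zip(events, events[1:]):
        out.append((label, (parse_ts(cur) - parse_ts(prev)).total_seconds()))
    return out


if __name__ == "__main__":
    for label, gap in deltas(EVENTS):
        print(f"{label:>20}: +{gap:.6f} s")
```
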
>> The above call also emits a couple of messages
>> into the listener's syslog now:
>>
>> Jan 9 21:53:54 tebow2 kernel: iwch_ev_dispatch - CQE Err qpid 0x20
>> opcode 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
>> Jan 9 21:53:54 tebow2 kernel: post_qp_event - AE qpid 0x20 opcode 14
>> status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
>>
> This is an async event generated due to a failure processing a SQ WR, I
> think. opcodes and status codes for iw_cxgb3 are in cxio_wr.h.
> type 1 means it was an egress (SQ) failure
> status 0x6 is a base/bounds violation,
> but 14 seems incorrect. That's not a valid T3 opcode. ????
>
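
The fields in those syslog lines can be pulled apart mechanically before looking them up in cxio_wr.h. A small sketch, assuming the message format is exactly as quoted above; the NOTES table contains only the two interpretations given in the reply (type 1, status 0x6) and does not reproduce the real tables from cxio_wr.h. Whether the driver prints the opcode in decimal is also an assumption, so the value is parsed as written.

```python
import re

# The first syslog message from above, joined back onto one line.
LINE = ("iwch_ev_dispatch - CQE Err qpid 0x20 opcode 14 status 0x6 "
        "type 1 wrid.hi 0x0 wrid.lo 0x80000000")

# Each field is a name followed by a hex (0x...) or plain decimal number.
FIELD_RE = re.compile(
    r"(qpid|opcode|status|type|wrid\.hi|wrid\.lo)\s+(0x[0-9a-fA-F]+|\d+)"
)

# Only the interpretations stated in the reply above; everything else
# would need to be looked up in cxio_wr.h.
NOTES = {
    ("type", 1): "egress (SQ) failure",
    ("status", 0x6): "base/bounds violation",
}


def parse_cqe_line(line):
    """Return a dict of field name -> integer value, parsed as written."""
    return {key: int(raw, 0) for key, raw in FIELD_RE.findall(line)}


if __name__ == "__main__":
    for key, value in parse_cqe_line(LINE).items():
        note = NOTES.get((key, value), "")
        print(f"{key:8} = {value:#x} {note}")
```
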
Ok, thanks! I guess I'm not sure what to make of that yet, though.
Thanks,
Craig