[ofa-general] SDP and iWARP
Craig Prescott
prescott at hpc.ufl.edu
Wed Jan 23 08:05:00 PST 2008
Steve Wise wrote:
> Craig Prescott wrote:
>> Steve Wise wrote:
>>> Craig Prescott wrote:
>>>> Steve Wise wrote:
>>>>>
>>>>> Craig Prescott wrote:
>>>>>>
>>>>>> The above call also emits a couple of messages
>>>>>> into the listener's syslog now:
>>>>>>
>>>>>> Jan 9 21:53:54 tebow2 kernel: iwch_ev_dispatch - CQE Err qpid
>>>>>> 0x20 opcode 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
>>>>>> Jan 9 21:53:54 tebow2 kernel: post_qp_event - AE qpid 0x20 opcode
>>>>>> 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
>>>>>>
>>>>> This is an async event generated due to a failure processing a SQ
>>>>> WR, I think. opcodes and status codes for iw_cxgb3 are in cxio_wr.h.
>>>>> type 1 means it was an egress (SQ) failure
>>>>> status 0x6 is a base/bounds violation,
>>>>> but 14 seems incorrect. That's not a valid T3 opcode. ????
>>>>>
>>>>
>>>> Ok, thanks! I guess I'm not sure what to make of that yet, though.
>>>>
>>>
>>> See where in iwch_accept_cr() the failure is happening. It doesn't
>>> look like send_mpa_reply() is being called.
>>>
>>
>> The ECONNRESET is coming from here in iwch_accept_cr():
>>
>> ...
>> /* wait for wr_ack */
>> wait_event(ep->com.waitq, ep->com.rpl_done);
>> err = ep->com.rpl_err;
>> ...
>>
>> Is that what you thought was happening?
>
> I don't know exactly what is going on! But the code above means that
> the firmware never successfully sent the last streaming message (the
> mpa-start reply) and never transitioned the connection into rdma mode.
> And the async error might indicate that some WR was posted prior to
> doing the rdma_accept() and that WR had problems.
Ok. I'm sorry for such a slow response.
> a few questions:
>
> What firmware are you running? ethtool -i will tell you.
[root@tebow1 ~]# ethtool -i eth4
driver: cxgb3
version: 1.0-ko
firmware-version: T 5.0.0 TP 1.1.0
bus-info: 0000:86:00.0
> What ofed version exactly?
OFED 1.3 daily from a few weeks back now: OFED-1.3-20080107-0942
> Does sdp post a SQ or RQ WR prior to doing the rdma_accept()? Can you
> dump that work request? Maybe in iwch_post_send and iwch_post_recv,
> dump the work request after it is built and before the code rings the
> doorbell. You can dump it as 8B flits, and be sure to put the flits in
> host byte order. See cxio_dump_wqe() in cxio_dbg.c...
The following is the last work request seen before rdma_accept():
iwch_post_receive: Dumping built work request before ring_doorbell:
iwch_post_receive: WQE ffff810241d59f80: 17c001008000000d
iwch_post_receive: WQE ffff810241d59f88: 0000000000000000
iwch_post_receive: WQE ffff810241d59f90: 0000000000000001
iwch_post_receive: WQE ffff810241d59f98: 000002ff00000810
iwch_post_receive: WQE ffff810241d59fa0: 000000044eac6000
iwch_post_receive: WQE ffff810241d59fa8: 0000000000000000
iwch_post_receive: WQE ffff810241d59fb0: 0000000000000000
iwch_post_receive: WQE ffff810241d59fb8: 0000000000000000
iwch_post_receive: WQE ffff810241d59fc0: 0000000000000000
iwch_post_receive: WQE ffff810241d59fc8: 0000000000000000
iwch_post_receive: WQE ffff810241d59fd0: 0000000000000000
iwch_post_receive: WQE ffff810241d59fd8: 0000000000000000
iwch_post_receive: WQE ffff810241d59fe0: 0000000000000000
iwch_post_receive: returning 0
This comes from sdp_init_qp(), via sdp_connect_handler().
There are a total of 64 work requests (all from
iwch_post_receive()) generated while the netserver is
trying to handle the RDMA_CM_EVENT_CONNECT_REQUEST.
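
For anyone who wants to produce a dump of the same shape, the loop
is roughly as sketched below. This is a userspace illustration, not
the actual debug patch: flit_to_host() and dump_flits() are
hypothetical names, and it assumes the WQE is stored in memory as
big-endian 8-byte flits (which is what the request to convert to
host byte order suggests).

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Read one 8-byte flit stored big-endian (assumed wire/adapter order)
 * and return it as a host-order 64-bit value. Assembling the value
 * byte by byte works on both little- and big-endian hosts.
 */
static uint64_t flit_to_host(const unsigned char *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Print nflits consecutive flits of a work request, one per line,
 * in the same "tag: WQE <address>: <16 hex digits>" form as the
 * dump above. "tag" is just a label such as the calling function.
 */
static void dump_flits(const char *tag, const void *wqe, size_t nflits)
{
    const unsigned char *p = wqe;

    for (size_t i = 0; i < nflits; i++)
        printf("%s: WQE %p: %016" PRIx64 "\n",
               tag, (const void *)(p + 8 * i),
               flit_to_host(p + 8 * i));
}
```

Calling dump_flits("iwch_post_receive", wqe, 13) on a built work
request would print 13 lines like the ones shown. In the kernel the
equivalent conversion would be be64_to_cpu() on each flit.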
Can you help me decode the above work request?

Thanks,
Craig