[ofa-general] SDP and iWARP

Steve Wise swise at opengridcomputing.com
Thu Jan 24 09:30:52 PST 2008


Are these recv buffers user memory or kernel memory?  I just submitted a 
fix for a bug in build_phys_page_list().  Perhaps you're hitting this? 
You would hit it if these buffers are allocated by the sdp kernel module 
and registered via ib_reg_phys_mr().

Also: if sdp is using ib_get_dma_mr() to access all of memory, then 
it won't work with the chelsio driver, which has a 4GB limit on MRs.  So 
cxgb3 creates dma_mrs that map only addresses 0..4GB-1.  This just 
doesn't work at all if the iommu maps bus addresses above 4GB.
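
A minimal sketch of that 4GB check, applied to the receive WRs quoted 
below.  It assumes (this is a reading of the host-byte-order dump, not 
something confirmed in this thread) that flit 3 carries the stag in its 
upper 32 bits and the length in its lower 32 bits, and that flit 4 
carries the buffer address.  The names rq_sge, decode_first_sge() and 
sge_fits_dma_mr() are made up for illustration and are not part of 
iw_cxgb3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the first receive SGE.  The field positions are
 * an assumption based on the dumped flits, not a driver definition. */
struct rq_sge {
	uint32_t stag;
	uint32_t len;
	uint64_t addr;
};

/* The dma_mr described above covers addresses 0..4GB-1 only. */
#define DMA_MR_WINDOW (1ULL << 32)

/* Decode flits 3 and 4 of a WQE dumped as 8B flits in host byte order. */
struct rq_sge decode_first_sge(const uint64_t *flits, size_t nflits)
{
	struct rq_sge sge;

	assert(flits != NULL && nflits >= 5);
	sge.stag = (uint32_t)(flits[3] >> 32);
	sge.len = (uint32_t)(flits[3] & 0xffffffffULL);
	sge.addr = flits[4];
	return sge;
}

/* True only if the whole buffer lies inside the 0..4GB-1 window. */
bool sge_fits_dma_mr(const struct rq_sge *sge)
{
	return sge->addr < DMA_MR_WINDOW &&
	       sge->len <= DMA_MR_WINDOW - sge->addr;
}
```

Under those assumptions, feeding in flits 3 and 4 of the first dumped 
WR (000002ff00000810 and 000000044eac3000) gives an address at or above 
4GB, so sge_fits_dma_mr() returns false for it.  Whether these buffers 
really go through a dma_mr is still the open question above.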

Steve.



Craig Prescott wrote:
> 
> Hi Felix;
> 
> Here are the last 4 WRs:
> 
> ...
> Entering iwch_post_receive
> iwch_post_receive: Dumping built work request before ring_doorbell:
> iwch_post_receive: WQE ffff810241d59e00: 17c001008000000d
> iwch_post_receive: WQE ffff810241d59e08: 0000000000000000
> iwch_post_receive: WQE ffff810241d59e10: 0000000000000001
> iwch_post_receive: WQE ffff810241d59e18: 000002ff00000810
> iwch_post_receive: WQE ffff810241d59e20: 000000044eac3000
> iwch_post_receive: WQE ffff810241d59e28: 0000000000000000
> iwch_post_receive: WQE ffff810241d59e30: 0000000000000000
> iwch_post_receive: WQE ffff810241d59e38: 0000000000000000
> iwch_post_receive: WQE ffff810241d59e40: 0000000000000000
> iwch_post_receive: WQE ffff810241d59e48: 0000000000000000
> iwch_post_receive: WQE ffff810241d59e50: 0000000000000000
> iwch_post_receive: WQE ffff810241d59e58: 0000000000000000
> iwch_post_receive: WQE ffff810241d59e60: 0000000000000000
> iwch_post_receive: returning 0
> Entering iwch_post_receive
> iwch_post_receive: Dumping built work request before ring_doorbell:
> iwch_post_receive: WQE ffff810241d59e80: 17c001008000000d
> iwch_post_receive: WQE ffff810241d59e88: 0000000000000000
> iwch_post_receive: WQE ffff810241d59e90: 0000000000000001
> iwch_post_receive: WQE ffff810241d59e98: 000002ff00000810
> iwch_post_receive: WQE ffff810241d59ea0: 000000044eac4000
> iwch_post_receive: WQE ffff810241d59ea8: 0000000000000000
> iwch_post_receive: WQE ffff810241d59eb0: 0000000000000000
> iwch_post_receive: WQE ffff810241d59eb8: 0000000000000000
> iwch_post_receive: WQE ffff810241d59ec0: 0000000000000000
> iwch_post_receive: WQE ffff810241d59ec8: 0000000000000000
> iwch_post_receive: WQE ffff810241d59ed0: 0000000000000000
> iwch_post_receive: WQE ffff810241d59ed8: 0000000000000000
> iwch_post_receive: WQE ffff810241d59ee0: 0000000000000000
> iwch_post_receive: returning 0
> Entering iwch_post_receive
> iwch_post_receive: Dumping built work request before ring_doorbell:
> iwch_post_receive: WQE ffff810241d59f00: 17c001008000000d
> iwch_post_receive: WQE ffff810241d59f08: 0000000000000000
> iwch_post_receive: WQE ffff810241d59f10: 0000000000000001
> iwch_post_receive: WQE ffff810241d59f18: 000002ff00000810
> iwch_post_receive: WQE ffff810241d59f20: 000000044eac5000
> iwch_post_receive: WQE ffff810241d59f28: 0000000000000000
> iwch_post_receive: WQE ffff810241d59f30: 0000000000000000
> iwch_post_receive: WQE ffff810241d59f38: 0000000000000000
> iwch_post_receive: WQE ffff810241d59f40: 0000000000000000
> iwch_post_receive: WQE ffff810241d59f48: 0000000000000000
> iwch_post_receive: WQE ffff810241d59f50: 0000000000000000
> iwch_post_receive: WQE ffff810241d59f58: 0000000000000000
> iwch_post_receive: WQE ffff810241d59f60: 0000000000000000
> iwch_post_receive: returning 0
> Entering iwch_post_receive
> iwch_post_receive: Dumping built work request before ring_doorbell:
> iwch_post_receive: WQE ffff810241d59f80: 17c001008000000d
> iwch_post_receive: WQE ffff810241d59f88: 0000000000000000
> iwch_post_receive: WQE ffff810241d59f90: 0000000000000001
> iwch_post_receive: WQE ffff810241d59f98: 000002ff00000810
> iwch_post_receive: WQE ffff810241d59fa0: 000000044eac6000
> iwch_post_receive: WQE ffff810241d59fa8: 0000000000000000
> iwch_post_receive: WQE ffff810241d59fb0: 0000000000000000
> iwch_post_receive: WQE ffff810241d59fb8: 0000000000000000
> iwch_post_receive: WQE ffff810241d59fc0: 0000000000000000
> iwch_post_receive: WQE ffff810241d59fc8: 0000000000000000
> iwch_post_receive: WQE ffff810241d59fd0: 0000000000000000
> iwch_post_receive: WQE ffff810241d59fd8: 0000000000000000
> iwch_post_receive: WQE ffff810241d59fe0: 0000000000000000
> iwch_post_receive: returning 0
> 
> Thanks,
> Craig
> 
> 
> Felix Marti wrote:
>> Hi Craig,
>>
>> Can you please dump not only the last, but the last 4 WRs?
>>
>> Thanks,
>> felix
>>
>>> -----Original Message-----
>>> From: general-bounces at lists.openfabrics.org [mailto:general-
>>> bounces at lists.openfabrics.org] On Behalf Of Craig Prescott
>>> Sent: Wednesday, January 23, 2008 8:05 AM
>>> To: Steve Wise
>>> Cc: general at lists.openfabrics.org
>>> Subject: Re: [ofa-general] SDP and iWARP
>>>
>>> Steve Wise wrote:
>>>> Craig Prescott wrote:
>>>>> Steve Wise wrote:
>>>>>> Craig Prescott wrote:
>>>>>>> Steve Wise wrote:
>>>>>>>> Craig Prescott wrote:
>>>>>>>>> The above call also emits a couple of messages
>>>>>>>>> into the listener's syslog now :
>>>>>>>>>
>>>>>>>>> Jan  9 21:53:54 tebow2 kernel: iwch_ev_dispatch - CQE Err qpid
>>>>>>>>> 0x20 opcode 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
>>>>>>>>> Jan  9 21:53:54 tebow2 kernel: post_qp_event - AE qpid 0x20
>>> opcode
>>>>>>>>> 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
>>>>>>>>>
>>>>>>>> This is an async event generated due to a failure processing a
>> SQ
>>>>>>>> WR, I think. opcodes and status codes for iw_cxgb3 are in
>>> cxio_wr.h.
>>>>>>>> type 1 means it was an egress (SQ) failure
>>>>>>>> status 0x6 is a base/bounds violation,
>>>>>>>> but 14 seems incorrect.  That's not a valid T3 opcode. ????
>>>>>>>>
>>>>>>> Ok, thanks!  I guess I'm not sure what to make of that yet,
>>> though.
>>>>>> See where in iwch_accept_cr() the failure is happening.  It
>> doesn't
>>>>>> look like send_mpa_reply() is being called.
>>>>>>
>>>>> The ECONNRESET is coming from here in iwch_accept_cr():
>>>>>
>>>>> ...
>>>>>         /* wait for wr_ack */
>>>>>         wait_event(ep->com.waitq, ep->com.rpl_done);
>>>>>         err = ep->com.rpl_err;
>>>>> ...
>>>>>
>>>>> Is that what you thought was happening?
>>>> I don't know exactly what is going on!  But the code above means
>> that
>>>> the firmware never successfully sent the last streaming message (the
>>>> mpa-start reply) and never transitioned the connection into rdma
>>> mode.
>>>> And the async error might indicate that some WR was posted prior to
>>>> doing the rdma_accept() and that WR had problems.
>>> Ok.  I'm sorry for such a slow response.
>>>
>>>> a few questions:
>>>>
>>>> What firmware are you running?  ethtool -i will tell you.
>>> [root at tebow1 ~]# ethtool -i eth4
>>> driver: cxgb3
>>> version: 1.0-ko
>>> firmware-version: T 5.0.0 TP 1.1.0
>>> bus-info: 0000:86:00.0
>>>
>>>> What ofed version exactly?
>>> OFED 1.3 daily from a few weeks back now: OFED-1.3-20080107-0942
>>>
>>>> Does sdp post a SQ or RQ WR prior to doing the rdma_accept()?  Can
>>> you
>>>> dump that work request?  Maybe in iwch_post_send and iwch_post_recv,
>>>> dump the work request after it is built and before the code rings
>> the
>>>> doorbell.  You can dump it as 8B flits, and be sure to put the flits in
>>>> host byte order.  See cxio_dump_wqe() in cxio_dbg.c...
>>> The following is the last work request seen before rdma_accept():
>>>
>>> iwch_post_receive: Dumping built work request before ring_doorbell:
>>> iwch_post_receive: WQE ffff810241d59f80: 17c001008000000d
>>> iwch_post_receive: WQE ffff810241d59f88: 0000000000000000
>>> iwch_post_receive: WQE ffff810241d59f90: 0000000000000001
>>> iwch_post_receive: WQE ffff810241d59f98: 000002ff00000810
>>> iwch_post_receive: WQE ffff810241d59fa0: 000000044eac6000
>>> iwch_post_receive: WQE ffff810241d59fa8: 0000000000000000
>>> iwch_post_receive: WQE ffff810241d59fb0: 0000000000000000
>>> iwch_post_receive: WQE ffff810241d59fb8: 0000000000000000
>>> iwch_post_receive: WQE ffff810241d59fc0: 0000000000000000
>>> iwch_post_receive: WQE ffff810241d59fc8: 0000000000000000
>>> iwch_post_receive: WQE ffff810241d59fd0: 0000000000000000
>>> iwch_post_receive: WQE ffff810241d59fd8: 0000000000000000
>>> iwch_post_receive: WQE ffff810241d59fe0: 0000000000000000
>>> iwch_post_receive: returning 0
>>>
>>> This comes from sdp_init_qp(), via sdp_connect_handler().
>>> There are a total of 64 work requests (all from
>>> iwch_post_receive()) generated while the netserver is
>>> trying to handle the RDMA_CM_EVENT_CONNECT_REQUEST.
>>>
>>> Can you help me decode the above work request?
>>>
>>> Thanks,
>>> Craig



