[openib-general] Re: uDAPL again
Aniruddha Bohra
bohra at cs.rutgers.edu
Wed Nov 2 12:44:22 PST 2005
Arlin Davis wrote:
> Aniruddha Bohra wrote:
>
>> cq_object_wait: RET evd 0x8083ca0 ibv_cq 0x8083da0 ibv_ctx (nil)
>> Success
>> >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<
>> dapl_evd_dto_callback : CQE
>> work_req_id 134771572
>> status 12
>> >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<
>> DTO completion ERROR: 12: op 0xff
>> disconnect(ep 0x8087110, conn 0x808a008, id 134774528 flags 0)
>> destroy_cm_id: conn 0x808a008 id 134774528
>> dapli_evd_post_event: Called with event # 4006
>>
>>
>> Any ideas on how to even begin debugging this?
>
>
>
> Are you using the uDAPL provider with socket CM (VERBS=openib_scm) or
> the default one that uses uCM and uAT? For the socket_CM version the
> timeout is set to 14 (~67ms) and the retries are set to 7 so the
> receiving node would have to be delayed beyond ~469ms to get this
> failure. For the default uCM/uAT version the retries are set to 7 and
> the timeout is set to pktlifetime+1 so you would have to look at the
> path-record for the timeout value for the connection.
>
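For reference, the arithmetic behind the ~67ms and ~469ms figures above can be sketched in C. This assumes the usual InfiniBand encoding of the local ACK timeout as 4.096 us * 2^exponent, and it follows the estimate in the quoted text (timeout multiplied by the retry count). The function names are hypothetical and are not part of uDAPL or libibverbs.

```c
#include <assert.h>

/* Hypothetical helper: convert an InfiniBand local ACK timeout exponent
 * into milliseconds, assuming the encoding 4.096 us * 2^exp.
 * The field is 5 bits wide; 0 conventionally means "no timeout",
 * so it is excluded here. */
double ack_timeout_ms(unsigned exp)
{
    assert(exp >= 1 && exp <= 31);
    return 4.096e-3 * (double)(1ULL << exp);
}

/* Hypothetical helper: rough total delay before the sender gives up,
 * using the same estimate as the quoted message (timeout * retries). */
double retry_budget_ms(unsigned exp, unsigned retries)
{
    return ack_timeout_ms(exp) * (double)retries;
}
```

With the socket CM values, ack_timeout_ms(14) is about 67.1 ms and retry_budget_ms(14, 7) is about 469.8 ms. For the uCM/uAT provider the exponent would be the path record's packet lifetime plus one, so the same helpers apply once that value is read from the path record.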
I am using the default one. Actually, even the dapl_ep_connect() takes a
long time.
I am not sure, but aren't uCM and uAT simply for connection establishment?
> Can you successfully run the IB verbs ibv_rc_pingpong test suite?
Between the two OpenIB nodes, I can run the ibv_rc_pingpong.
> Anything special about your fabric configuration that could induce
> this kind of latency? Something on the fabric or in your remote
> system is delaying ACKs beyond your total timeout/retry times.
It has 3 machines on the switch: one is a NetApp filer, which might be
the source of the problem :(
>
> If you had no buffers posted or attempted to send to unregistered
> memory you would get different errors.
This is good, as it seems my code is trying to DTRT :)
Thanks
Aniruddha