[openib-general] Re: uDAPL again

Wed Nov 2 12:02:57 PST 2005

Aniruddha Bohra wrote:

> cq_object_wait: RET evd 0x8083ca0 ibv_cq 0x8083da0 ibv_ctx (nil) 
> Success
>         >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<
>         dapl_evd_dto_callback : CQE 
>                 work_req_id 134771572
>                 status 12
>         >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<
> DTO completion ERROR: 12: op 0xff
> disconnect(ep 0x8087110, conn 0x808a008, id 134774528 flags 0)
> destroy_cm_id: conn 0x808a008 id 134774528
> dapli_evd_post_event: Called with event # 4006
>
>
> Any ideas how to proceed to even debug this ?

Are you using the uDAPL provider with socket CM (VERBS=openib_scm) or 
the default one that uses uCM and uAT?  For the socket CM version the 
timeout is set to 14 (~67ms) and the retries are set to 7, so the 
receiving node would have to be delayed beyond ~469ms to get this 
failure. For the default uCM/uAT version the retries are set to 7 and 
the timeout is set to pktlifetime+1, so you would have to look at the 
path record for the timeout value for the connection.
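
The numbers above follow from the InfiniBand local ACK timeout encoding, 
where a timeout value t means 4.096 us * 2^t. A minimal sketch of that 
arithmetic is below; the function names are made up for illustration, 
and it multiplies by the retry count only, matching the ~469ms figure 
above (whether the first attempt adds one more interval is left as an 
assumption).

```python
# Sketch of the ACK timeout arithmetic; function names are hypothetical.

def ack_timeout_ms(timeout_exp: int) -> float:
    """Per-attempt local ACK timeout in ms: 4.096 us * 2^timeout_exp."""
    return 4.096e-3 * (2 ** timeout_exp)


def retry_window_ms(timeout_exp: int, retries: int) -> float:
    """Total time covered by `retries` timeouts (initial attempt not counted)."""
    return ack_timeout_ms(timeout_exp) * retries


per_try = ack_timeout_ms(14)      # socket CM setting: timeout = 14
total = retry_window_ms(14, 7)    # socket CM setting: retries = 7
print(f"{per_try:.1f} ms per attempt, {total:.1f} ms over 7 retries")
```

For the uCM/uAT case the same function applies with timeout_exp set to 
the path record's packet lifetime plus one.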

Can you successfully run the IB verbs ibv_rc_pingpong test?  
Is there anything special about your fabric configuration that could 
induce this kind of latency? Something on the fabric or in your remote 
system is delaying ACKs beyond your total timeout/retry time.

If you had no buffers posted or attempted to send to unregistered memory 
you would get different errors.
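
As a rough illustration, assuming the "status 12" in the log is the raw 
ibv_wc status from libibverbs, the value can be decoded with a small 
lookup. The table below copies only a few entries of enum ibv_wc_status; 
the helper name is hypothetical.

```python
# Partial copy of libibverbs' enum ibv_wc_status (numeric value -> name).
# Assumption: the uDAPL debug output prints this value unmodified.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    4: "IBV_WC_LOC_PROT_ERR",        # local protection (memory key) error
    10: "IBV_WC_REM_ACCESS_ERR",     # remote access error
    12: "IBV_WC_RETRY_EXC_ERR",      # transport retry count exceeded
    13: "IBV_WC_RNR_RETRY_EXC_ERR",  # receiver-not-ready retries exceeded
}


def decode_wc_status(status: int) -> str:
    """Return the enum name for a status value, or a placeholder if unlisted."""
    return IBV_WC_STATUS.get(status, f"unlisted status {status}")


print(decode_wc_status(12))
```

Under that assumption, 12 is the transport retry error, while a missing 
receive buffer would more likely surface as the RNR variant (13).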

-arlin

>
> Thanks
> Aniruddha
>