[ofw] RE: Disconnection problem and AL reference

James Yang jyang at xsigo.com
Tue Oct 21 13:52:24 PDT 2008


Hi Ishai,

I will try to open a bug for it. We did try to port some of the changes from WinOF 2.0, but they didn't help.

Thanks,
James

-----Original Message-----
From: Ishai Rabinovitz [mailto:ishai at mellanox.co.il] 
Sent: Sunday, October 19, 2008 4:32 AM
To: James Yang; Fab Tillier; ofw at lists.openfabrics.org
Subject: RE: [ofw] RE: Disconnection problem and AL reference

James:
Can you please open a bug about it in Bugzilla
(https://bugs.openfabrics.org/) - this way we will have all the data in
one place.
Do you see the problem also with the latest RC of WinOF 2.0?

Thanks
Ishai

> -----Original Message-----
> From: ofw-bounces at lists.openfabrics.org 
> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of James Yang
> Sent: Thursday, October 16, 2008 12:02 AM
> To: Fab Tillier; ofw at lists.openfabrics.org
> Subject: [ofw] RE: Disconnection problem and AL reference
> 
> 
> Hi Fab,
> 
> QP ref is 1 before calling cm_dreq(). 
> 
> ----here is the call---
> disconnectRequest.h_qp = qpHandle;
> disconnectRequest.flags = 0;
> disconnectRequest.p_dreq_pdata = NULL;
> disconnectRequest.dreq_length = 0;
> disconnectRequest.qp_type = IB_QPT_RELIABLE_CONN; 
> disconnectRequest.pfn_cm_drep_cb = IBConnectionDisconnectCallback;
> 
> cm_dreq(&disconnectRequest);
> ----------------------------
> 
> During the dreq() call, p_cep->state in al_cep_dreq() is 
> CEP_STATE_TIMEWAIT, so the QP is not touched. Maybe that is 
> the reason the callback never gets called.
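
As a rough illustration of why a CEP that is already in CEP_STATE_TIMEWAIT could leave the disconnect callback uncalled, here is a minimal sketch in C. Every name in it (mock_cep_t, mock_cep_dreq, MOCK_INVALID_STATE, and so on) is a hypothetical stand-in. The assumption that a disconnect request is rejected outright in any state other than established is a simplification for the sketch, not a copy of al_cep_dreq().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are NOT the real AL types. */
typedef enum {
    MOCK_STATE_IDLE,
    MOCK_STATE_ESTABLISHED,
    MOCK_STATE_DREQ_SENT,
    MOCK_STATE_TIMEWAIT
} mock_cep_state_t;

typedef enum { MOCK_SUCCESS, MOCK_INVALID_STATE } mock_status_t;

typedef void (*mock_drep_cb_t)(void *context);

typedef struct {
    mock_cep_state_t state;
    mock_drep_cb_t   pfn_drep_cb;
    void            *context;
} mock_cep_t;

/* Assumption: only an established connection may start a disconnect.
 * In any other state nothing is sent and no callback is registered, so
 * neither a reply nor a timeout can invoke the callback later. */
static mock_status_t mock_cep_dreq(mock_cep_t *cep, mock_drep_cb_t cb, void *ctx)
{
    assert(cep != NULL);
    if (cep->state != MOCK_STATE_ESTABLISHED)
        return MOCK_INVALID_STATE;

    cep->pfn_drep_cb = cb;
    cep->context = ctx;
    cep->state = MOCK_STATE_DREQ_SENT;
    return MOCK_SUCCESS;
}

/* Models the peer's reply arriving, or the request timing out. */
static void mock_cep_complete_dreq(mock_cep_t *cep)
{
    if (cep->state == MOCK_STATE_DREQ_SENT) {
        cep->state = MOCK_STATE_TIMEWAIT;
        if (cep->pfn_drep_cb != NULL)
            cep->pfn_drep_cb(cep->context);
    }
}

/* Callback that counts how many times it was invoked. */
static void count_cb(void *context)
{
    ++*(int *)context;
}

/* Runs one disconnect attempt from the given starting state and
 * returns the number of callback invocations that resulted. */
static int run_disconnect(mock_cep_state_t initial_state, mock_status_t *status_out)
{
    int calls = 0;
    mock_cep_t cep = { initial_state, NULL, NULL };

    *status_out = mock_cep_dreq(&cep, count_cb, &calls);
    mock_cep_complete_dreq(&cep);
    return calls;
}
```

In this model, starting from MOCK_STATE_TIMEWAIT yields a rejected request and zero callbacks, while starting from MOCK_STATE_ESTABLISHED yields one callback. If the real cm_dreq() reports a status, checking it at the call site would show whether the request was ever accepted.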
> 
> But why did we end up in CEP_STATE_TIMEWAIT? Did some previous 
> command cause the problem? Here is the structure; does any of the data look suspicious?
> 
> ibbus_fffffadf24334000!_al_kcep
>    +0x000 cid              : 1
>    +0x008 context          : 0xfffffadf`3682aa10 
>    +0x010 p_cid            : 0xfffffadf`37d69030 _cep_cid
>    +0x018 sid              : 0x1971302`00000000
>    +0x020 port_guid        : 0
>    +0x028 p_cmp_buf        : (null) 
>    +0x030 cmp_offset       : 0 ''
>    +0x031 cmp_len          : 0 ''
>    +0x034 p2p              : 0
>    +0x038 al_item          : _cl_list_item
>    +0x050 signalled        : 1
>    +0x058 pfn_destroy_cb   : (null) 
>    +0x060 p_mad_head       : (null) 
>    +0x068 p_mad_tail       : 0xfffffadf`364ec558 _ib_mad_element
>    +0x070 pfn_cb           : 0xfffffadf`243ad050     void  ibbus_fffffadf24334000!__cm_handler+0
>    +0x078 p_irp            : (null) 
>    +0x080 listen_item      : _cl_rbmap_item
>    +0x0a8 rem_id_item      : _cl_rbmap_item
>    +0x0d0 rem_qp_item      : _cl_rbmap_item
>    +0x0f8 local_comm_id    : 0x2000001
>    +0x0fc remote_comm_id   : 0x831d3e8b
>    +0x100 local_ca_guid    : 0xdc8c0200`03c90200
>    +0x108 remote_ca_guid   : 0xe0000001`02971300
>    +0x110 remote_qpn       : 0xa04b700
>    +0x114 sq_psn           : 0xa04b700
>    +0x118 rq_psn           : 0x48000000
>    +0x11c resp_res         : 0x4 ''
>    +0x11d init_depth       : 0x4 ''
>    +0x11e rnr_nak_timeout  : 0x8 ''
>    +0x120 local_qpn        : 0x48000000
>    +0x124 pkey             : 0xffff
>    +0x126 req_init_depth   : 0 ''
>    +0x128 av               : [2] _al_kcep_av
>    +0x1b8 idx_primary      : 0 ''
>    +0x1c0 alt_av           : _al_kcep_av
>    +0x208 alt_2pkt_life    : 0 ''
>    +0x209 max_2pkt_life    : 0x13 ''
>    +0x20a target_ack_delay : 0x14 ''
>    +0x20b local_ack_delay  : 0xf ''
>    +0x20c state            : 3 ( CEP_STATE_TIMEWAIT )
>    +0x210 was_active       : 1
>    +0x218 h_mad_svc        : 0xfffffadf`36845730 _al_mad_svc
>    +0x220 p_send_mad       : (null) 
>    +0x228 ref_cnt          : 1
>    +0x230 tid              : 0x8b3e1d83`06000000
>    +0x238 max_cm_retries   : 0x3 ''
>    +0x23c retry_timeout    : 0x1920
>    +0x240 timewait_timer   : _KTIMER
>    +0x280 timewait_time    : _LARGE_INTEGER 0xffffffff`fc28f600
>    +0x288 timewait_item    : _cl_list_item
>    +0x2a0 p_mad            : (null) 
>    +0x2a8 mads             : _mads
>    +0x3a8 irp_que          : _LIST_ENTRY [ 0xfffffadf`3682a928 - 0xfffffadf`3682a928 ]
>    +0x3b8 psize            : 0 ''
>    +0x3b9 pdata            : [196]  ""
> 
> Thanks,
> James
> 
> 
> -----Original Message-----
> From: Fab Tillier [mailto:ftillier at windows.microsoft.com]
> Sent: Wednesday, October 15, 2008 12:41 PM
> To: James Yang; ofw at lists.openfabrics.org
> Subject: RE: Disconnection problem and AL reference
> 
> Hi James,
> 
> > After calling cm_dreq(), my callback for it is not called and the
> > work items never get called either. After cm_dreq() times out, I
> > call destroy_qp(), which returns success. But the reference count
> > of the QP is always 1.
> 
> When cm_dreq times out, you should get a DREP notification, 
> and the QP should be in the error state.
> 
> Check the reference count on the QP before you call cm_dreq.  
> If you can, also check the reference count on the CEP for 
> your QP.  When the cm_dreq times out (why is it timing out, 
> did the other side not reply?) again check the CEP reference 
> count.  The timeout is processed in the __cep_mad_send_cb 
> function in al_cm_cep.c.  Then walk the code to make sure the 
> DREP callback is invoked or, if it isn't, find out why.  The CEP 
> takes a reference on the QP; make sure that reference gets released 
> when the QP is destroyed (QP destruction should destroy the CEP in 
> the destroying callback for the QP object).
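
The reference-counting relationship described above can be sketched roughly as follows. This is a toy model under stated assumptions: sketch_qp_t, sketch_cep_t and all the functions are hypothetical names invented for illustration, and the real AL object model is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the QP and CEP objects; NOT the real AL types. */
typedef struct {
    int ref_cnt;   /* outstanding references, including the owner's */
    int freed;     /* set once the last reference has been dropped */
} sketch_qp_t;

typedef struct {
    sketch_qp_t *qp;   /* QP this CEP holds a reference on, or NULL */
} sketch_cep_t;

static void sketch_qp_ref(sketch_qp_t *qp)
{
    qp->ref_cnt++;
}

static void sketch_qp_deref(sketch_qp_t *qp)
{
    assert(qp->ref_cnt > 0);
    if (--qp->ref_cnt == 0)
        qp->freed = 1;
}

/* The CEP takes a reference on the QP when the two are associated. */
static void sketch_cep_bind(sketch_cep_t *cep, sketch_qp_t *qp)
{
    sketch_qp_ref(qp);
    cep->qp = qp;
}

/* Destroying the CEP gives that reference back. */
static void sketch_cep_destroy(sketch_cep_t *cep)
{
    if (cep->qp != NULL) {
        sketch_qp_deref(cep->qp);
        cep->qp = NULL;
    }
}

/* Models QP destruction. destroy_cep_in_cb selects whether the
 * "destroying" step also tears down the CEP. Returns the reference
 * count left on the QP afterwards. */
static int sketch_destroy_qp(sketch_qp_t *qp, sketch_cep_t *cep, int destroy_cep_in_cb)
{
    if (destroy_cep_in_cb)
        sketch_cep_destroy(cep);
    sketch_qp_deref(qp);   /* drop the owner's reference */
    return qp->ref_cnt;
}
```

If the CEP is torn down during destruction, the count reaches 0 and the QP is freed. If it is skipped, the count stays at 1, which matches the symptom reported in this thread, although the sketch cannot say whether that is the actual cause.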
> 
> -Fab
> 
> 
