[ofw] RE: Disconnection problem and AL reference
James Yang
jyang at xsigo.com
Wed Oct 15 15:02:11 PDT 2008
Hi Fab,
The QP reference count is 1 before calling cm_dreq().
----here is the call---
disconnectRequest.h_qp = qpHandle;
disconnectRequest.flags = 0;
disconnectRequest.p_dreq_pdata = NULL;
disconnectRequest.dreq_length = 0;
disconnectRequest.qp_type = IB_QPT_RELIABLE_CONN;
disconnectRequest.pfn_cm_drep_cb = IBConnectionDisconnectCallback;
cm_dreq(&disconnectRequest);
----------------------------
During the cm_dreq() call, p_cep->state is already CEP_STATE_TIMEWAIT when al_cep_dreq() runs, so the QP is not touched. That may be the reason the callback never gets called.
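To make the suspicion concrete, here is a minimal sketch of how a state check could swallow a disconnect request. It is a simplified model, not the actual al_cep_dreq() code: the sketch_* names, the three states and the drep_cb_pending flag are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical, simplified model; NOT the real al_cep_dreq() logic. */
typedef enum {
    SKETCH_ESTABLISHED,
    SKETCH_DREQ_SENT,
    SKETCH_TIMEWAIT
} sketch_state_t;

typedef struct {
    sketch_state_t state;
    bool drep_cb_pending;   /* set once a DREQ is outstanding */
} sketch_cep_t;

/* Returns true if a DREQ was issued. In any state other than
 * ESTABLISHED nothing is sent and no DREP callback is armed. */
static bool sketch_dreq(sketch_cep_t *cep)
{
    if (cep->state != SKETCH_ESTABLISHED)
        return false;       /* e.g. TIMEWAIT: the request is dropped */

    cep->state = SKETCH_DREQ_SENT;
    cep->drep_cb_pending = true;
    return true;
}
```

With a CEP that starts in SKETCH_TIMEWAIT, sketch_dreq() returns false and drep_cb_pending stays false, which would match a callback that never fires.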
But why is the CEP in CEP_STATE_TIMEWAIT? Did some previous command cause the problem? Here is the structure; does any of the data look suspicious?
ibbus_fffffadf24334000!_al_kcep
+0x000 cid : 1
+0x008 context : 0xfffffadf`3682aa10
+0x010 p_cid : 0xfffffadf`37d69030 _cep_cid
+0x018 sid : 0x1971302`00000000
+0x020 port_guid : 0
+0x028 p_cmp_buf : (null)
+0x030 cmp_offset : 0 ''
+0x031 cmp_len : 0 ''
+0x034 p2p : 0
+0x038 al_item : _cl_list_item
+0x050 signalled : 1
+0x058 pfn_destroy_cb : (null)
+0x060 p_mad_head : (null)
+0x068 p_mad_tail : 0xfffffadf`364ec558 _ib_mad_element
+0x070 pfn_cb : 0xfffffadf`243ad050 void ibbus_fffffadf24334000!__cm_handler+0
+0x078 p_irp : (null)
+0x080 listen_item : _cl_rbmap_item
+0x0a8 rem_id_item : _cl_rbmap_item
+0x0d0 rem_qp_item : _cl_rbmap_item
+0x0f8 local_comm_id : 0x2000001
+0x0fc remote_comm_id : 0x831d3e8b
+0x100 local_ca_guid : 0xdc8c0200`03c90200
+0x108 remote_ca_guid : 0xe0000001`02971300
+0x110 remote_qpn : 0xa04b700
+0x114 sq_psn : 0xa04b700
+0x118 rq_psn : 0x48000000
+0x11c resp_res : 0x4 ''
+0x11d init_depth : 0x4 ''
+0x11e rnr_nak_timeout : 0x8 ''
+0x120 local_qpn : 0x48000000
+0x124 pkey : 0xffff
+0x126 req_init_depth : 0 ''
+0x128 av : [2] _al_kcep_av
+0x1b8 idx_primary : 0 ''
+0x1c0 alt_av : _al_kcep_av
+0x208 alt_2pkt_life : 0 ''
+0x209 max_2pkt_life : 0x13 ''
+0x20a target_ack_delay : 0x14 ''
+0x20b local_ack_delay : 0xf ''
+0x20c state : 3 ( CEP_STATE_TIMEWAIT )
+0x210 was_active : 1
+0x218 h_mad_svc : 0xfffffadf`36845730 _al_mad_svc
+0x220 p_send_mad : (null)
+0x228 ref_cnt : 1
+0x230 tid : 0x8b3e1d83`06000000
+0x238 max_cm_retries : 0x3 ''
+0x23c retry_timeout : 0x1920
+0x240 timewait_timer : _KTIMER
+0x280 timewait_time : _LARGE_INTEGER 0xffffffff`fc28f600
+0x288 timewait_item : _cl_list_item
+0x2a0 p_mad : (null)
+0x2a8 mads : _mads
+0x3a8 irp_que : _LIST_ENTRY [ 0xfffffadf`3682a928 - 0xfffffadf`3682a928 ]
+0x3b8 psize : 0 ''
+0x3b9 pdata : [196] ""
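A small helper for reading the dump, under two assumptions that are not confirmed here: that the QPN and comm ID fields are stored in network byte order on a little-endian host, and that timewait_time follows the usual Windows kernel timer convention where a negative value is a relative interval in 100 ns units. If both hold, remote_qpn 0x0a04b700 reads as 0x00b7040a, local_qpn 0x48000000 reads as 0x48, and timewait_time is a bit over 6.4 seconds. The byte-swapped remote_comm_id (0x8b3e1d83) also matches the upper half of tid.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value
 * (network <-> little-endian host order). */
static uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) |
           ((v & 0x0000ff00u) << 8)  |
           ((v & 0x00ff0000u) >> 8)  |
           ((v & 0xff000000u) >> 24);
}

/* Convert a negative (relative) due time expressed in 100 ns units
 * to whole milliseconds. Assumes due_time < 0. */
static int64_t relative_100ns_to_ms(int64_t due_time)
{
    return (-due_time) / 10000;
}
```

For example, swap32(0x831d3e8b) gives 0x8b3e1d83, and relative_100ns_to_ms(-64424448), the signed value of 0xffffffff`fc28f600, gives 6442.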
Thanks,
James
-----Original Message-----
From: Fab Tillier [mailto:ftillier at windows.microsoft.com]
Sent: Wednesday, October 15, 2008 12:41 PM
To: James Yang; ofw at lists.openfabrics.org
Subject: RE: Disconnection problem and AL reference
Hi James,
> After calling cm_dreq(), my callback for it is not called and the
> workitems are also never get called. After cm_dreq() time out, I call
> destroy_qp() with status successful. But the reference count of QP is
> always 1.
When cm_dreq times out, you should get a DREP notification, and the QP should be in the error state.
Check the reference count on the QP before you call cm_dreq. If you can, also check the reference count on the CEP for your QP. When the cm_dreq times out (why is it timing out? Did the other side not reply?), check the CEP reference count again. The timeout is processed in the __cep_mad_send_cb function in al_cm_cep.c. Then walk the code to make sure the DREP callback is invoked or, if it isn't, to find out why. The CEP takes a reference on the QP; make sure that reference gets released when the QP is destroyed (QP destruction should destroy the CEP in the destroying callback for the QP object).
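The reference relationship described above can be sketched as follows. This is only an illustration of the pattern, not IBAL code: the ref_qp_t and ref_cep_t types and both functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the QP and CEP objects. */
typedef struct { int ref_cnt; } ref_qp_t;
typedef struct { ref_qp_t *qp; } ref_cep_t;

/* Binding a CEP to a QP takes one reference on the QP. */
static void ref_cep_bind(ref_cep_t *cep, ref_qp_t *qp)
{
    qp->ref_cnt++;
    cep->qp = qp;
}

/* Destroying the CEP must drop that reference exactly once;
 * if this step is skipped, the QP count stays at 1. */
static void ref_cep_destroy(ref_cep_t *cep)
{
    if (cep->qp != NULL) {
        assert(cep->qp->ref_cnt > 0);
        cep->qp->ref_cnt--;
        cep->qp = NULL;
    }
}
```

In this model, a QP whose count remains 1 after destruction points to a CEP that was never destroyed, which is one possible reading of the symptom.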
-Fab