[Openib-windows] Wrong allocation of mads in al_mad_pool.c (user mode) line 800
Tzachi Dar
tzachid at mellanox.co.il
Wed Apr 5 08:28:54 PDT 2006
OK,
So here is another question:
When running SDP code with many connections simultaneously, I sometimes
get an assert with the following call stack:
ChildEBP RetAddr Args to Child
f78ae758 8086b5d0 80889f00 00000003 f7717000 nt!DbgBreakPoint
f78aea48 8086b6f8 baaaf050 baaaf020 0000021c nt!RtlAssert2+0x104
f78aea64 baaaf234 baaaf050 baaaf020 0000021c nt!RtlAssert+0x18
f78aea88 baaaecb1 8a834970 89c23a30 89ed5e18 ibal!__reject_mad+0x164
[q:\projinf1\trunk\core\al\kernel\al_cm_cep.c @ 540]
f78aead4 baaa93e5 8a834970 89ed5e18 8a6b72a8 ibal!__process_rep+0x531
[q:\projinf1\trunk\core\al\kernel\al_cm_cep.c @ 1348]
f78aeb00 baa78465 8a81a168 ffffffff 8a834970 ibal!__cep_mad_recv_cb+0x1e5
[q:\projinf1\trunk\core\al\kernel\al_cm_cep.c @ 1885]
f78aeb34 baa6dccc 8a81a168 ffffffff 89ed5e18 ibal!__mad_svc_recv_done+0xa55
[q:\projinf1\trunk\core\al\al_mad.c @ 2206]
f78aeb94 bab273be 8a81b870 89ed5e18 f77179c0 ibal!mad_disp_recv_done+0x12ac
[q:\projinf1\trunk\core\al\al_mad.c @ 1004]
f78aebc0 bab26c8d 8a81c1c8 89ed5e18 00000001 ibal!process_mad_recv+0x31e
[q:\projinf1\trunk\core\al\kernel\al_smi.c @ 2284]
f78aec50 bab26632 8a81c1c8 8a8316b0 ffffffff ibal!spl_qp_comp+0x29d
[q:\projinf1\trunk\core\al\kernel\al_smi.c @ 2125]
f78aec78 baaa467b 8a8316b0 ffffffff 8a81c1c8 ibal!spl_qp_recv_comp_cb+0x112
[q:\projinf1\trunk\core\al\kernel\al_smi.c @ 1995]
f78aec94 bad69170 8a8316b0 f78aeca4 00000000 ibal!ci_ca_comp_cb+0x6b
[q:\projinf1\trunk\core\al\kernel\al_ci_ca.c @ 323]
f78aecb8 bad8aff4 8a81f6d8 8a860e38 85000000 mthca!cq_comp_handler+0xc0
[q:\projinf1\trunk\hw\mthca\kernel\hca_data.c @ 326]
f78aecd0 bad8d8a1 8a6b72a8 00000085 8a6b74a8 mthca!mthca_cq_completion+0xa4
[q:\projinf1\trunk\hw\mthca\kernel\mthca_cq.c @ 234]
f78aed04 bad8d5b6 8a6b72a8 8a6b77d8 8a6b72a8 mthca!mthca_eq_int+0x81
[q:\projinf1\trunk\hw\mthca\kernel\mthca_eq.c @ 328]
f78aed28 80805d22 8a6b7840 8a6b77d8 00000000 mthca!mthca_tavor_dpc+0x36
[q:\projinf1\trunk\hw\mthca\kernel\mthca_eq.c @ 455]
f78aed50 80805c07 00000000 0000000e 00000000 nt!KiRetireDpcList+0x61
f78aed54 00000000 0000000e 00000000 00000000 nt!KiIdleLoop+0x28
In my case __reject_mad is called with p_cep->state ==
CEP_STATE_REQ_SENT and the reason is IB_REJ_STALE_CONN.
Actually, looking at the two functions __process_rep and __reject_mad,
it seems that every time the insert in __process_rep fails
(that is, if( __insert_cep( p_cep ) != p_cep )), we will reach an assert.
Can you tell what the problem here is?
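To make the suspected path concrete, here is a minimal sketch in C. It is a toy model, not the IBAL source: every model_* name is hypothetical, the map is a plain array, and the check in model_reject_mad is only an assumption that the real check at al_cm_cep.c line 540 does not expect the state/reason pair seen in my dump (REQ_SENT with a stale-connection reject). The function returns false where the real code would hit the assert.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical stand-ins, not the real IBAL definitions. */
typedef enum { MODEL_REQ_SENT, MODEL_OTHER } model_state_t;
typedef enum { MODEL_REJ_TIMEOUT, MODEL_REJ_STALE_CONN } model_reason_t;

typedef struct {
    model_state_t state;
    unsigned remote_comm_id;
} model_cep_t;

#define MODEL_MAX_CEPS 8

typedef struct {
    model_cep_t *slots[MODEL_MAX_CEPS];
    size_t count;
} model_map_t;

/* Stand-in for __insert_cep(): if another CEP already holds the same
 * remote comm ID, return that one; otherwise store p_cep and return it.
 * (If the toy map is full, p_cep is returned without being stored.) */
static model_cep_t *model_insert_cep(model_map_t *map, model_cep_t *p_cep)
{
    for (size_t i = 0; i < map->count; i++) {
        if (map->slots[i]->remote_comm_id == p_cep->remote_comm_id)
            return map->slots[i];
    }
    if (map->count < MODEL_MAX_CEPS)
        map->slots[map->count++] = p_cep;
    return p_cep;
}

/* Stand-in for __reject_mad().
 * ASSUMPTION: the real assert does not expect state REQ_SENT together
 * with a stale-connection reason. Returns false where it would fire. */
static bool model_reject_mad(const model_cep_t *p_cep, model_reason_t reason)
{
    return !(p_cep->state == MODEL_REQ_SENT &&
             reason == MODEL_REJ_STALE_CONN);
}

/* Stand-in for the REQ_SENT branch of __process_rep().
 * Returns true if the REP is handled without tripping the check. */
static bool model_process_rep(model_map_t *map, model_cep_t *p_cep,
                              unsigned remote_comm_id)
{
    if (p_cep->state != MODEL_REQ_SENT)
        return true; /* other states are not modelled */

    p_cep->remote_comm_id = remote_comm_id;

    if (model_insert_cep(map, p_cep) != p_cep) {
        /* Duplicate found: the CEP is still in REQ_SENT when we reject. */
        return model_reject_mad(p_cep, MODEL_REJ_STALE_CONN);
    }

    p_cep->state = MODEL_OTHER; /* normal path moves on */
    return true;
}
```

In this model, calling model_process_rep for two CEPs in MODEL_REQ_SENT with the same remote comm ID succeeds for the first and returns false for the second, because the second CEP is still in the REQ_SENT state when the reject is built.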
Thanks
Tzachi
Some more information that might help:
1: kd> dt p_cep
Local var @ 0xf78aea94 Type _al_kcep*
0x89c23a30
+0x000 cid : 5
+0x004 context : 0x8a741a18
+0x008 p_cid : 0x8a45f05c _cep_cid
+0x010 sid : 0x22c80100`00000000
+0x018 port_guid : 0
+0x020 p_cmp_buf : (null)
+0x028 cmp_offset : 0 ''
+0x029 cmp_len : 0 ''
+0x02c p2p : 0
+0x030 al_item : _cl_list_item
+0x03c signalled : 0
+0x040 pfn_destroy_cb : (null)
+0x048 p_mad_head : (null)
+0x04c p_mad_tail : (null)
+0x050 pfn_cb : 0xbaa49af0 ibal!__cm_handler+0
+0x054 p_irp : (null)
+0x058 listen_item : _cl_rbmap_item
+0x06c rem_id_item : _cl_rbmap_item
+0x080 rem_qp_item : _cl_rbmap_item
+0x094 local_comm_id : 0x6000005
+0x098 remote_comm_id : 0x2000021
+0x0a0 local_ca_guid : 0xa0392a01`00c90000
+0x0a8 remote_ca_guid : 0x20e92a01`00c90200
+0x0b0 remote_qpn : 0x18044200
+0x0b4 sq_psn : 0x18044200
+0x0b8 rq_psn : 0x19046b00
+0x0bc resp_res : 0x4 ''
+0x0bd init_depth : 0x4 ''
+0x0be rnr_nak_timeout : 0x6 ''
+0x0c0 local_qpn : 0x19046b00
+0x0c4 pkey : 0xffff
+0x0c6 req_init_depth : 0 ''
+0x0c8 av : [2] _al_kcep_av
+0x158 idx_primary : 0 ''
+0x160 alt_av : _al_kcep_av
+0x1a8 alt_2pkt_life : 0 ''
+0x1a9 max_2pkt_life : 0x13 ''
+0x1aa target_ack_delay : 0xf ''
+0x1ab local_ack_delay : 0xf ''
+0x1ac state : 20000001 ( CEP_STATE_REQ_SENT )
+0x1b0 was_active : 1
+0x1b8 h_mad_svc : 0x8a81a168 _al_mad_svc
+0x1c0 p_send_mad : 0x8a6b6030 _ib_mad_element
+0x1c4 ref_cnt : 2
+0x1c8 tid : 0x5000005
+0x1d0 max_cm_retries : 0x4 ''
+0x1d4 retry_timeout : 0x1920
+0x1d8 timewait_timer : _KTIMER
+0x200 timewait_time : _LARGE_INTEGER 0x0
+0x208 timewait_item : _cl_list_item
+0x214 p_mad : (null)
+0x218 mads : _mads
1: kd> dt -r1 ibal!gp_cep_mgr
0x8a6b2398
+0x000 obj : _al_obj
+0x000 pool_item : _cl_pool_item
+0x014 p_parent_obj : 0x8a82a958 _al_obj
+0x018 p_ci_ca : (null)
+0x01c context : (null)
+0x020 async_item : _cl_async_proc_item
+0x038 event : _KEVENT
+0x048 timeout_ms : 0x2710
+0x04c desc_cnt : 0
+0x050 pfn_destroy : 0xbaa5aed0 ibal!sync_destroy_obj+0
+0x054 pfn_destroying : 0xbaab7950 ibal!__destroying_cep_mgr+0
+0x058 pfn_cleanup : (null)
+0x05c pfn_free : 0xbaab7f20 ibal!__free_cep_mgr+0
+0x060 user_destroy_cb : (null)
+0x068 lock : _cl_spinlock
+0x070 obj_list : _cl_qlist
+0x084 ref_cnt : 58
+0x088 list_item : _cl_list_item
+0x094 type : 0x16
+0x098 state : 2 ( CL_INITIALIZED )
+0x0a0 hdl : 0
+0x0a8 h_al : (null)
+0x0b0 hdl_valid : 0
+0x0b8 port_map : _cl_qmap
+0x000 root : _cl_map_item
+0x038 nil : _cl_map_item
+0x070 state : 2 ( CL_INITIALIZED )
+0x074 count : 2
+0x130 lock : 0xf78aeabc
+0x134 cid_vector : _cl_vector
+0x000 size : 0xff
+0x004 grow_size : 0xff
+0x008 capacity : 0xff
+0x00c element_size : 0x10
+0x010 pfn_init : 0xbaab8200 ibal!__cid_init+0
+0x014 pfn_dtor : (null)
+0x018 pfn_copy : 0xbab2b600 ibal!cl_vector_copy_general+0
+0x01c context : (null)
+0x020 alloc_list : _cl_qlist
+0x034 p_ptr_array : 0x8a8193c8 -> 0x8a45f00c
+0x038 state : 2 ( CL_INITIALIZED )
+0x170 free_cid : 0x1a
+0x174 listen_map : _cl_rbmap
+0x000 root : _cl_rbmap_item
+0x014 nil : _cl_rbmap_item
+0x028 state : 2 ( CL_INITIALIZED )
+0x02c count : 0
+0x1a4 conn_id_map : _cl_rbmap
+0x000 root : _cl_rbmap_item
+0x014 nil : _cl_rbmap_item
+0x028 state : 2 ( CL_INITIALIZED )
+0x02c count : 0x18
+0x1d4 conn_qp_map : _cl_rbmap
+0x000 root : _cl_rbmap_item
+0x014 nil : _cl_rbmap_item
+0x028 state : 2 ( CL_INITIALIZED )
+0x02c count : 0x19
+0x208 cep_pool : _NPAGED_LOOKASIDE_LIST
+0x000 L : _GENERAL_LOOKASIDE
+0x048 Lock__ObsoleteButDoNotDelete : 0
+0x258 req_pool : _NPAGED_LOOKASIDE_LIST
+0x000 L : _GENERAL_LOOKASIDE
+0x048 Lock__ObsoleteButDoNotDelete : 0
+0x2a8 timewait_timer : _cl_timer
+0x000 timer : _KTIMER
+0x028 dpc : _KDPC
+0x048 pfn_callback : 0xbaab6f80 ibal!__cep_timewait_cb+0
+0x04c context : (null)
+0x050 timeout_time : 0x5d63a4f8
+0x300 timewait_list : _cl_qlist
+0x000 end : _cl_list_item
+0x00c count : 2
+0x010 state : 2 ( CL_INITIALIZED )
+0x318 h_pnp : 0x8a70f490 _al_pnp
+0x000 obj : _al_obj
+0x0b8 async_item : _cl_async_proc_item
+0x0d0 p_sync_event : 0xf78e2874 _KEVENT
+0x0d4 list_item : _cl_list_item
+0x0e0 dereg_item : _cl_async_proc_item
+0x0f8 pnp_class : 2
+0x100 context_map : _cl_qmap
+0x178 p_rearm_irp : (null)
+0x17c p_dereg_irp : (null)
+0x180 pfn_pnp_cb : 0xbaaa6ad0 ibal!__cep_pnp_cb+0
> -----Original Message-----
> From: ftillier.sst at gmail.com [mailto:ftillier.sst at gmail.com]
> On Behalf Of Fabian Tillier
> Sent: Wednesday, April 05, 2006 12:37 AM
> To: Tzachi Dar
> Cc: openib-windows at openib.org
> Subject: Re: [Openib-windows] Wrong allocation of mads in
> al_mad_pool.c (user mode) line 800
>
> On 4/4/06, Tzachi Dar <tzachid at mellanox.co.il> wrote:
> > OK, I understand my mistake now.
> >
> > There wasn't a particular problem that I was trying to solve. When
> > looking for the error in handling mads that were received
> with GRH, I
> > saw this code, and I thought there was an error here.
> >
> > Thanks again
>
> No worries, Tzachi. Please ask all you want - it will mean
> more of us are intricately familiar with the internals of IBAL.
>
> - Fab
>
>
>