[openib-general] [Bug 214] New: IB Stack ASSERTS while handling stale connections.
bugzilla-daemon at openib.org
Mon Aug 28 16:39:30 PDT 2006
http://openib.org/bugzilla/show_bug.cgi?id=214
Summary: IB Stack ASSERTS while handling stale connections.
Product: OpenFabrics Windows
Version: unspecified
Platform: X86
OS/Version: Other
Status: NEW
Severity: critical
Priority: P1
Component: Core
AssignedTo: bugzilla at openib.org
ReportedBy: pgarg at xsigo.com
We are encountering a serious bug in the stack that occurs when there is a
stale connection in the list. Here is the call stack:
ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION} Breakpoint A breakpoint has
been reached.
DEFAULT_BUCKET_ID: STATUS_BREAKPOINT
BUGCHECK_STR: 0x0
CURRENT_IRQL: 2
ASSERT_DATA: p_item->p_map == p_map
ASSERT_FILE_LOCATION: k:\windows-openib\src\winib-461\core\complib\cl_map.c at
Line 422
LAST_CONTROL_TRANSFER: from 80873046 to 8087163c
STACK_TEXT:
f78a2978 80873046 ffdffa40 00000003 bac99480 nt!DbgBreakPoint
f78a2c60 ba82e054 ba82ded0 ba82de98 000001a6 nt!RtlAssert+0xba
f78a2c98 ba90bb85 8a1741ac 8928ad54 f78a2ce4 ibbus!cl_rbmap_remove_item+0xb4
[k:\windows-openib\src\winib-461\core\complib\cl_map.c @ 422]
f78a2ca8 ba9019af 8928ace8 ffdffa40 bac99480 ibbus!__remove_cep+0xb5
[k:\windows-openib\src\winib-461\core\al\kernel\al_cm_cep.c @ 2825]
f78a2ce4 ba900e0f 8928ace8 89b0dc00 20000001 ibbus!__process_rej+0x5ef
[k:\windows-openib\src\winib-461\core\al\kernel\al_cm_cep.c @ 939]
f78a2d08 ba903aae 8928ace8 bac8ba5b 20000001 ibbus!__process_stale+0x10f
[k:\windows-openib\src\winib-461\core\al\kernel\al_cm_cep.c @ 1019]
f78a2d44 ba8fd748 89b3c248 89adc5b8 f78a2d6c ibbus!__rep_handler+0x54e
[k:\windows-openib\src\winib-461\core\al\kernel\al_cm_cep.c @ 1436]
f78a2d70 ba8b35fe 8a116008 ffffffff 89b3c248 ibbus!__cep_mad_recv_cb+0x1e8
[k:\windows-openib\src\winib-461\core\al\kernel\al_cm_cep.c @ 1969]
f78a2da4 ba8a8caf 8a116008 ffffffff 89adc5b8 ibbus!__mad_svc_recv_done+0xa8e
[k:\windows-openib\src\winib-461\core\al\al_mad.c @ 2215]
f78a2e04 ba85356b 89ba6228 89adc5b8 8a1597e0 ibbus!mad_disp_recv_done+0x130f
[k:\windows-openib\src\winib-461\core\al\al_mad.c @ 1013]
f78a2e34 ba852dc6 8a0c7720 89adc5b8 88deb8c8 ibbus!process_mad_recv+0x34b
[k:\windows-openib\src\winib-461\core\al\kernel\al_smi.c @ 2309]
f78a2ec4 ba8526eb 8a0c7720 8a1578c8 ffffffff ibbus!spl_qp_comp+0x2a6
[k:\windows-openib\src\winib-461\core\al\kernel\al_smi.c @ 2135]
f78a2eec ba8683ab 8a1578c8 ffffffff 8a0c7720 ibbus!spl_qp_recv_comp_cb+0x11b
[k:\windows-openib\src\winib-461\core\al\kernel\al_smi.c @ 2005]
f78a2f08 bac723ca 8a1578c8 f78a2f18 00000000 ibbus!ci_ca_comp_cb+0x6b
[k:\windows-openib\src\winib-461\core\al\kernel\al_ci_ca.c @ 329]
f78a2f2c bac96e5f 8a1341a8 8a20f250 85000000 mthca!cq_comp_handler+0xca
[c:\winib-461\hw\mthca\kernel\hca_data.c @ 329]
f78a2f44 bac99701 8a159210 00000085 8a17a008 mthca!mthca_cq_completion+0xcf
[c:\winib-461\hw\mthca\kernel\mthca_cq.c @ 239]
f78a2f78 bac994b6 8a159210 8a159768 8a159210 mthca!mthca_eq_int+0x81
[c:\winib-461\hw\mthca\kernel\mthca_eq.c @ 328]
f78a2f9c 80831cb2 8a1597e0 8a159768 00000000 mthca!mthca_tavor_dpc+0x36
[c:\winib-461\hw\mthca\kernel\mthca_eq.c @ 455]
f78a2ff4 8088cf9f b94d7b1c 00000000 00000000 nt!KiRetireDpcList+0xca
f78a2ff8 b94d7b1c 00000000 00000000 00000000 nt!KiDispatchInterrupt+0x3f
WARNING: Frame IP not in any known module. Following frames may be wrong.
8088cf9f 00000000 0000000a bb837775 00000128 0xb94d7b1c
STACK_COMMAND: kb
FOLLOWUP_IP:
ibbus!cl_rbmap_remove_item+b4
[k:\windows-openib\src\winib-461\core\complib\cl_map.c @ 422]
ba82e054 c745e800000000 mov dword ptr [ebp-18h],0
FAULTING_SOURCE_CODE:
418:
419: CL_ASSERT( p_map );
420: CL_ASSERT( p_map->state == CL_INITIALIZED );
421: CL_ASSERT( p_item );
> 422: CL_ASSERT( p_item->p_map == p_map );
423:
424: if( p_item == cl_rbmap_end( p_map ) )
425: return;
426:
427: if( p_item->p_right == &p_map->nil )
The problem seems to start in the function __rep_handler, where the following
check detects that the new p_cep could not be inserted:
if( __insert_cep( p_cep ) != p_cep )
This seems to mean we have something stale in the list. We call the function
status = __process_stale( p_cep );
which calls the function __process_rej.
__process_rej then calls __remove_cep, which tries to remove the p_cep from
the list.
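The insert behavior described above can be sketched with a toy map. This is
only an illustration under stated assumptions, not the code in al_cm_cep.c or
cl_map.c: the names toy_map, toy_item, toy_insert and toy_demo_collision are
hypothetical, and a linked list stands in for the real map. It shows how an
insert that returns the already-present item on a key collision leaves the new
item outside the map, with no owning map recorded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the complib or CEP manager types. */
struct toy_map;

struct toy_item {
    unsigned key;            /* plays the role of the connection identifiers */
    struct toy_map *p_map;   /* owning map, NULL while not inserted */
    struct toy_item *next;
};

struct toy_map {
    struct toy_item *head;
};

/* Insert p_new unless an item with the same key is already present.
 * Returns p_new on success, or the existing item on a key collision. */
static struct toy_item *toy_insert(struct toy_map *map, struct toy_item *p_new)
{
    struct toy_item *it;

    for (it = map->head; it != NULL; it = it->next)
        if (it->key == p_new->key)
            return it;       /* stale entry stays; p_new is not linked in */

    p_new->p_map = map;
    p_new->next = map->head;
    map->head = p_new;
    return p_new;
}

/* Returns 1 if the second insert collides and leaves the new item unowned. */
static int toy_demo_collision(void)
{
    struct toy_map map = { NULL };
    struct toy_item old_cep = { 42u, NULL, NULL };
    struct toy_item new_cep = { 42u, NULL, NULL };

    if (toy_insert(&map, &old_cep) != &old_cep)
        return 0;

    return toy_insert(&map, &new_cep) == &old_cep && new_cep.p_map == NULL;
}
```

In this sketch, any later attempt to remove new_cep from the map would be
operating on an item the map never owned, which is the situation the assert
at line 422 reports.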
We think the problem is right here. This is the pointer to the new p_cep, which
was never inserted in the list because the check in the __insert_cep function
failed.
Now, instead of removing the old p_cep from the list, we are removing the new
one. The cl_rbmap_remove_item function doesn't really validate the pointer given
to it (beyond the CL_ASSERT) and always assumes the item was in the list.
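One way a caller could guard against this is to check ownership before
unlinking. The following is a minimal sketch under stated assumptions, not a
proposed patch to complib: toy_map, toy_item, toy_map_remove_checked and the
other names are hypothetical, and a doubly linked list with a sentinel stands
in for the red-black map. The only idea borrowed from the real code is the
p_map back-pointer that the failing assert compares.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the complib types. */
struct toy_map;

struct toy_item {
    struct toy_map *p_map;          /* owning map, NULL while not inserted */
    struct toy_item *prev, *next;
};

struct toy_map {
    struct toy_item head;           /* sentinel, never removed */
    size_t count;
};

static void toy_map_init(struct toy_map *map)
{
    map->head.p_map = map;
    map->head.prev = map->head.next = &map->head;
    map->count = 0;
}

static void toy_map_insert(struct toy_map *map, struct toy_item *item)
{
    item->p_map = map;
    item->prev = &map->head;
    item->next = map->head.next;
    map->head.next->prev = item;
    map->head.next = item;
    map->count++;
}

/* Remove item only if it belongs to map.
 * Returns 1 if the item was unlinked, 0 if it was not in this map. */
static int toy_map_remove_checked(struct toy_map *map, struct toy_item *item)
{
    if (item == NULL || item == &map->head || item->p_map != map)
        return 0;                   /* never inserted here: leave map untouched */

    item->prev->next = item->next;
    item->next->prev = item->prev;
    item->prev = item->next = NULL;
    item->p_map = NULL;             /* mark as no longer owned */
    map->count--;
    return 1;
}

/* Returns 1 if the never-inserted item is skipped and the real one removed. */
static int toy_demo_checked_remove(void)
{
    struct toy_map map;
    struct toy_item inserted = { NULL, NULL, NULL };
    struct toy_item never_inserted = { NULL, NULL, NULL };
    int skipped, removed;

    toy_map_init(&map);
    toy_map_insert(&map, &inserted);

    skipped = toy_map_remove_checked(&map, &never_inserted);
    removed = toy_map_remove_checked(&map, &inserted);

    return skipped == 0 && removed == 1 && map.count == 0;
}
```

Whether such a check belongs in the map code or in the caller (for example,
removing the item returned by the failed insert rather than the new one) is a
design decision this sketch does not settle.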
This also raises the question of why an item was already present in the list.
We are seeing this behavior when we repeatedly make q-pair connections to a
target, i.e. create a q-pair, destroy it, and then re-create it. It seems that if
we re-create the q-pair within a few seconds (about 3), the problem happens, and
if we wait for 5-10 seconds the problem seems to go away.
Is there a design limitation in the stack such that a q-pair connection to the
same target cannot be made again within a certain time period? If yes, what is
the time period?
If not, what should we be doing to ensure proper cleanup?
Even if there is such a limitation, I guess there is still a bug here that the
stack should be able to handle without asserting.
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.