[openib-general] [Bug 203] Crash on shutdown, timer callback, build 459

bugzilla-daemon at openib.org bugzilla-daemon at openib.org
Wed Aug 23 01:55:28 PDT 2006


http://openib.org/bugzilla/show_bug.cgi?id=203


jbottorff at xsigo.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jbottorff at xsigo.com




------- Comment #1 from jbottorff at xsigo.com  2006-08-23 01:55 -------
I've trapped a write of 0x1 to the dpc context field of a mad data structure.

The stack looks like this just after the write:

f797ab00 ba9de265 ibbus!ib_cancel_mad+0x6c0
[k:\windows-openib\src\winib-459\core\al\al_mad.c @ 1831]
f797ab14 ba984d68 ibbus!al_cancel_sa_req+0x25
[k:\windows-openib\src\winib-459\core\al\al_query.h @ 140]
f797ab28 ba82ec4c ibbus!ib_cancel_query+0x328
[k:\windows-openib\src\winib-459\core\al\al.c @ 429]
f797ac00 ba7fe269 ipoib!ipoib_port_down+0x13c
[k:\windows-openib\src\winib-459\ulp\ipoib\kernel\ipoib_port.c @ 5066]
f797ac74 ba991da1 ipoib!__ipoib_pnp_cb+0xe89
[k:\windows-openib\src\winib-459\ulp\ipoib\kernel\ipoib_adapter.c @ 690]
f797acdc ba994f92 ibbus!__pnp_notify_user+0x561
[k:\windows-openib\src\winib-459\core\al\kernel\al_pnp.c @ 523]
f797ad04 ba994cb1 ibbus!__pnp_process_port_forward+0x172
[k:\windows-openib\src\winib-459\core\al\kernel\al_pnp.c @ 1230]
f797ad48 ba99479a ibbus!__pnp_check_ports+0x411
[k:\windows-openib\src\winib-459\core\al\kernel\al_pnp.c @ 1433]
f797ad70 ba950884 ibbus!__pnp_check_events+0x19a
[k:\windows-openib\src\winib-459\core\al\kernel\al_pnp.c @ 1510]
f797ad8c ba956b54 ibbus!__cl_async_proc_worker+0x94
[k:\windows-openib\src\winib-459\core\complib\cl_async_proc.c @ 153]
f797ada0 ba958c0c ibbus!__cl_thread_pool_routine+0x54
[k:\windows-openib\src\winib-459\core\complib\cl_threadpool.c @ 67]
f797adac 80a07678 ibbus!__thread_callback+0x2c
[k:\windows-openib\src\winib-459\core\complib\kernel\cl_thread.c @ 49]
f797addc 80781346 nt!PspSystemThreadStartup+0x2e
00000000 00000000 nt!KiThreadStartup+0x16

This appears to be the cancellation of an outstanding MAD query when the port
goes down, an event that happens at shutdown and at other, irregular times.

The code that causes the dpc corruption is core\al\al_mad.c about line 1826:

if( !p_list_item )
{
  cl_spinlock_release( &h_mad_svc->obj.lock );
  AL_PRINT( TRACE_LEVEL_INFORMATION, AL_DBG_MAD_SVC, ("mad not found\n") );
  return IB_NOT_FOUND;
}

/* Mark the MAD as having been canceled. */
h_send = PARENT_STRUCT( p_list_item, al_mad_send_t, pool_item );
h_send->canceled = TRUE;

The local pointer h_send does not seem to point at a valid al_mad_send_t, so
the assignment of TRUE to the canceled field is actually corrupting the dpc
context field.
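
To illustrate how a container-style macro such as PARENT_STRUCT can turn a
list pointer into a stray write, here is a minimal sketch. The structure
layouts and the helper canceled_write_address are hypothetical stand-ins
chosen for illustration, not the real OpenIB definitions. The helper only
computes where the store to the canceled field would land for a given list
item pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; NOT the real OpenIB structure layouts. */
typedef struct list_item {
    struct list_item *p_next;
    struct list_item *p_prev;
} list_item_t;

/* An element that lives on the send list. */
typedef struct mad_send {
    list_item_t pool_item;
    int         canceled;
} mad_send_t;

/* The owner of the list; send_list is a header, not an element. */
typedef struct mad_svc {
    list_item_t send_list;
    int         unrelated_field; /* stands in for whatever follows the header */
} mad_svc_t;

/*
 * Address that "h_send->canceled = TRUE" would write to if h_send were
 * derived from p_item by subtracting the offset of pool_item. The
 * arithmetic is the same whether or not p_item really sits inside a
 * mad_send_t, which is why a wrong p_item silently hits other memory.
 */
uintptr_t canceled_write_address(const list_item_t *p_item)
{
    uintptr_t h_send = (uintptr_t)p_item - offsetof(mad_send_t, pool_item);
    return h_send + offsetof(mad_send_t, canceled);
}
```

With these assumed layouts, passing the address of a real pool_item yields
the address of that element's canceled field, while passing the address of
the send_list header yields the address of unrelated_field in the owning
structure.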

A structure dump of p_list_item says:

1: kd> dt p_list_item
Local var @ 0xf797aafc Type _cl_list_item*
0x88e76f10 
   +0x000 p_next           : 0x88e76f10 _cl_list_item
   +0x004 p_prev           : 0x88e76f10 _cl_list_item
   +0x008 p_list           : 0x88e76f10 _cl_qlist

This address, 0x88e76f10, is the same as the address of the send_list field in
the local h_mad_svc, and I believe it represents an empty list header: p_next
and p_prev both point back at the header itself. This suggests the test for
null is an incorrect test for the list being empty.
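
The difference between a null test and an end-of-list test can be sketched
with a small circular list that uses a sentinel header. All names here
(qitem_t, qlist_t, qlist_find and so on) are hypothetical and this is not the
complib API. The sketch only assumes what the dump above suggests: a search
that finds nothing hands back the header, never NULL.

```c
#include <stddef.h>

/* Hypothetical circular doubly linked list with a sentinel header. */
typedef struct qitem {
    struct qitem *p_next;
    struct qitem *p_prev;
} qitem_t;

typedef struct qlist {
    qitem_t end; /* sentinel: an empty list points at itself */
} qlist_t;

void qlist_init(qlist_t *p_list)
{
    p_list->end.p_next = &p_list->end;
    p_list->end.p_prev = &p_list->end;
}

/* The "not found" marker is the sentinel, not NULL. */
qitem_t *qlist_end(qlist_t *p_list)
{
    return &p_list->end;
}

void qlist_insert_tail(qlist_t *p_list, qitem_t *p_item)
{
    p_item->p_prev = p_list->end.p_prev;
    p_item->p_next = &p_list->end;
    p_list->end.p_prev->p_next = p_item;
    p_list->end.p_prev = p_item;
}

typedef int (*match_fn)(const qitem_t *p_item, void *context);

/* Example predicate: match on the item's own address. */
int match_ptr(const qitem_t *p_item, void *context)
{
    return p_item == (const qitem_t *)context;
}

/* Returns the first matching item, or qlist_end(p_list) if none matches. */
qitem_t *qlist_find(qlist_t *p_list, match_fn fn, void *context)
{
    qitem_t *p_item = p_list->end.p_next;
    while (p_item != &p_list->end && !fn(p_item, context))
        p_item = p_item->p_next;
    return p_item;
}
```

With a list like this, a caller that checks the result against NULL never
takes the "not found" branch; the check has to compare the result with
qlist_end(p_list) instead.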

There is also another case that looks like an incorrect list test in the same
source file.




------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


