[Openib-windows] IPoIB crash

Fabian Tillier ftillier at silverstorm.com
Tue Sep 5 13:22:16 PDT 2006


Hi Anatoly,

On 9/4/06, Anatoly Lisenko <anatoly4work at gmail.com> wrote:
>
> Hi Fabian,
>
> From time to time, we see blue screens in IPoIB (Bug Check 0xD1:
> DRIVER_IRQL_NOT_LESS_OR_EQUAL).
>
> OpenIB head revision 460.
>
>
> The call stack is:
>
> nt!KeBugCheckEx
> nt!KiBugCheckDispatch+0x74
> nt!KiPageFault+0x207
> ipoib!cl_fmap_remove_item+0x3c
> [d:\projects\win-ibhost\trunk\core\complib\cl_map.c @ 1005]
> ipoib!__endpt_mgr_reset_all+0x1e2
> [d:\projects\win-ibhost\trunk\ulp\ipoib\kernel\ipoib_port.c
> @ 4094]
> ipoib!ipoib_port_down+0x1fe
> [d:\projects\win-ibhost\trunk\ulp\ipoib\kernel\ipoib_port.c
> @ 5098]
> ipoib!__ipoib_pnp_cb+0x3f2
> [d:\projects\win-ibhost\trunk\ulp\ipoib\kernel\ipoib_adapter.c
> @ 615]
> ibbus!__pnp_notify_user+0x1a7
> [d:\projects\win-ibhost\trunk\core\al\kernel\al_pnp.c @
> 525]
> ibbus!__pnp_process_remove_port+0x1c2
> [d:\projects\win-ibhost\trunk\core\al\kernel\al_pnp.c @
> 1004]
> ibbus!__pnp_process_remove_ca+0x54
> [d:\projects\win-ibhost\trunk\core\al\kernel\al_pnp.c @
> 1049]
> ibbus!__cl_async_proc_worker+0x73
> [d:\projects\win-ibhost\trunk\core\complib\cl_async_proc.c
> @ 153]
> ibbus!__cl_thread_pool_routine+0x4b
> [d:\projects\win-ibhost\trunk\core\complib\cl_threadpool.c
> @ 66]
> ibbus!__thread_callback+0x28
> [d:\projects\win-ibhost\trunk\core\complib\kernel\cl_thread.c
> @ 49]
> nt!ObAssignSecurity+0x43e
> nt!KeInsertQueue+0x2e6
>
>
> From looking at code (ipoib_port.c @ 4094)
> if( p_endpt->dlid )
> {
>        cl_qmap_remove_item( &p_port->endpt_mgr.lid_endpts,
> &p_endpt->lid_item );
>        p_endpt->dlid = 0;
> }
>
> From analyzing the crash dump, I've found that
> p_endpt->lid_item->pool_item.list_item->p_next is NULL.
>
> The crash itself happens in the line "p_list_item->p_next->p_prev=
> p_list_item->p_prev" in the inline function __cl_primitive_remove() called
> from cl_fmap_remove_item()
>
> I've searched for unprotected changes of lid_item, and found the following (at
> __path_query_cb):
>
>  if( !p_endpt->dlid )
> {
>             cl_map_item_t   *p_qitem;
>
>             /* This is a subnet local endpoint that does not have its
>              * LID set. */
>             p_endpt->dlid = p_path->dlid;
>
>             /*
>              * Insert the item in the LID map so that locally routed unicast
>              * traffic will resolve it properly.
>              */
>             cl_obj_lock( &p_port->obj );
>
>             p_qitem = cl_qmap_insert( &p_port->endpt_mgr.lid_endpts,
>                 p_endpt->dlid, &p_endpt->lid_item );
>             CL_ASSERT( p_qitem == &p_endpt->lid_item );
>             cl_obj_unlock( &p_port->obj );
> }
>
> What do you say ?

That's definitely a bug.

> Do we need to lock the reference to p_endpt->dlid with cl_obj_lock/unlock(
> &p_endpt->obj ) ?

You need a lock, but it needs to be the port object's lock since that
is what is held when the LID is checked in __endpt_mgr_reset_all.

We need to take the port lock before the if( !p_endpt->dlid ).
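
To illustrate the ordering, here is a minimal user-space sketch, not the
driver code. It assumes a pthread mutex as a stand-in for
cl_obj_lock( &p_port->obj ), and the types and function names (port_t,
endpt_t, path_query_set_dlid, endpt_reset) are hypothetical. As I read the
dump, the reset path can see a non-zero dlid after the query callback has
set it but before lid_item is actually in the map, and then tries to remove
an item that was never inserted. Holding the same lock across both the
check and the update on each side closes that window:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the IPoIB endpoint and port objects. */
typedef struct endpt {
    uint16_t dlid;        /* 0 means "not in the LID map" */
    int      in_lid_map;  /* models membership of lid_item in lid_endpts */
} endpt_t;

typedef struct port {
    pthread_mutex_t lock; /* stands in for cl_obj_lock( &p_port->obj ) */
    int             n_lid_endpts;
} port_t;

/* Path query side: the dlid check and the map insert happen under one
 * lock hold, so no other thread can observe dlid != 0 without the item
 * also being in the map. */
static void path_query_set_dlid(port_t *p_port, endpt_t *p_endpt,
                                uint16_t dlid)
{
    pthread_mutex_lock(&p_port->lock);
    if (!p_endpt->dlid) {
        p_endpt->dlid = dlid;
        p_endpt->in_lid_map = 1;   /* models cl_qmap_insert */
        p_port->n_lid_endpts++;
    }
    pthread_mutex_unlock(&p_port->lock);
}

/* Reset side: removes the item only if dlid says it is in the map,
 * using the same lock as the insert path. */
static void endpt_reset(port_t *p_port, endpt_t *p_endpt)
{
    pthread_mutex_lock(&p_port->lock);
    if (p_endpt->dlid) {
        p_endpt->in_lid_map = 0;   /* models cl_qmap_remove_item */
        p_port->n_lid_endpts--;
        p_endpt->dlid = 0;
    }
    pthread_mutex_unlock(&p_port->lock);
}
```

The point is only that both paths treat "dlid is non-zero" and "lid_item
is in the map" as one invariant guarded by the port lock.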

Can you try the attached patch and see if it resolves the issue?  If
it does, let me know and I will check it in.

Thanks,

- Fab

Index: base/ulp/ipoib/kernel/ipoib_endpoint.c
===================================================================
--- base/ulp/ipoib/kernel/ipoib_endpoint.c	(revision 469)
+++ base/ulp/ipoib/kernel/ipoib_endpoint.c	(working copy)
@@ -408,6 +408,7 @@
 	av_attr.grh.src_gid = p_path->sgid;
 	av_attr.grh.dest_gid = p_path->dgid;
 	
+	cl_obj_lock( &p_port->obj );
 	if( !p_endpt->dlid )
 	{
 		cl_map_item_t	*p_qitem;
@@ -418,12 +419,11 @@
 		 * Insert the item in the LID map so that locally routed unicast
 		 * traffic will resolve it properly.
 		 */
-		cl_obj_lock( &p_port->obj );
 		p_qitem = cl_qmap_insert( &p_port->endpt_mgr.lid_endpts,
 			p_endpt->dlid, &p_endpt->lid_item );
 		CL_ASSERT( p_qitem == &p_endpt->lid_item );
-		cl_obj_unlock( &p_port->obj );
 	}
+	cl_obj_unlock( &p_port->obj );
 	av_attr.static_rate = ib_path_rec_rate( p_path );
 	av_attr.path_bits = 0;
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ipoib_dlid_lock.patch
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20060905/abfbb129/attachment.ksh>

