[ofw] Patch: [ipoib] Make sure that the dlid is zero if it is not in the list.

Tzachi Dar tzachid at mellanox.co.il
Mon Nov 10 09:16:25 PST 2008


I have applied the minimum change (set the dlid to 0) on 1745, 1746.
 
This should stop the blue screen.
 
Thanks
Tzachi
 


________________________________

	From: ofw-bounces at lists.openfabrics.org [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Tzachi Dar
	Sent: Thursday, November 06, 2008 10:47 PM
	To: ofw at lists.openfabrics.org
	Subject: [ofw] Patch: [ipoib] Make sure that the dlid is zero if it is not in the list.
	
	
	First, I'm happy to say that I have found the source of the blue
screens that we had in the lists.
	 
	The problem happens when the function __mcast_cb tries to insert
	an endpoint into the dlid list and fails (see the call stack below).
	 
	As a result, we have an endpoint that is not in the dlid list
	but has a nonzero dlid. When we later remove the endpoint from the
	list, we also try to remove it from the dlid list and crash.
	 
	This checkin makes sure that if the insert into the list fails,
	dlid is set to 0. We then do not try to remove the endpoint from the
	list, so there is no blue screen.
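
	The failure mode and the fix described above can be sketched in a
	few lines of standalone C. This is only an illustration under
	stated assumptions: the toy_* types and functions are hypothetical
	stand-ins, not the complib or ipoib code. The only property borrowed
	from the patch is that the insert returns the item already stored
	under the key when the key is a duplicate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins (hypothetical, NOT the complib types): a small table keyed by LID. */
#define TOY_MAP_SLOTS 8

typedef struct toy_endpt {
    uint16_t dlid;                      /* 0 means "not in the LID map" */
} toy_endpt_t;

typedef struct toy_lid_map {
    uint16_t     keys[TOY_MAP_SLOTS];
    toy_endpt_t *items[TOY_MAP_SLOTS];  /* NULL marks a free slot */
} toy_lid_map_t;

/* Returns the item stored under key: e on success, the existing item on a
 * duplicate key, or NULL if the table is full. */
static toy_endpt_t *toy_map_insert(toy_lid_map_t *m, uint16_t key, toy_endpt_t *e)
{
    size_t i, free_slot = TOY_MAP_SLOTS;

    assert(m != NULL && e != NULL);
    for (i = 0; i < TOY_MAP_SLOTS; i++) {
        if (m->items[i] != NULL && m->keys[i] == key)
            return m->items[i];         /* duplicate key: existing owner stays */
        if (m->items[i] == NULL && free_slot == TOY_MAP_SLOTS)
            free_slot = i;
    }
    if (free_slot == TOY_MAP_SLOTS)
        return NULL;
    m->keys[free_slot] = key;
    m->items[free_slot] = e;
    return e;
}

/* Mirrors the idea of the patch: if the insert did not store e, clear dlid
 * so that teardown will not try to remove e from the map. Returns 1 if stored. */
static int toy_endpt_track(toy_lid_map_t *m, toy_endpt_t *e)
{
    assert(m != NULL && e != NULL);
    if (e->dlid == 0)
        return 0;
    if (toy_map_insert(m, e->dlid, e) != e) {
        e->dlid = 0;
        return 0;
    }
    return 1;
}

/* Teardown: returns 1 if e was removed, 0 if e was never tracked, and -1 for
 * the inconsistent state (nonzero dlid but not in the map) that led to the crash. */
static int toy_endpt_untrack(toy_lid_map_t *m, toy_endpt_t *e)
{
    size_t i;

    assert(m != NULL && e != NULL);
    if (e->dlid == 0)
        return 0;
    for (i = 0; i < TOY_MAP_SLOTS; i++) {
        if (m->items[i] == e) {
            m->items[i] = NULL;
            e->dlid = 0;
            return 1;
        }
    }
    return -1;
}
```

	With two endpoints that carry the same dlid, the second call to
	toy_endpt_track() fails and leaves dlid at 0, so toy_endpt_untrack()
	on that endpoint is a no-op instead of hitting the -1 case.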
	 
	The real issue is what else we should do. I'm afraid that
	things will not work, as this endpoint has no dlid.
	My ideas are:
	 
	1) Remove this endpoint from the list.
	2) Remove the other endpoint from the list (the one that has the
same dlid)
	3) Force a reset by NDIS, to start things all over again.
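
	To make options 1 and 2 concrete, here is a rough standalone C
	sketch of the two collision policies. Everything in it is
	hypothetical (the toy_* names, the small direct-indexed LID table,
	the in_endpt_list flag); it is not ipoib code and it does not model
	option 3, the NDIS reset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_LID 16  /* hypothetical small LID space for the sketch */

typedef struct toy_endpt {
    uint16_t dlid;           /* 0 means "owns no LID slot" */
    int      in_endpt_list;  /* 1 while the endpoint is still listed */
} toy_endpt_t;

typedef enum {
    TOY_DROP_NEW,   /* option 1: remove the endpoint that failed to insert */
    TOY_EVICT_OLD   /* option 2: remove the endpoint that already owns the dlid */
} toy_policy_t;

/* Tries to give e the slot for e->dlid. Returns NULL when there is no
 * collision, otherwise the endpoint that lost the slot and was unlisted. */
static toy_endpt_t *toy_resolve(toy_endpt_t *by_lid[TOY_MAX_LID],
                                toy_endpt_t *e, toy_policy_t policy)
{
    toy_endpt_t *old;

    assert(by_lid != NULL && e != NULL);
    assert(e->dlid != 0 && e->dlid < TOY_MAX_LID);

    old = by_lid[e->dlid];
    if (old == NULL || old == e) {
        by_lid[e->dlid] = e;        /* free slot, or e already owns it */
        return NULL;
    }
    if (policy == TOY_EVICT_OLD) {
        by_lid[e->dlid] = e;        /* newcomer takes over the slot */
        old->dlid = 0;
        old->in_endpt_list = 0;
        return old;
    }
    /* TOY_DROP_NEW: the existing owner keeps the slot */
    e->dlid = 0;
    e->in_endpt_list = 0;
    return e;
}
```

	Either policy leaves exactly one endpoint per dlid and clears the
	dlid of the loser, so a later teardown has nothing to remove for it.
	Which endpoint is the stale one is the open question above.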
	 
	What are the community's thoughts?
	 
	 
	 
	Call stack of the program:
	Child-SP          RetAddr           Call Site
	fffffa60`051fa648 fffff800`017374a8 nt!DbgBreakPoint
	fffffa60`051fa650 fffffa60`053bfdd5 nt!RtlAssert+0x108
	fffffa60`051fab70 fffffa60`052e8f62 ipoib!__mcast_cb+0xc45 [s:\builds\3433\branches\mlnx_winof_2-0\ulp\ipoib\kernel\ipoib_port.c @ 6096]
	fffffa60`051fabf0 fffffa60`05264e0f ibbus!join_async_cb+0x4b2 [s:\builds\3433\branches\mlnx_winof_2-0\core\al\al_mcast.c @ 535]
	fffffa60`051fac90 fffffa60`0526ade5 ibbus!__cl_async_proc_worker+0xbf [s:\builds\3433\branches\mlnx_winof_2-0\core\complib\cl_async_proc.c @ 153]
	fffffa60`051face0 fffffa60`0526c0cc ibbus!__cl_thread_pool_routine+0x75 [s:\builds\3433\branches\mlnx_winof_2-0\core\complib\cl_threadpool.c @ 67]
	fffffa60`051fad20 fffff800`018c1de3 ibbus!__thread_callback+0x3c [s:\builds\3433\branches\mlnx_winof_2-0\core\complib\kernel\cl_thread.c @ 49]
	fffffa60`051fad50 fffff800`016d8536 nt!PspSystemThreadStartup+0x57
	fffffa60`051fad80 00000000`00000000 nt!KiStartSystemThread+0x16
	
	 
	 
	Index: Q:/projinf4/trunk/ulp/ipoib/kernel/ipoib_port.c
	===================================================================
	--- Q:/projinf4/trunk/ulp/ipoib/kernel/ipoib_port.c (revision 3441)
	+++ Q:/projinf4/trunk/ulp/ipoib/kernel/ipoib_port.c (revision 3442)
	@@ -5007,6 +5007,10 @@
	   p_qitem = cl_qmap_insert(
	    &p_port->endpt_mgr.lid_endpts, p_endpt->dlid, &p_endpt->lid_item );
	   CL_ASSERT( p_qitem == &p_endpt->lid_item );
	+  if (p_qitem != &p_endpt->lid_item) {
	+   // Since we failed to insert into the list, make sure it is not removed
	+   p_endpt->dlid =0;
	+  }
	  }
	 
	  IPOIB_EXIT( IPOIB_DBG_ENDPT );
	@@ -6094,6 +6098,11 @@
	   p_qitem = cl_qmap_insert(
	    &p_port->endpt_mgr.lid_endpts, p_endpt->dlid, &p_endpt->lid_item );
	   CL_ASSERT( p_qitem == &p_endpt->lid_item );
	+  if (p_qitem != &p_endpt->lid_item) {
	+   // Since we failed to insert into the list, make sure it is not removed
	+   p_endpt->dlid =0;
	+  }
	+  
	  }
	  /* set flag that endpoint is use */
	  p_endpt->is_in_use = TRUE;
	
