[ofw] Patch: [ipoib] Make sure that the dlid is zero if it is not in the list.

Tzachi Dar tzachid at mellanox.co.il
Thu Nov 6 12:47:00 PST 2008


First, I'm happy to say that I have found the source of the blue screens
that we had in the lists.
 
The problem happens when the function __mcast_cb tries to insert an
endpoint into the dlid list and fails (see the call stack below).
 
As a result we have an endpoint that is not in the dlid list but has a
dlid that is not zero. When we take the endpoint from the list,
we try to remove it from the dlid list as well, and crash.
 
This checkin makes sure that if the insert into the list fails, dlid is
set to 0, so we will not try to remove the endpoint from the list and
there is no blue screen.
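
To make the failure mode concrete, here is a minimal, self-contained
sketch in C. It is not the driver code: toy_map_t, toy_endpt_t and the
toy_* functions are hypothetical stand-ins for the lid_endpts map and
the ipoib endpoint. The only assumption carried over is that the insert
hands back the item already holding the key when the key is a
duplicate, which is what the p_qitem comparison in the patch relies on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAP_CAP 8 /* hypothetical capacity, only for this sketch */

/* Toy endpoint: dlid == 0 means "not in the dlid map". */
typedef struct toy_endpt {
    uint16_t dlid;
} toy_endpt_t;

/* Toy keyed map standing in for the real lid_endpts map. */
typedef struct toy_map {
    uint16_t     keys[TOY_MAP_CAP];
    toy_endpt_t *items[TOY_MAP_CAP];
    size_t       count;
} toy_map_t;

/* Returns the item that owns `key` after the call: `item` itself on
 * success, the existing owner on a duplicate key, NULL if full. */
static toy_endpt_t *toy_map_insert(toy_map_t *map, uint16_t key,
                                   toy_endpt_t *item)
{
    for (size_t i = 0; i < map->count; i++) {
        if (map->keys[i] == key)
            return map->items[i];
    }
    if (map->count == TOY_MAP_CAP)
        return NULL;
    map->keys[map->count] = key;
    map->items[map->count] = item;
    map->count++;
    return item;
}

/* Mirrors the patch: if the insert did not take, clear dlid so that
 * the removal path never looks for this endpoint in the map. */
static int toy_endpt_insert(toy_map_t *map, toy_endpt_t *endpt,
                            uint16_t dlid)
{
    endpt->dlid = dlid;
    if (toy_map_insert(map, dlid, endpt) != endpt) {
        endpt->dlid = 0;
        return 0;
    }
    return 1;
}

/* Removal trusts dlid != 0 to mean "present in the map". Without the
 * clearing above, a failed insert would reach the assert below. */
static int toy_endpt_remove(toy_map_t *map, toy_endpt_t *endpt)
{
    if (endpt->dlid == 0)
        return 0; /* never inserted, nothing to do */
    for (size_t i = 0; i < map->count; i++) {
        if (map->items[i] == endpt) {
            map->count--;
            map->keys[i] = map->keys[map->count];
            map->items[i] = map->items[map->count];
            endpt->dlid = 0;
            return 1;
        }
    }
    assert(0 && "endpoint has a dlid but is not in the map");
    return 0;
}
```

With two endpoints that share dlid 5, the second toy_endpt_insert fails
and leaves dlid at 0, so a later toy_endpt_remove on that endpoint does
nothing instead of tripping the assert.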
 
The real issue is what else we should do. I'm afraid that things will
not work, as this endpoint has no dlid.
My ideas are:
 
1) Remove this endpoint from the list.
2) Remove the other endpoint from the list (the one that has the same
dlid)
3) Force a reset by NDIS, to start things all over again.
 
What are the community's thoughts?
 
 
 
call stack of the program:
Child-SP          RetAddr           Call Site
fffffa60`051fa648 fffff800`017374a8 nt!DbgBreakPoint
fffffa60`051fa650 fffffa60`053bfdd5 nt!RtlAssert+0x108
fffffa60`051fab70 fffffa60`052e8f62 ipoib!__mcast_cb+0xc45 [s:\builds\3433\branches\mlnx_winof_2-0\ulp\ipoib\kernel\ipoib_port.c @ 6096]
fffffa60`051fabf0 fffffa60`05264e0f ibbus!join_async_cb+0x4b2 [s:\builds\3433\branches\mlnx_winof_2-0\core\al\al_mcast.c @ 535]
fffffa60`051fac90 fffffa60`0526ade5 ibbus!__cl_async_proc_worker+0xbf [s:\builds\3433\branches\mlnx_winof_2-0\core\complib\cl_async_proc.c @ 153]
fffffa60`051face0 fffffa60`0526c0cc ibbus!__cl_thread_pool_routine+0x75 [s:\builds\3433\branches\mlnx_winof_2-0\core\complib\cl_threadpool.c @ 67]
fffffa60`051fad20 fffff800`018c1de3 ibbus!__thread_callback+0x3c [s:\builds\3433\branches\mlnx_winof_2-0\core\complib\kernel\cl_thread.c @ 49]
fffffa60`051fad50 fffff800`016d8536 nt!PspSystemThreadStartup+0x57
fffffa60`051fad80 00000000`00000000 nt!KiStartSystemThread+0x16

 
 
Index: Q:/projinf4/trunk/ulp/ipoib/kernel/ipoib_port.c
===================================================================
--- Q:/projinf4/trunk/ulp/ipoib/kernel/ipoib_port.c (revision 3441)
+++ Q:/projinf4/trunk/ulp/ipoib/kernel/ipoib_port.c (revision 3442)
@@ -5007,6 +5007,10 @@
   p_qitem = cl_qmap_insert(
    &p_port->endpt_mgr.lid_endpts, p_endpt->dlid, &p_endpt->lid_item );
   CL_ASSERT( p_qitem == &p_endpt->lid_item );
+  if (p_qitem != &p_endpt->lid_item) {
+   // Since we failed to insert into the list, make sure it is not removed
+   p_endpt->dlid =0;
+  }
  }
 
  IPOIB_EXIT( IPOIB_DBG_ENDPT );
@@ -6094,6 +6098,11 @@
   p_qitem = cl_qmap_insert(
    &p_port->endpt_mgr.lid_endpts, p_endpt->dlid, &p_endpt->lid_item );
   CL_ASSERT( p_qitem == &p_endpt->lid_item );
+  if (p_qitem != &p_endpt->lid_item) {
+   // Since we failed to insert into the list, make sure it is not removed
+   p_endpt->dlid =0;
+  }
+  
  }
  /* set flag that endpoint is use */
  p_endpt->is_in_use = TRUE;
