[ofw] RE: OFED 1.3/WinOF 1.1/Win2k3R2X64 BSOD

Fab Tillier ftillier at windows.microsoft.com
Mon Jun 30 10:13:16 PDT 2008


Yes, the fix is in my list of patches.  I need to see how things are shaping up.  I have significant changes to the NetworkDirect connection support (al_ndi_cm.c) so it's a bit challenging to break things out without duplicating a lot of work.

I'll probably break things out so that they're digestible.  I'm hoping to be done sending my changes by midweek.

-Fab

>From: ofw-bounces at lists.openfabrics.org
>[mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Leonid Keller
>Sent: Monday, June 30, 2008 9:19 AM
>To: Eleanor Witiak; Fab Tillier
>Cc: ofw at lists.openfabrics.org; AndInc at aol.com
>Subject: [ofw] RE: OFED 1.3/WinOF 1.1/Win2k3R2X64 BSOD
>
>This fix, I believe, is part of the large patch that Fab is adding
>now, part by part.
>Fab, is that right, and when do you estimate this part will reach
>the trunk?
>
>
>________________________________
>
>       From: Eleanor Witiak [mailto:eleanor.witiak at qlogic.com]
>       Sent: Monday, June 30, 2008 6:07 PM
>       To: Leonid Keller
>       Cc: AndInc at aol.com; sean.hefty at intel.com;
>ofw at lists.openfabrics.org
>       Subject: RE: OFED 1.3/WinOF 1.1/Win2k3R2X64 BSOD
>       PR 1029, which patch 1223 fixed, did produce a "Bad Pool
>Caller" BSOD, the same as the crash below.  Part of the stack trace
>below is also similar to what I got; however, Mike's crash does not have
>SRP on the stack as mine did.  Mike, can you try your test again with my
>patch?
>
>       Leonid: Also, while working on PR 1029, I ran into an IBAL
>problem that I sent to you.  I have attached our mail correspondence.  I
>created a temporary patch in IBAL (without my patch 1223) just to see
>whether it also fixed my "Bad Pool Caller" BSOD, and it did.  I also ran
>with the same temporary IBAL patch while trying to reproduce PR 1037, and
>it got rid of that BSOD as well.  I think that Mike's crash might be
>running into this problem.  Is your patch ready?  If so, I would love to
>test with it.
>
>       Thanks,
>       Eleanor
>
>
>________________________________
>
>       From: Leonid Keller [mailto:leonid at mellanox.co.il]
>       Sent: Monday, June 30, 2008 10:18 AM
>       To: Eleanor Witiak
>       Cc: AndInc at aol.com; sean.hefty at intel.com;
>ofw at lists.openfabrics.org
>       Subject: RE: OFED 1.3/WinOF 1.1/Win2k3R2X64 BSOD
>
>       Thanks, but I meant to ask whether this crash looks like the
>one you solved in 1223.
>
>
>________________________________
>
>               From: Eleanor Witiak [mailto:eleanor.witiak at qlogic.com]
>               Sent: Monday, June 30, 2008 4:40 PM
>               To: Leonid Keller; AndInc at aol.com; sean.hefty at intel.com;
>ofw at lists.openfabrics.org
>               Subject: RE: OFED 1.3/WinOF 1.1/Win2k3R2X64 BSOD
>               Yes, the patch did come after the 1.1 release.  The
>patch revision # is 1223; the affected files are srp_connection.c and
>srp_session.c.
>
>               Eleanor
>
>
>________________________________
>
>               From: Leonid Keller [mailto:leonid at mellanox.co.il]
>               Sent: Monday, June 30, 2008 4:34 AM
>               To: AndInc at aol.com; sean.hefty at intel.com;
>ofw at lists.openfabrics.org; Eleanor Witiak
>               Subject: RE: OFED 1.3/WinOF 1.1/Win2k3R2X64 BSOD
>
>               a) don't know;
>               b) may be caused by a);
>               c) may be caused by b).
>
>               A very important patch from Eleanor (WinOF 1223), which
>prevents a BSOD upon sudden srpt disconnection, came in after the release
>was closed.
>               Eleanor, could you check whether that is the case here?
>
>               Here is some more information, based on the minidumps
>that were sent:
>
>               1: kd> !analyze -v
>               BAD_POOL_CALLER (c2)
>               The current thread is making a bad pool request. Typically this is at a bad IRQL level or double freeing the same allocation, etc.
>               Arguments:
>               Arg1: 0000000000000007, Attempt to free pool which was already freed
>               Arg2: 000000000000121a, (reserved)
>               Arg3: 00000000012b0011, Memory contents of the pool block
>               Arg4: fffffadf99483c50, Address of the block of pool being deallocated
>
>               Debugging Details:
>               ------------------
>
>
>               POOL_ADDRESS:  fffffadf99483c50
>
>               FREED_POOL_TAG:  priv
>
>               BUGCHECK_STR:  0xc2_7_priv
>
>               CUSTOMER_CRASH_COUNT:  1
>
>               DEFAULT_BUCKET_ID:  DRIVER_FAULT_SERVER_MINIDUMP
>
>               PROCESS_NAME:  System
>
>               CURRENT_IRQL:  0
>
>               LAST_CONTROL_TRANSFER:  from fffff800011aa769 to fffff8000102e950
>
>               STACK_TEXT:
>               fffffadf`90d7bbc8 fffff800`011aa769 : 00000000`000000c2 00000000`00000007 00000000`0000121a 00000000`012b0011 : nt!KeBugCheckEx
>               fffffadf`90d7bbd0 fffffadf`8f554621 : fffffadf`99483c50 00000000`00000080 fffffadf`99483c50 00000000`00000080 : nt!ExFreePoolWithTag+0x401
>               fffffadf`90d7bc90 fffffadf`8f51f568 : fffffadf`9c813c00 fffffadf`9bddd3e8 fffffadf`99483c78 fffffadf`9bddd3c8 : ibbus!async_destroy_cb+0x171 [d:\openib-windows-svn\1177\gen1\trunk\core\al\al_common.c @ 686]
>               fffffadf`90d7bce0 fffffadf`8f521a1d : fffffadf`9c8764e0 fffffadf`9bddd2b0 fffffadf`9bed0040 fffff800`011b5500 : ibbus!__cl_async_proc_worker+0x98 [d:\openib-windows-svn\1177\gen1\trunk\core\complib\cl_async_proc.c @ 153]
>               fffffadf`90d7bd10 fffffadf`8f522108 : 00000000`00000000 fffffadf`9c8764e0 fffffadf`9c8764e0 fffff800`011b5500 : ibbus!__cl_thread_pool_routine+0x4d [d:\openib-windows-svn\1177\gen1\trunk\core\complib\cl_threadpool.c @ 66]
>               fffffadf`90d7bd40 fffff800`0124b972 : 00000000`00000000 fffffadf`9beaf040 fffffadf`9beaf040 fffffadf`9c168bf0 : ibbus!__thread_callback+0x28 [d:\openib-windows-svn\1177\gen1\trunk\core\complib\kernel\cl_thread.c @ 49]
>               fffffadf`90d7bd70 fffff800`010202d6 : fffff800`011b1180 fffffadf`9bed0040 fffff800`011b5500 fffffadf`9c8b81c0 : nt!PspSystemThreadStartup+0x3e
>               fffffadf`90d7bdd0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KxStartSystemThread+0x16
>
>               FOLLOWUP_IP:
>               ibbus!async_destroy_cb+171 [d:\openib-windows-svn\1177\gen1\trunk\core\al\al_common.c @ 686]
>               SYMBOL_STACK_INDEX:  2
>
>               SYMBOL_NAME:  ibbus!async_destroy_cb+171
>
>
>________________________________
>
>                       From: AndInc at aol.com [mailto:AndInc at aol.com]
>                       Sent: Friday, June 27, 2008 2:14 AM
>                       To: sean.hefty at intel.com; Leonid Keller;
>ofw at lists.openfabrics.org
>                       Subject: OFED 1.3/WinOF 1.1/Win2k3R2X64 BSOD
>                       A simple sequential/random IOMeter script of
>small block writes produces a BSOD in this environment.  The trace is
>below; the failure is very repeatable, and the trace shows two similar
>failures.  Any clues about what's causing (a) the error, (b) the
>disconnect, and (c) the BSOD?
>
>                       Thanks,
>
>                       Mike Anderson
>
>                       [15513.043769] local QP operation err (QPN
>0c004a, WQE index 39b8, vendor syndrome 6f, opcode = 5e)
>                       [15513.043777] CQE contents 000c004a 00000000
>00000000 00000000 00000000 00000000 39b86f02 0000005e
>                       [15513.043779] ib_srpt: failed send status= 2
>                       [15513.043783] ib_srpt: failed send status= 5
>                       [15513.043786] ib_srpt: failed send status= 5
>                       [15513.043801] ib_srpt: failed send status= 5
>                       [15513.043851] ib_srpt: failed send status= 5
>                       [15513.043855] ib_srpt: failed send status= 5
>                       [15513.043857] ib_srpt: failed send status= 5
>                       [15513.043860] ib_srpt: failed send status= 5
>                       [15513.043873] ib_srpt: QP event 16 on cm_id=
>ffff8100ba389800 sess_name= 0x0002c9030000a50c0002c9030000a3ec state= 1
>                       [15513.043877] ib_srpt: Schedule
>CM_DISCONNECT_WORK
>                       [15513.043967] ib_srpt: srpt_cm_drep_recv[1636]
>cm_id= ffff8100ba389800
>                       [15513.044220] ib_srpt: srpt_release_channel:
>Release sess= ffff8101c27d3cf0 sess_name=
>0x0002c9030000a50c0002c9030000a3ec active_cmd= 7
>                       [15513.044223] [6160]:
>scst_unregister_session:4639:Unregistering session ffff8101c27d3cf0
>(wait 0)
>                       [15739.551108] ib_srpt: ASYNC event= 10 on
>device= mlx4_0
>                       [15831.623484] ib_srpt: ASYNC event= 17 on
>device= mlx4_0
>                       [15831.624195] ib_srpt: ASYNC event= 11 on
>device= mlx4_0
>                       [15831.624400] ib_srpt: ASYNC event= 11 on
>device= mlx4_0
>                       [15831.636997] ib_srpt: ASYNC event= 9 on
>device= mlx4_0
>                       [15833.127349] ib_srpt: Host login
>i_port_id=0x2c9030000a50c:0x2c9030000a3ec
>t_port_id=0x2c9030000a50c:0x2c9030000a50c it_iu_len=996
>                       [15833.128607] ib_srpt: srpt_create_ch_ib[1228]
>max_cqe= 4095 max_sge= 29 cm_id= ffff8101b38b0a00
>                       [15833.128927] [6823]: scst:
>scst_init_session:4509:Using security group "Default" for initiator
>"0x0002c9030000a50c0002c9030000a3ec"
>                       [15833.128938] [6823]:
>scst_init_session:4512:Assigning session ffff810100467c30 to acg Default
>                       [15833.128951] [6823]:
>scst_alloc_add_tgt_dev:405:host=9, channel=0, id=0, lun=0, SCST lun=0
>                       [15833.128958] [6823]:
>scst_alloc_set_UA:2486:Adding new UA to tgt_dev ffff8101c953de60
>                       [15833.128980] ib_srpt: Establish connection
>sess= ffff810100467c30 name= 0x0002c9030000a50c0002c9030000a3ec cm_id=
>ffff8101b38b0a00
>                       [15833.132787] [6818]: scst:
>scst_set_pending_UA:2420:Setting pending UA cmd ffff810100ba66d0
>                       [15841.612022] ib_srpt: ASYNC event= 11 on
>device= mlx4_0
>                       [16046.074918] igb: eth1: igb_watchdog_task: NIC
>Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
>                       [16056.648672] eth1: no IPv6 routers present
>                       [17209.196025] local QP operation err (QPN
>0e004a, WQE index 3d40, vendor syndrome 6f, opcode = 5e)
>                       [17209.196032] CQE contents 000e004a 00000000
>00000000 00000000 00000000 00000000 3d406f02 000000de
>                       [17209.196033] ib_srpt: failed send status= 2
>                       [17209.196037] ib_srpt: failed send status= 5
>                       [17209.196040] ib_srpt: failed send status= 5
>                       [17209.196044] ib_srpt: failed send status= 5
>                       [17209.196069] ib_srpt: QP event 16 on cm_id=
>ffff8101b38b0a00 sess_name= 0x0002c9030000a50c0002c9030000a3ec state= 1
>                       [17209.196074] ib_srpt: Schedule
>CM_DISCONNECT_WORK
>                       [17209.196078] ib_srpt: srpt_xmit_response[1960]
>tag= 10296991 channel in bad state 2
>                       [17209.196083] ib_srpt: failed send status= 5
>                       [17209.196089] [6820]: scst:
>scst_xmit_response:2590:***ERROR*** Target driver ib_srpt
>xmit_response() returned fatal error
>                       [17209.196099] ib_srpt: srpt_xmit_response[1960]
>tag= 10296992 channel in bad state 2
>                       [17209.196104] [6819]: scst:
>scst_xmit_response:2590:***ERROR*** Target driver ib_srpt
>xmit_response() returned fatal error
>                       [17209.196157] ib_srpt: srpt_xmit_response[1960]
>tag= 10296993 channel in bad state 2
>                       [17209.196160] [6817]: scst:
>scst_xmit_response:2590:***ERROR*** Target driver ib_srpt
>xmit_response() returned fatal error
>                       [17209.196173] ib_srpt: srpt_cm_drep_recv[1636]
>cm_id= ffff8101b38b0a00
>                       [17209.196179] ib_srpt: srpt_xmit_response[1960]
>tag= 10296994 channel in bad state 2
>                       [17209.196182] [6814]: scst:
>scst_xmit_response:2590:***ERROR*** Target driver ib_srpt
>xmit_response() returned fatal error
>                       [17209.196265] ib_srpt: srpt_xmit_response[1960]
>tag= 10296995 channel in bad state 2
>                       [17209.196269] [6818]: scst:
>scst_xmit_response:2590:***ERROR*** Target driver ib_srpt
>xmit_response() returned fatal error
>                       [17209.196277] ib_srpt: srpt_xmit_response[1960]
>tag= 10296996 channel in bad state 2
>                       [17209.196278] [6818]: scst:
>scst_xmit_response:2590:***ERROR*** Target driver ib_srpt
>xmit_response() returned fatal error
>                       [17209.196308] ib_srpt: srpt_xmit_response[1960]
>tag= 10296997 channel in bad state 2
>                       [17209.196309] [6815]: scst:
>scst_xmit_response:2590:***ERROR*** Target driver ib_srpt
>xmit_response() returned fatal error
>                       [17209.197269] ib_srpt: srpt_release_channel:
>Release sess= ffff810100467c30 sess_name=
>0x0002c9030000a50c0002c9030000a3ec active_cmd= 3
>                       [17209.197272] [6158]:
>scst_unregister_session:4639:Unregistering session ffff810100467c30
>(wait 0)
>                       linux-gen24:~ #
>
>
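
For readers less familiar with bugcheck 0xC2: the Arg1 value of 7 in the !analyze output quoted above means the same pool block was passed to the free routine twice. Below is a minimal user-space sketch in C of one common way to avoid that when two teardown paths (for example a disconnect path and an asynchronous destroy callback) both release the same object. It is only an illustration under stated assumptions: `obj_t`, `obj_create`, `obj_ref`, `obj_deref` and `teardown_demo` are hypothetical names, not IBAL or SRP code, and this is not a description of what patch 1223 or the pending IBAL fix actually does.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted object; NOT the IBAL object type. */
typedef struct obj {
    int ref_cnt;   /* number of holders that may still touch the object */
} obj_t;

/* Instrumentation only: counts how many times the memory was freed. */
static int g_frees;

static obj_t *obj_create(void)
{
    obj_t *o = calloc(1, sizeof(*o));
    if (o != NULL)
        o->ref_cnt = 1;   /* the creator holds the first reference */
    return o;
}

/* Each additional path that will later release the object takes a ref. */
static void obj_ref(obj_t *o)
{
    o->ref_cnt++;
}

/*
 * Drop one reference. Only the call that brings the count to zero frees
 * the memory, so two release paths cannot free the same block twice.
 * Not thread-safe as written: real kernel code would need interlocked
 * operations or a lock around the count.
 */
static void obj_deref(obj_t *o)
{
    assert(o->ref_cnt > 0);
    if (--o->ref_cnt == 0) {
        g_frees++;
        free(o);
    }
}

/* Simulates two teardown paths releasing one object; returns free count. */
int teardown_demo(void)
{
    obj_t *o;

    g_frees = 0;
    o = obj_create();
    if (o == NULL)
        return -1;

    obj_ref(o);     /* the async destroy path takes its own reference */
    obj_deref(o);   /* disconnect path releases: count 2 -> 1, no free */
    obj_deref(o);   /* destroy callback releases: count 1 -> 0, frees */
    return g_frees;
}
```

Calling `teardown_demo()` returns the number of times the object was actually freed. Without the extra `obj_ref()`, the second `obj_deref()` would operate on memory that was already released, which is the user-space analogue of the double free reported in the dump.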


