[ofw] RE: OFED 1.3/WinOF 1.1/Win2k3R2X64 BSOD
Fab Tillier
ftillier at windows.microsoft.com
Mon Jun 30 10:13:16 PDT 2008
Yes, the fix is in my list of patches. I need to see how things are shaping up. I have significant changes to the NetworkDirect connection support (al_ndi_cm.c), so it's a bit challenging to break things out without duplicating a lot of work.
I'll probably break things out so that they're digestible. I'm hoping to be done sending my changes by midweek.
-Fab
>From: ofw-bounces at lists.openfabrics.org
>[mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Leonid Keller
>Sent: Monday, June 30, 2008 9:19 AM
>To: Eleanor Witiak; Fab Tillier
>Cc: ofw at lists.openfabrics.org; AndInc at aol.com
>Subject: [ofw] RE: OFED 1.3/WinOF 1.1/Win2k3R2X64 BSOD
>
>This fix, I believe, is part of the large patch that Fab is adding
>now part by part.
>Fab, is that right, and when do you estimate this part will come
>to the trunk?
>
>
>________________________________
>
> From: Eleanor Witiak [mailto:eleanor.witiak at qlogic.com]
> Sent: Monday, June 30, 2008 6:07 PM
> To: Leonid Keller
> Cc: AndInc at aol.com; sean.hefty at intel.com;
>ofw at lists.openfabrics.org
> Subject: RE: OFED 1.3/WinOF 1.1/Win2k3R2X64 BSOD
> PR 1029, which patch 1223 fixed, did hit a "Bad Pool Caller" BSOD,
>the same as the crash below. Part of the stack trace below is also
>similar to what I got; however, Mike's crash does not have SRP on the
>stack as mine did. Mike, can you try your test again with my patch?
>
> Leonid: Also, while working on PR 1029, I ran into an IBAL
>problem that I sent to you. I have attached our mail correspondence. I
>have created a temp patch in IBAL (without my patch 1223) just to see if
>it also fixed my "Bad Pool Caller" BSOD and it did. In addition, I have
>also run with the same temp IBAL patch and it also got rid of the BSOD
>while trying to reproduce PR 1037. I think that Mike's crash might be
>running into this problem. Is your patch ready? If so, I would love to
>test with it.
>
> Thanks,
> Eleanor
>
>
>________________________________
>
> From: Leonid Keller [mailto:leonid at mellanox.co.il]
> Sent: Monday, June 30, 2008 10:18 AM
> To: Eleanor Witiak
> Cc: AndInc at aol.com; sean.hefty at intel.com;
>ofw at lists.openfabrics.org
> Subject: RE: OFED 1.3/WinOF 1.1/Win2k3R2X64 BSOD
>
> Thanks, but I meant to ask whether this crash looks like the
>one you solved in 1223?
>
>
>________________________________
>
> From: Eleanor Witiak [mailto:eleanor.witiak at qlogic.com]
> Sent: Monday, June 30, 2008 4:40 PM
> To: Leonid Keller; AndInc at aol.com; sean.hefty at intel.com;
>ofw at lists.openfabrics.org
> Subject: RE: OFED 1.3/WinOF 1.1/Win2k3R2X64 BSOD
> Yes, the patch did come after the 1.1 release. The
>patch revision # is 1223; the affected files are srp_connection.c and
>srp_session.c.
>
> Eleanor
>
>
>________________________________
>
> From: Leonid Keller [mailto:leonid at mellanox.co.il]
> Sent: Monday, June 30, 2008 4:34 AM
> To: AndInc at aol.com; sean.hefty at intel.com;
>ofw at lists.openfabrics.org; Eleanor Witiak
> Subject: RE: OFED 1.3/WinOF 1.1/Win2k3R2X64 BSOD
>
> a) don't know;
> b) may be caused by a);
> c) may be caused by b).
>
> A very important patch from Eleanor (WinOF 1223),
>which prevents a BSOD upon sudden srpt disconnection, came in after the
>release was closed.
> Eleanor, could you check whether that is the case here?
>
> Here is some more information, based on the sent
>minidumps:
>
> 1: kd> !analyze -v
> BAD_POOL_CALLER (c2)
> The current thread is making a bad pool request.
>Typically this is at a bad IRQL level or double freeing the same
>allocation, etc.
> Arguments:
> Arg1: 0000000000000007, Attempt to free pool which was
>already freed
> Arg2: 000000000000121a, (reserved)
> Arg3: 00000000012b0011, Memory contents of the pool
>block
> Arg4: fffffadf99483c50, Address of the block of pool
>being deallocated
>
> Debugging Details:
> ------------------
>
>
> POOL_ADDRESS: fffffadf99483c50
>
> FREED_POOL_TAG: priv
>
> BUGCHECK_STR: 0xc2_7_priv
>
> CUSTOMER_CRASH_COUNT: 1
>
> DEFAULT_BUCKET_ID: DRIVER_FAULT_SERVER_MINIDUMP
>
> PROCESS_NAME: System
>
> CURRENT_IRQL: 0
>
> LAST_CONTROL_TRANSFER: from fffff800011aa769 to
>fffff8000102e950
>
> STACK_TEXT:
> fffffadf`90d7bbc8 fffff800`011aa769 : 00000000`000000c2
>00000000`00000007 00000000`0000121a 00000000`012b0011 : nt!KeBugCheckEx
> fffffadf`90d7bbd0 fffffadf`8f554621 : fffffadf`99483c50
>00000000`00000080 fffffadf`99483c50 00000000`00000080 :
>nt!ExFreePoolWithTag+0x401
> fffffadf`90d7bc90 fffffadf`8f51f568 : fffffadf`9c813c00
>fffffadf`9bddd3e8 fffffadf`99483c78 fffffadf`9bddd3c8 :
>ibbus!async_destroy_cb+0x171
>[d:\openib-windows-svn\1177\gen1\trunk\core\al\al_common.c @ 686]
> fffffadf`90d7bce0 fffffadf`8f521a1d : fffffadf`9c8764e0
>fffffadf`9bddd2b0 fffffadf`9bed0040 fffff800`011b5500 :
>ibbus!__cl_async_proc_worker+0x98
>[d:\openib-windows-svn\1177\gen1\trunk\core\complib\cl_async_proc.c @
>153]
> fffffadf`90d7bd10 fffffadf`8f522108 : 00000000`00000000
>fffffadf`9c8764e0 fffffadf`9c8764e0 fffff800`011b5500 :
>ibbus!__cl_thread_pool_routine+0x4d
>[d:\openib-windows-svn\1177\gen1\trunk\core\complib\cl_threadpool.c @
>66]
> fffffadf`90d7bd40 fffff800`0124b972 : 00000000`00000000
>fffffadf`9beaf040 fffffadf`9beaf040 fffffadf`9c168bf0 :
>ibbus!__thread_callback+0x28
>[d:\openib-windows-svn\1177\gen1\trunk\core\complib\kernel\cl_thread.c @
>49]
> fffffadf`90d7bd70 fffff800`010202d6 : fffff800`011b1180
>fffffadf`9bed0040 fffff800`011b5500 fffffadf`9c8b81c0 :
>nt!PspSystemThreadStartup+0x3e
> fffffadf`90d7bdd0 00000000`00000000 : 00000000`00000000
>00000000`00000000 00000000`00000000 00000000`00000000 :
>nt!KxStartSystemThread+0x16
>
> FOLLOWUP_IP:
> ibbus!async_destroy_cb+171
>[d:\openib-windows-svn\1177\gen1\trunk\core\al\al_common.c @ 686]
> SYMBOL_STACK_INDEX: 2
>
> SYMBOL_NAME: ibbus!async_destroy_cb+171
>
>
>________________________________
>
> From: AndInc at aol.com [mailto:AndInc at aol.com]
> Sent: Friday, June 27, 2008 2:14 AM
> To: sean.hefty at intel.com; Leonid Keller;
>ofw at lists.openfabrics.org
> Subject: OFED 1.3/WinOF 1.1/Win2k3R2X64 BSOD
> A simple sequential/random IOMeter script of
>small block writes produces a BSOD in this environment. The trace is below;
>the failure is very repeatable, and the trace shows two similar failures.
>Any clues about what's causing (a) the error, (b) the disconnect, and
>(c) the BSOD?
>
> Thanks,
>
> Mike Anderson
>
> [15513.043769] local QP operation err (QPN
>0c004a, WQE index 39b8, vendor syndrome 6f, opcode = 5e)
> [15513.043777] CQE contents 000c004a 00000000
>00000000 00000000 00000000 00000000 39b86f02 0000005e
> [15513.043779] ib_srpt: failed send status= 2
> [15513.043783] ib_srpt: failed send status= 5
> [15513.043786] ib_srpt: failed send status= 5
> [15513.043801] ib_srpt: failed send status= 5
> [15513.043851] ib_srpt: failed send status= 5
> [15513.043855] ib_srpt: failed send status= 5
> [15513.043857] ib_srpt: failed send status= 5
> [15513.043860] ib_srpt: failed send status= 5
> [15513.043873] ib_srpt: QP event 16 on cm_id=
>ffff8100ba389800 sess_name= 0x0002c9030000a50c0002c9030000a3ec state= 1
> [15513.043877] ib_srpt: Schedule
>CM_DISCONNECT_WORK
> [15513.043967] ib_srpt: srpt_cm_drep_recv[1636]
>cm_id= ffff8100ba389800
> [15513.044220] ib_srpt: srpt_release_channel:
>Release sess= ffff8101c27d3cf0 sess_name=
>0x0002c9030000a50c0002c9030000a3ec active_cmd= 7
> [15513.044223] [6160]:
>scst_unregister_session:4639:Unregistering session ffff8101c27d3cf0
>(wait 0)
> [15739.551108] ib_srpt: ASYNC event= 10 on
>device= mlx4_0
> [15831.623484] ib_srpt: ASYNC event= 17 on
>device= mlx4_0
> [15831.624195] ib_srpt: ASYNC event= 11 on
>device= mlx4_0
> [15831.624400] ib_srpt: ASYNC event= 11 on
>device= mlx4_0
> [15831.636997] ib_srpt: ASYNC event= 9 on
>device= mlx4_0
> [15833.127349] ib_srpt: Host login
>i_port_id=0x2c9030000a50c:0x2c9030000a3ec
>t_port_id=0x2c9030000a50c:0x2c9030000a50c it_iu_len=996
> [15833.128607] ib_srpt: srpt_create_ch_ib[1228]
>max_cqe= 4095 max_sge= 29 cm_id= ffff8101b38b0a00
> [15833.128927] [6823]: scst:
>scst_init_session:4509:Using security group "Default" for initiator
>"0x0002c9030000a50c0002c9030000a3ec"
> [15833.128938] [6823]:
>scst_init_session:4512:Assigning session ffff810100467c30 to acg Default
> [15833.128951] [6823]:
>scst_alloc_add_tgt_dev:405:host=9, channel=0, id=0, lun=0, SCST lun=0
> [15833.128958] [6823]:
>scst_alloc_set_UA:2486:Adding new UA to tgt_dev ffff8101c953de60
> [15833.128980] ib_srpt: Establish connection
>sess= ffff810100467c30 name= 0x0002c9030000a50c0002c9030000a3ec cm_id=
>ffff8101b38b0a00
> [15833.132787] [6818]: scst:
>scst_set_pending_UA:2420:Setting pending UA cmd ffff810100ba66d0
> [15841.612022] ib_srpt: ASYNC event= 11 on
>device= mlx4_0
> [16046.074918] igb: eth1: igb_watchdog_task: NIC
>Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
> [16056.648672] eth1: no IPv6 routers present
> [17209.196025] local QP operation err (QPN
>0e004a, WQE index 3d40, vendor syndrome 6f, opcode = 5e)
> [17209.196032] CQE contents 000e004a 00000000
>00000000 00000000 00000000 00000000 3d406f02 000000de
> [17209.196033] ib_srpt: failed send status= 2
> [17209.196037] ib_srpt: failed send status= 5
> [17209.196040] ib_srpt: failed send status= 5
> [17209.196044] ib_srpt: failed send status= 5
> [17209.196069] ib_srpt: QP event 16 on cm_id=
>ffff8101b38b0a00 sess_name= 0x0002c9030000a50c0002c9030000a3ec state= 1
> [17209.196074] ib_srpt: Schedule
>CM_DISCONNECT_WORK
> [17209.196078] ib_srpt: srpt_xmit_response[1960]
>tag= 10296991 channel in bad state 2
> [17209.196083] ib_srpt: failed send status= 5
> [17209.196089] [6820]: scst:
>scst_xmit_response:2590:***ERROR*** Target driver ib_srpt
>xmit_response() returned fatal error
> [17209.196099] ib_srpt: srpt_xmit_response[1960]
>tag= 10296992 channel in bad state 2
> [17209.196104] [6819]: scst:
>scst_xmit_response:2590:***ERROR*** Target driver ib_srpt
>xmit_response() returned fatal error
> [17209.196157] ib_srpt: srpt_xmit_response[1960]
>tag= 10296993 channel in bad state 2
> [17209.196160] [6817]: scst:
>scst_xmit_response:2590:***ERROR*** Target driver ib_srpt
>xmit_response() returned fatal error
> [17209.196173] ib_srpt: srpt_cm_drep_recv[1636]
>cm_id= ffff8101b38b0a00
> [17209.196179] ib_srpt: srpt_xmit_response[1960]
>tag= 10296994 channel in bad state 2
> [17209.196182] [6814]: scst:
>scst_xmit_response:2590:***ERROR*** Target driver ib_srpt
>xmit_response() returned fatal error
> [17209.196265] ib_srpt: srpt_xmit_response[1960]
>tag= 10296995 channel in bad state 2
> [17209.196269] [6818]: scst:
>scst_xmit_response:2590:***ERROR*** Target driver ib_srpt
>xmit_response() returned fatal error
> [17209.196277] ib_srpt: srpt_xmit_response[1960]
>tag= 10296996 channel in bad state 2
> [17209.196278] [6818]: scst:
>scst_xmit_response:2590:***ERROR*** Target driver ib_srpt
>xmit_response() returned fatal error
> [17209.196308] ib_srpt: srpt_xmit_response[1960]
>tag= 10296997 channel in bad state 2
> [17209.196309] [6815]: scst:
>scst_xmit_response:2590:***ERROR*** Target driver ib_srpt
>xmit_response() returned fatal error
> [17209.197269] ib_srpt: srpt_release_channel:
>Release sess= ffff810100467c30 sess_name=
>0x0002c9030000a50c0002c9030000a3ec active_cmd= 3
> [17209.197272] [6158]:
>scst_unregister_session:4639:Unregistering session ffff810100467c30
>(wait 0)
> linux-gen24:~ #
>
>