[ofw] crash in mlx4 driver
Leonid Keller
leonid at mellanox.co.il
Sun Mar 15 08:44:09 PDT 2009
It looks like ibv_um_open() failed, but I didn't see any error messages
in the report. I'd suggest fixing the return code test after the call,
rebuilding all the drivers to be sure they use consistent structures,
and trying once more with checked builds of all drivers so that the
error messages show up.
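
One plausible reading of the return code problem, sketched below: in the
quoted source, status is an ib_api_status_t but is tested with NT_SUCCESS(),
which only rejects negative values. If the IB error codes are positive enum
values, a failed ibv_um_open() passes the test and the code falls through to
"done:" with p_uctx never assigned. The enum values, fake_um_open() and the
two helper functions are stand-ins invented for this sketch. Only the
NT_SUCCESS definition mirrors the WDK macro. That ibv_um_open() leaves the
output pointer untouched on failure is also an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so the sketch builds outside the WDK. */
typedef int32_t NTSTATUS;
#define NT_SUCCESS(s) (((NTSTATUS)(s)) >= 0)   /* same test as the WDK macro */

/* Illustrative values only, not copied from the IBAL headers. */
typedef enum {
    IB_SUCCESS = 0,
    IB_INSUFFICIENT_MEMORY = 2   /* assumed positive, like the other error codes */
} ib_api_status_t;

struct ucontext { int usecnt; };

/* Hypothetical stand-in for ibv_um_open(): on failure it returns an
 * error code and does not write *pp_uctx. */
static ib_api_status_t fake_um_open(int fail, struct ucontext *storage,
                                    struct ucontext **pp_uctx)
{
    if (fail)
        return IB_INSUFFICIENT_MEMORY;
    *pp_uctx = storage;
    return IB_SUCCESS;
}

/* Returns 1 if the caller would go on to use p_uctx, 0 if it bails out.
 * Uses the test from the quoted mlnx_um_open(). */
static int reaches_done_with_nt_test(int fail)
{
    struct ucontext storage = { 0 };
    struct ucontext *p_uctx = NULL;   /* uninitialized in the quoted code */
    ib_api_status_t status = fake_um_open(fail, &storage, &p_uctx);

    if (!NT_SUCCESS(status))          /* positive error codes slip through */
        return 0;
    return 1;
}

/* Same flow, but comparing against IB_SUCCESS directly. */
static int reaches_done_with_ib_test(int fail)
{
    struct ucontext storage = { 0 };
    struct ucontext *p_uctx = NULL;
    ib_api_status_t status = fake_um_open(fail, &storage, &p_uctx);

    if (status != IB_SUCCESS)
        return 0;
    return 1;
}
```

With the NT_SUCCESS() test a failing open still reaches the code that
dereferences p_uctx. With the IB_SUCCESS comparison it exits early. This
would be consistent with the bugcheck showing a non-NULL but invalid p_uctx,
though it does not prove that this is what happened here.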
> -----Original Message-----
> From: ofw-bounces at lists.openfabrics.org
> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Sean Hefty
> Sent: Friday, March 13, 2009 12:16 AM
> To: ofw at lists.openfabrics.org
> Subject: [ofw] crash in mlx4 driver
>
> I hit the following crash in mlx4_hca. (SVN version was
> updated last night, so it's recent.) My guess is that
> winverbs may be doing something wrong, but here's the stack
> trace and corresponding source code in the mlx4 driver:
>
> DEFAULT_BUCKET_ID: DRIVER_FAULT
>
> BUGCHECK_STR: 0xBE
>
> PROCESS_NAME: dtest2d.exe
>
> CURRENT_IRQL: f
>
> TRAP_FRAME: fffffadf8e293370 -- (.trap 0xfffffadf8e293370)
> NOTE: The trap frame does not contain all registers.
> Some register values may be zeroed or incorrect.
> rax=fffffadf8a58ac82 rbx=fffffadf8e2934a0 rcx=0000000000000000
> rdx=0000000000000000 rsi=0000000000000008 rdi=0000000000000000
> rip=fffffadf8a6de8c2 rsp=fffffadf8e293508 rbp=0000000000000008
>  r8=0000000000000050  r9=0000000000000000 r10=0000000000000000
> r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
> r14=0000000000000000 r15=0000000000000000
> iopl=0 nv up ei pl zr na po nc
> mlx4_hca!atomic_set+0x12:
> fffffadf`8a6de8c2 8908 mov dword ptr [rax],ecx ds:0008:fffffadf`8a58ac82=89481024
> Resetting default scope
>
> LAST_CONTROL_TRANSFER: from fffff8000107984c to fffff80001026cf0
>
> STACK_TEXT:
> fffffadf`8e292a18 fffff800`0107984c : 0000fadf`8c671f62 00000000`0000edde 00000000`00000000 00000000`00000000 : nt!DbgBreakPointWithStatus
> fffffadf`8e292a20 fffff800`010c517e : 00000000`0ee40000 00000000`dffe0000 00000000`0ee40000 fffffadf`9a8e1840 : nt!KdCheckForDebugBreak+0xb5
> fffffadf`8e292a60 fffff800`010d89eb : fffffadf`8a58ac00 fffffadf`8e293370 00000000`00000001 00000000`000000be : nt!IoWriteCrashDump+0x851
> fffffadf`8e292c20 fffff800`0102e994 : fffffadf`8e293360 00000000`00000000 00000001`00000000 00000000`00000001 : nt!KeBugCheck2+0xb83
> fffffadf`8e293260 fffff800`010a5c05 : 00000000`000000be fffffadf`8a58ac82 00000000`c4e84121 fffffadf`8e293370 : nt!KeBugCheckEx+0x104
> fffffadf`8e2932a0 fffff800`0102d459 : 00000000`00000001 fffffadf`901ef1cc fffffadf`8e293800 fffffa80`00100660 : nt!MmAccessFault+0x503
> fffffadf`8e293370 fffffadf`8a6de8c2 : fffffadf`8a6deb4c fffffadf`8a58ac82 fffffadf`00000000 fffffadf`8e293538 : nt!KiPageFault+0x119
> fffffadf`8e293508 fffffadf`8a6deb4c : fffffadf`8a58ac82 fffffadf`00000000 fffffadf`8e293538 fffff800`01288000 : mlx4_hca!atomic_set+0x12
>     [c:\mshefty\scm\winof\branches\winverbs\hw\mlx4\kernel\inc\l2w_atomic.h @ 17]
> fffffadf`8e293510 fffffadf`8a58cf34 : fffffadf`9a77cd00 fffffadf`8e293610 fffffadf`9aa9b118 fffffadf`97ea5910 : mlx4_hca!mlnx_um_open+0x27c
>     [c:\mshefty\scm\winof\branches\winverbs\hw\mlx4\kernel\hca\vp.c @ 99]
> fffffadf`8e293590 fffffadf`8a58d0ea : fffffadf`9aa9b0f0 20a60000`03c90200 fffffadf`8e293610 00000000`00000038 : winverbs!WvDeviceInit+0x84
>     [c:\mshefty\scm\winof\branches\winverbs\core\winverbs\kernel\wv_device.c @ 249]
> fffffadf`8e2935e0 fffffadf`8a58b0b5 : fffffadf`9a2d90d0 00000520`6815a6e8 00000000`00000000 fffffadf`97ea5910 : winverbs!WvDeviceOpen+0x11a
>     [c:\mshefty\scm\winof\branches\winverbs\core\winverbs\kernel\wv_device.c @ 299]
> fffffadf`8e293670 fffffadf`9051e0b9 : 00000520`66e88398 00000520`6815a6e8 00000000`00000048 00000000`00000010 : winverbs!WvIoDeviceControl+0xa5
>     [c:\mshefty\scm\winof\branches\winverbs\core\winverbs\kernel\wv_driver.c @ 224]
> fffffadf`8e2936c0 fffffadf`9051d59e : 00000520`6815a6e8 00000520`6815a6e8 fffffadf`9991ba90 fffffadf`97ea5910 : wdf01000!FxIoQueue::DispatchRequestToDriver+0x6d9
> fffffadf`8e293760 fffffadf`9051c8b6 : fffffadf`99177c60 00000000`00000000 fffffadf`99177c00 fffffadf`97f00021 : wdf01000!FxIoQueue::DispatchEvents+0x83e
> fffffadf`8e2938c0 fffffadf`90523998 : fffffadf`98791c00 fffffadf`98791cb0 00000520`66e88398 00000520`6815a6e8 : wdf01000!FxIoQueue::QueueRequest+0x4a6
> fffffadf`8e293970 fffffadf`90507865 : fffffadf`96588d0e fffffadf`97ea5910 fffffadf`98791cb0 fffffadf`9a6a9480 : wdf01000!FxPkgIo::Dispatch+0x718
> fffffadf`8e293a40 fffff800`0127f111 : 00000000`00000010 fffffadf`8e293cf0 00000000`00000000 fffffadf`9a65bf40 : wdf01000!FxDevice::Dispatch+0xa9
> fffffadf`8e293a70 fffff800`0127ec16 : 00000000`00000000 00000000`00000341 00000000`00000000 00000000`00000000 : nt!IopXxxControlFile+0xa79
> fffffadf`8e293b90 fffff800`0102e33d : 00000000`00000000 fffffadf`8e293c40 fffffadf`00000000 00000000`00000000 : nt!NtDeviceIoControlFile+0x56
> fffffadf`8e293c00 00000000`77ef0a5a : 00000000`77d5effa 00000000`00000000 00000000`00000000 00000000`000af1e0 : nt!KiSystemServiceCopyEnd+0x3
> 00000000`000af028 00000000`77d5effa : 00000000`00000000 00000000`00000000 00000000`000af1e0 00000000`000c02c8 : ntdll!NtDeviceIoControlFile+0xa
> 00000000`000af030 00000000`00492548 : 00000000`000ce300 00000000`00000000 00000001`00001290 00000000`01d15eec : kernel32!DeviceIoControl+0x163
> 00000000`000af210 00000000`00493943 : 00000000`000ccae8 00000000`0000034c 00000000`003be00c 00000000`000af310 : winverbsd!CWVBase::WvDeviceIoControl+0x88
>     [c:\mshefty\scm\winof\branches\winverbs\core\winverbs\user\wv_base.cpp @ 95]
> 00000000`000af270 00000000`00497cc6 : 00000000`000ccae0 20a60000`03c90200 00000000`00000000 00000000`00000000 : winverbsd!CWVDevice::Open+0x363
>     [c:\mshefty\scm\winof\branches\winverbs\core\winverbs\user\wv_device.cpp @ 105]
> 00000000`000af400 00000000`00498457 : 00000000`000cc2a0 20a60000`03c90200 00000000`000af4a8 00000001`00001290 : winverbsd!CWVDevice::CreateInstance+0x96
>     [c:\mshefty\scm\winof\branches\winverbs\core\winverbs\user\wv_device.h @ 85]
> 00000000`000af450 00000000`0049824d : 00000000`000cc2a0 20a60000`03c90200 00000000`000af4a8 00002b99`00000000 : winverbsd!CWVProvider::OpenDevice+0x27
>     [c:\mshefty\scm\winof\branches\winverbs\core\winverbs\user\wv_provider.cpp @ 181]
> 00000000`000af480 00000000`003f4b16 : 00000000`000cc2a0 20a60000`03c90200 00000000`000af4f0 00000000`00000000 : winverbsd!CWVProvider::QueryDevice+0x2d
>     [c:\mshefty\scm\winof\branches\winverbs\core\winverbs\user\wv_provider.cpp @ 125]
> 00000000`000af4c0 00000000`003db693 : 00000000`00000000 00000000`003c4898 00000000`002abfe0 00000000`002e98e0 : libibverbsd!ibv_get_device_list+0x266
>     [c:\mshefty\scm\winof\branches\winverbs\ulp\libibverbs\src\device.cpp @ 140]
> 00000000`000af640 00000000`003cf1f2 : 00000000`002abfe0 00000000`002e98e0 00000001`0000b420 00000000`00000008 : dapl2_ofa_scmd!dapls_ib_open_hca+0x43
>     [c:\mshefty\scm\winof\branches\winverbs\ulp\dapl2\dapl\openib_scm\dapl_ib_util.c @ 255]
> 00000000`000af6b0 00000000`00402dde : 00000001`0000b420 00000000`00000008 00000001`0000bc50 00000001`0000bc28 : dapl2_ofa_scmd!dapl_ia_open+0x112
>     [c:\mshefty\scm\winof\branches\winverbs\ulp\dapl2\dapl\common\dapl_ia_open.c @ 146]
> 00000000`000af720 00000001`00003bb9 : 00000001`0000b420 00000000`00000008 00000001`0000bc50 00000001`0000bc28 : dat2d!dat_ia_openv+0x17e
>     [c:\mshefty\scm\winof\branches\winverbs\ulp\dapl2\dat\udat\udat.c @ 234]
> 00000000`000afb60 00000001`00009789 : 00000000`00000003 00000000`002aaea0 00000000`00000000 00000000`00000001 : dtest2d!main+0x4b9
>     [c:\mshefty\scm\winof\branches\winverbs\ulp\dapl2\test\dtest\dtest.c @ 342]
> 00000000`000aff40 00000000`77d5964c : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`000affa8 : dtest2d!__mainCRTStartup+0x13d
>     [d:\longhorn_rc0\base\crts\crtw32\dllstuff\crtexe.c @ 716]
> 00000000`000aff80 00000000`00000000 : 00000001`000098f4 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseProcessStart+0x29
>
>
> STACK_COMMAND: kb
>
> FOLLOWUP_IP:
> mlx4_hca!atomic_set+12
> [c:\mshefty\scm\winof\branches\winverbs\hw\mlx4\kernel\inc\l2w_atomic.h @ 17]
> fffffadf`8a6de8c2 8908 mov dword ptr [rax],ecx
>
> FAULTING_SOURCE_CODE:
> 13: }
> 14:
> 15: static inline void atomic_set(atomic_t *pval, long val)
> 16: {
> > 17: *pval = (__int32)val;
> 18: }
> 19:
> 20: /**
> 21: * atomic_inc_and_test - decrement and test
> 22: * pval: pointer of type atomic_t
>
>
> SYMBOL_STACK_INDEX: 7
>
> SYMBOL_NAME: mlx4_hca!atomic_set+12
>
> FOLLOWUP_NAME: MachineOwner
>
> MODULE_NAME: mlx4_hca
>
> IMAGE_NAME: mlx4_hca.sys
>
> DEBUG_FLR_IMAGE_TIMESTAMP: 49ad96d9
>
> FAILURE_BUCKET_ID: X64_0xBE_mlx4_hca!atomic_set+12
>
> BUCKET_ID: X64_0xBE_mlx4_hca!atomic_set+12
>
> Followup: MachineOwner
> ---------
>
> **** source ****
> note: p_umv_buf->command == 1
>
> static ib_api_status_t
> mlnx_um_open(
>     IN      const ib_ca_handle_t    h_ca,
>     IN OUT  ci_umv_buf_t* const     p_umv_buf,
>     OUT     ib_ca_handle_t* const   ph_um_ca
>     )
> {
>     ib_api_status_t status;
>     mlnx_hca_t *p_hca = (mlnx_hca_t *)h_ca;
>     PFDO_DEVICE_DATA p_fdo = hca2fdo(p_hca);
>     struct ib_device *p_ibdev = hca2ibdev(p_hca);
>     struct ib_ucontext *p_uctx;
>     struct ibv_get_context_resp *p_uresp;
>
>     HCA_ENTER(HCA_DBG_SHIM);
>
>     // sanity check
>     ASSERT( p_umv_buf );
>     if( !p_umv_buf->command )
>     {   // no User Verb Provider
>         p_uctx = cl_zalloc( sizeof(struct ib_ucontext) );
>         if( !p_uctx )
>         {
>             status = IB_INSUFFICIENT_MEMORY;
>             goto err_alloc_ucontext;
>         }
>         /* Copy the dev info. */
>         p_uctx->device = p_ibdev;
>         p_umv_buf->output_size = 0;
>         status = IB_SUCCESS;
>         goto done;
>     }
>
>     // sanity check
>     if ( p_umv_buf->output_size < sizeof(struct ibv_get_context_resp) ||
>         !p_umv_buf->p_inout_buf) {
>         status = IB_INVALID_PARAMETER;
>         goto err_inval_params;
>     }
>
>     status = ibv_um_open( p_ibdev, p_umv_buf, &p_uctx );
>     if (!NT_SUCCESS(status)) {
>         goto end;
>     }
>
>     // fill more parameters for user (sanity checks are in mthca_alloc_ucontext)
>     p_uresp = (struct ibv_get_context_resp *)(ULONG_PTR)p_umv_buf->p_inout_buf;
>     p_uresp->vend_id = (uint32_t)p_fdo->bus_ib_ifc.pdev->ven_id;
>     p_uresp->dev_id = (uint16_t)p_fdo->bus_ib_ifc.pdev->dev_id;
>     p_uresp->max_qp_wr = hca2mdev(p_hca)->caps.max_wqes;
>     p_uresp->max_cqe = hca2mdev(p_hca)->caps.max_cqes;
>     p_uresp->max_sge = min( hca2mdev(p_hca)->caps.max_sq_sg,
>         hca2mdev(p_hca)->caps.max_rq_sg );
>
> done:
>     // fill the rest of ib_ucontext_ex fields
>     atomic_set(&p_uctx->x.usecnt, 0);
>
> ***** crash here ^^^, p_uctx is not NULL, but apparently invalid
>
>     p_uctx->x.va = p_uctx->x.p_mdl = NULL;
>     p_uctx->x.fw_if_open = FALSE;
>     mutex_init( &p_uctx->x.mutex );
>
>     // chain user context to the device
>     spin_lock( &p_fdo->uctx_lock );
>     cl_qlist_insert_tail( &p_fdo->uctx_list, &p_uctx->x.list_item );
>     cl_atomic_inc(&p_fdo->usecnt);
>     spin_unlock( &p_fdo->uctx_lock );
>
>     // return the result
>     if (ph_um_ca) *ph_um_ca = (ib_ca_handle_t)p_uctx;
>
>     status = IB_SUCCESS;
>     goto end;
>
> err_inval_params:
> err_alloc_ucontext:
> end:
>     if (p_umv_buf && p_umv_buf->command)
>         p_umv_buf->status = status;
>     if (status != IB_SUCCESS)
>     {
>         HCA_PRINT(TRACE_LEVEL_ERROR,HCA_DBG_SHIM,
>             ("completes with ERROR status %x\n", status));
>     }
>     HCA_EXIT(HCA_DBG_SHIM);
>     return status;
> }
>
> I'll keep looking into this, but if anyone has any ideas,
> please let me know.
> (This crash occurred after running a bunch of random
> libibverbs/librdmacm/perftest tests, followed by running
> dtest successfully once. The second run of dtest
> generated this crash.)
>
> - Sean
>