[ofw] crash in mlx4 driver - or maybe it's an ipoib issue?
Leonid Keller
leonid at mellanox.co.il
Sun Mar 15 08:58:25 PDT 2009
This looks like a different problem, and it may be a consequence of the
driver's new behavior on an HCA fatal error:
it now resets the card to bring it to a known state.
It seems IPoIB is not ready for that.
A reference count of 0x203 suggests that IPoIB takes a reference
every time it posts a send or a receive WQE.
It intends to release that reference on completion, but the reset card
doesn't produce completions.
So it gets stuck in the __destroy_obj loop, asserting once every 10 seconds
that the ref_cnt is still nonzero.
Tzachi, could you check my "theory"?
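
To make the theory concrete, here is a toy model of it in Python. It is only a sketch under the assumptions stated above (one reference per posted WQE, released on completion); it is not the IPoIB or complib code, and the class and method names (PortObj, post_wqe, reset_hca and so on) are made up for illustration.

```python
class PortObj:
    """Toy stand-in for the IPoIB port object and its reference counter."""

    def __init__(self):
        self.ref_cnt = 0
        self.outstanding = 0   # WQEs posted but not yet completed
        self.hca_alive = True

    def post_wqe(self):
        # Assumption: one reference is taken per posted send/receive WQE.
        self.ref_cnt += 1
        self.outstanding += 1

    def poll_completions(self):
        # A healthy HCA completes every outstanding WQE, and each completion
        # drops the reference taken at post time. A reset HCA reports nothing.
        if not self.hca_alive:
            return 0
        done = self.outstanding
        self.ref_cnt -= done
        self.outstanding = 0
        return done

    def reset_hca(self):
        # Model of the driver resetting the card after a fatal error:
        # the outstanding WQEs are lost and will never complete.
        self.hca_alive = False

    def destroy(self, max_checks=3):
        # Model of the destroy loop: re-check ref_cnt a bounded number of
        # times (the real loop is described as re-asserting every ~10 s).
        for _ in range(max_checks):
            self.poll_completions()
            if self.ref_cnt == 0:
                return True
        return False


# Healthy card: completions arrive, references drain, destroy succeeds.
healthy = PortObj()
for _ in range(4):
    healthy.post_wqe()
healthy_destroyed = healthy.destroy()

# Card reset with WQEs outstanding: no completions, destroy never finishes.
stuck = PortObj()
for _ in range(4):
    stuck.post_wqe()
stuck.reset_hca()
stuck_destroyed = stuck.destroy()
```

In this model the second port keeps one reference per lost WQE forever, which is the same shape as a ref_cnt that stays at 0x203 while the destroy path keeps asserting.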
> -----Original Message-----
> From: ofw-bounces at lists.openfabrics.org
> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Sean Hefty
> Sent: Friday, March 13, 2009 9:38 PM
> To: Hefty, Sean; ofw at lists.openfabrics.org
> Subject: RE: [ofw] crash in mlx4 driver - or maybe it's an
> ipoib issue?
>
> This is either more random details or a completely separate
> problem. This time I'm running with a debugger attached
> during the test.
>
> After running dtest2 using libibverbs and the socket CM, the
> test completes successfully, but then this occurs in the kernel:
>
> CQ overrun on CQN 00008f
>
> Detected catastrophic error on mdev FFFFFADF9C409000
>
> ~1:[MLX4_BUS] pci_get_msi_info() :MSI-X Capability: Enabled - 0, Function Masked 0, Vectors Supported 256, Addr_Offset(BIR) 0(4), Pend_Offset(BIR) 0x1000(4)
> ~1:[MLX4_BUS] pci_get_msi_info() :MSI-X Vectors: Allocated 0 vectors
> ~1:[MLX4_BUS] pci_hca_reset() :Resetting HCA ...
>
> ~1:[MLX4_BUS] pci_get_msi_info() :MSI-X Capability: Enabled - 0, Function Masked 0, Vectors Supported 256, Addr_Offset(BIR) 0(4), Pend_Offset(BIR) 0x1000(4)
> ~1:[MLX4_BUS] pci_get_msi_info() :MSI-X Vectors: Allocated 0 vectors
> ~1:[MLX4_BUS] pci_hca_reset() :HCA has been reset !
> Internal error detected:
>
> {snip - a bunch of null buffers}
>
> ~1:[MLX4_HCA] mlnx_query_ca() :***ERROR*** ib_query_device failed (-14)
> ~1:[MLX4_HCA] mlnx_query_ca() :***ERROR*** completes with ERROR status 2b
> ~1:[MLX4_HCA] mlnx_post_send() :***ERROR*** post_send failed with status 2b
> [IPoIB]:ipoib_port_send() !ERROR!: ib_post_send returned IB_ERROR
> [IPoIB]:NdisMSendCompleteX() !ERROR!: Sending status other than Success to NDIS
> ~1:[MLX4_HCA] mlnx_post_send() :***ERROR*** post_send failed with status 2b
> [IPoIB]:ipoib_port_send() !ERROR!: ib_post_send returned IB_ERROR
> [IPoIB]:NdisMSendCompleteX() !ERROR!: Sending status other than Success to NDIS
> ~0:[MLX4_HCA] mlnx_post_send() :***ERROR*** post_send failed with status 2b
> ~0:[MLX4_HCA] mlnx_post_send() :***ERROR*** post_send failed with status 2b
> ~0:[MLX4_HCA] mlnx_modify_qp() :***ERROR*** ibv_modify_qp failed (-14)
> ~0:[MLX4_HCA] mlnx_modify_qp() :***ERROR*** completes with ERROR status 2b
> [IPoIB]:ipoib_port_down() !ERROR!: ib_modify_qp to error state returned IB_ERROR.
> ~0:[MLX4_HCA] mlnx_post_send() :***ERROR*** post_send failed with status 2b
> ~0:[MLX4_HCA] mlnx_post_send() :***ERROR*** post_send failed with status 2b
> ~0:[MLX4_HCA] mlnx_post_send() :***ERROR*** post_send failed with status 2b
>
> *** Assertion failed: !p_obj->ref_cnt
> *** Source File: c:\mshefty\scm\winof\branches\winverbs\core\complib\cl_obj.c, line 701
>
> ipoib!__destroy_obj
> ipoib!cl_obj_destroy
> ipoib!ipoib_port_destroy
> ipoib!__ipoib_adapter_reset
>
> p_adapter->state is set to 0x1002 (which I believe is add port)
> p_port->obj.ref_cnt is set to 0x203
>
> If I ignore the assertion, it repeats itself roughly every 10
> seconds. See my other reply regarding a bug in the error
> handling in the mlx4 code.
>
> - Sean
>
> _______________________________________________
> ofw mailing list
> ofw at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
>