[ofw] crash in mlx4 driver - or maybe it's an ipoib issue?

Leonid Keller leonid at mellanox.co.il
Sun Mar 15 08:58:25 PDT 2009


This looks like a different problem, and it may be a consequence of the
driver's new behavior on an HCA fatal error:
it now resets the card to bring it to a known state.
It seems IPoIB is not ready for that.
The reference count of 0x203 suggests that IPoIB takes a reference
every time it posts a send or a receive WQE.
It intends to release that reference on completion, but the reset card
does not produce completions.
So it gets stuck in the __destroy_obj loop, asserting once every 10 seconds
that ref_cnt is still nonzero.
Tzachi, could you check my "theory"?
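
To make the theory concrete, here is a minimal user-space sketch of the suspected sequence. It is not the IPoIB or complib code: sim_port_t and all sim_* functions are hypothetical names invented for illustration, and the one-reference-per-posted-WQE rule is the assumption being tested, not a confirmed fact about the driver.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the port object; NOT the real IPoIB structure. */
typedef struct {
    unsigned ref_cnt;      /* references currently held on the object */
    unsigned outstanding;  /* WQEs posted to the HCA and not yet completed */
} sim_port_t;

/* Assumption under test: posting a send/receive WQE takes one reference. */
static void sim_post_wqe(sim_port_t *p)
{
    p->ref_cnt++;
    p->outstanding++;
}

/* A completion releases the reference taken when the WQE was posted. */
static void sim_complete_wqe(sim_port_t *p)
{
    assert(p->outstanding > 0 && p->ref_cnt > 0);
    p->outstanding--;
    p->ref_cnt--;
}

/* HCA reset: the hardware forgets the posted WQEs and generates no
 * completions, so sim_complete_wqe() is never called for them. */
static void sim_hca_reset(sim_port_t *p)
{
    p->outstanding = 0;    /* ref_cnt is deliberately left untouched */
}

/* Destroy polls until ref_cnt reaches zero, giving up after max_polls.
 * Returns 0 on success, or the number of leaked references. */
static unsigned sim_destroy(sim_port_t *p, unsigned max_polls)
{
    unsigned i;

    for (i = 0; i < max_polls && p->ref_cnt != 0; i++)
        printf("still waiting: ref_cnt = 0x%x\n", p->ref_cnt);
    return p->ref_cnt;
}
```

Posting a few WQEs, calling sim_hca_reset() and then sim_destroy() returns a nonzero count no matter how many polls are allowed, which would match an assertion that repeats indefinitely. If the theory holds, something on the reset path has to release the references for the WQEs that will never complete.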

> -----Original Message-----
> From: ofw-bounces at lists.openfabrics.org 
> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Sean Hefty
> Sent: Friday, March 13, 2009 9:38 PM
> To: Hefty, Sean; ofw at lists.openfabrics.org
> Subject: RE: [ofw] crash in mlx4 driver - or maybe it's an 
> ipoib issue?
> 
> This is either more random details or a completely separate 
> problem.  This time I'm running with a debugger attached 
> during the test.
> 
> After running dtest2 using libibverbs and the socket CM, the 
> test completes successfully, but then this occurs in the kernel:
> 
> CQ overrun on CQN 00008f
> 
> Detected catastrophic error on mdev FFFFFADF9C409000
> 
> ~1:[MLX4_BUS] pci_get_msi_info() :MSI-X Capability: Enabled - 0, Function Masked 0, Vectors Supported 256, Addr_Offset(BIR) 0(4), Pend_Offset(BIR) 0x1000(4)
> ~1:[MLX4_BUS] pci_get_msi_info() :MSI-X Vectors: Allocated 0 vectors
> ~1:[MLX4_BUS] pci_hca_reset() :
> Resetting HCA ...
> 
> ~1:[MLX4_BUS] pci_get_msi_info() :MSI-X Capability: Enabled - 0, Function Masked 0, Vectors Supported 256, Addr_Offset(BIR) 0(4), Pend_Offset(BIR) 0x1000(4)
> ~1:[MLX4_BUS] pci_get_msi_info() :MSI-X Vectors: Allocated 0 vectors
> ~1:[MLX4_BUS] pci_hca_reset() :HCA has been reset !
> Internal error detected:
> 
> {snip - a bunch of null buffers}
> 
> ~1:[MLX4_HCA] mlnx_query_ca() :***ERROR***  ib_query_device failed (-14)
> ~1:[MLX4_HCA] mlnx_query_ca() :***ERROR***  completes with ERROR status 2b
> ~1:[MLX4_HCA] mlnx_post_send() :***ERROR***  post_send failed with status 2b
> [IPoIB]:ipoib_port_send() !ERROR!: ib_post_send returned IB_ERROR
> [IPoIB]:NdisMSendCompleteX() !ERROR!: Sending status other than Success to NDIS
> ~1:[MLX4_HCA] mlnx_post_send() :***ERROR***  post_send failed with status 2b
> [IPoIB]:ipoib_port_send() !ERROR!: ib_post_send returned IB_ERROR
> [IPoIB]:NdisMSendCompleteX() !ERROR!: Sending status other than Success to NDIS
> ~0:[MLX4_HCA] mlnx_post_send() :***ERROR***  post_send failed with status 2b
> ~0:[MLX4_HCA] mlnx_post_send() :***ERROR***  post_send failed with status 2b
> ~0:[MLX4_HCA] mlnx_modify_qp() :***ERROR***  ibv_modify_qp failed (-14)
> ~0:[MLX4_HCA] mlnx_modify_qp() :***ERROR***  completes with ERROR status 2b
> [IPoIB]:ipoib_port_down() !ERROR!: ib_modify_qp to error state returned IB_ERROR.
> ~0:[MLX4_HCA] mlnx_post_send() :***ERROR***  post_send failed with status 2b
> ~0:[MLX4_HCA] mlnx_post_send() :***ERROR***  post_send failed with status 2b
> ~0:[MLX4_HCA] mlnx_post_send() :***ERROR***  post_send failed with status 2b
> 
> *** Assertion failed: !p_obj->ref_cnt
> ***   Source File: c:\mshefty\scm\winof\branches\winverbs\core\complib\cl_obj.c, line 701
> 
> ipoib!__destroy_obj
> ipoib!cl_obj_destroy
> ipoib!ipoib_port_destroy
> ipoib!__ipoib_adapter_reset
> 
> p_adapter->state is set to 0x1002 (which I believe is add port)
> p_port->obj.ref_cnt is set to 0x203
> 
> If I ignore the assertion, it repeats itself roughly every 10 
> seconds.  See my other reply regarding a bug in the error 
> handling in the mlx4 code.
> 
> - Sean