[ofw] crash in mlx4 driver - or maybe it's an ipoib issue?

Leonid Keller leonid at mellanox.co.il
Mon Mar 16 02:29:07 PDT 2009


See inline 

> -----Original Message-----
> From: Sean Hefty [mailto:sean.hefty at intel.com] 
> Sent: Monday, March 16, 2009 8:56 AM
> To: Leonid Keller; ofw at lists.openfabrics.org
> Cc: Tzachi Dar
> Subject: RE: [ofw] crash in mlx4 driver - or maybe it's an ipoib issue?
> 
> >Seems like it is another problem, and maybe it is a consequence of a new
> >driver behavior upon HCA fatal error.
> >It now resets the card to bring it to a known state.
> >Seems like IPoIB is not ready for that.
> >The reference counter of 0x203 suggests that IPoIB takes a reference
> >every time it posts a send or a receive WQE.
> >It intends to release the reference on completion, but the reset card
> >doesn't produce completions.
> >So it gets stuck in the destroy_obj loop, asserting once every 10 seconds
> >that the ref_cnt is still high.
> >Tzachi, could you check my "theory"?
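
To make the theory above concrete, here is a minimal sketch in plain C. It is
not the IPoIB or IBAL code: the toy_* names, the structure layout and the
bounded polling loop are all assumptions made for illustration. It only models
"take a reference per posted WQE, release it on completion" and shows why
destroy never finishes if a reset drops the completions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an IPoIB object; not the real structure. */
struct toy_obj {
    int ref_cnt;      /* base reference plus one per outstanding WQE */
    int outstanding;  /* WQEs posted but not yet completed */
};

/* Posting a send or receive WQE takes a reference. */
static void toy_post_wqe(struct toy_obj *obj)
{
    obj->ref_cnt++;
    obj->outstanding++;
}

/* A completion releases the reference taken at post time. */
static void toy_complete_wqe(struct toy_obj *obj)
{
    assert(obj->outstanding > 0);
    obj->outstanding--;
    obj->ref_cnt--;
}

/*
 * Model of a card reset. With flush == false the outstanding WQEs are
 * simply lost and never complete (the suspected current behavior).
 * With flush == true every outstanding WQE is completed first.
 */
static void toy_reset_hca(struct toy_obj *obj, bool flush)
{
    if (!flush)
        return;
    while (obj->outstanding > 0)
        toy_complete_wqe(obj);
}

/*
 * Model of the destroy loop: drop the base reference, then poll until
 * ref_cnt reaches zero. The real loop would sleep between polls; here
 * the loop is bounded so the sketch terminates. Returns true on success.
 */
static bool toy_destroy_obj(struct toy_obj *obj, int max_polls)
{
    obj->ref_cnt--;
    for (int i = 0; i < max_polls; i++) {
        if (obj->ref_cnt == 0)
            return true;
    }
    return obj->ref_cnt == 0;
}
```

In this model, toy_destroy_obj() only succeeds after a reset if the reset
flushed the outstanding WQEs; otherwise ref_cnt stays high forever.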
> 
> This is what my thoughts were as well.  It looks like there 
> are multiple problems being exposed here, all starting with 
> whatever is causing the HCA to hit a fatal error.
> 
> Does a fatal error *only* occur based on reading back some 
> value from the hardware?  Or, is it possible for a software 
> bug to trigger this?  (I can't but assume that something in 
> the winverbs driver is starting this reaction.)
> 
We also reset the card in mlx4_cmd_wait() when a command times out
(because a timeout usually means that the card is partly or fully stuck).
In that case you should see a message like "mlx4_cmd_wait: Command %02x
completed with timeout after %d msecs \n".
But in this case, too, the cause comes from the HW.
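
A rough sketch of that timeout path, in plain C, might look as follows. This
is not the mlx4 source: the toy_* names, the polling callback and the
step-based "clock" are assumptions for illustration. Only the message format
is taken from the text above.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical device state; the real driver keeps far more than this. */
struct toy_dev {
    bool reset_done;
};

/* Hypothetical poll callback: returns true once the command has completed. */
typedef bool (*toy_poll_fn)(void *ctx);

/* Example callback: reports completion after *ctx unsuccessful polls. */
static bool toy_poll_countdown(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return false;
    }
    return true;
}

/* Stand-in for the card reset that brings the HCA to a known state. */
static void toy_reset_card(struct toy_dev *dev)
{
    dev->reset_done = true;
}

/*
 * Wait for a command by polling every step_ms until timeout_ms has
 * elapsed (time is simulated, nothing sleeps). On timeout, log the
 * message and reset the card. Returns 0 on completion, -1 on timeout.
 */
static int toy_cmd_wait(struct toy_dev *dev, unsigned op, int timeout_ms,
                        int step_ms, toy_poll_fn done, void *ctx)
{
    if (step_ms <= 0)
        step_ms = 1;  /* guard against a loop that never advances */

    for (int waited = 0; waited < timeout_ms; waited += step_ms) {
        if (done(ctx))
            return 0;
    }

    fprintf(stderr,
            "mlx4_cmd_wait: Command %02x completed with timeout after %d msecs \n",
            op, timeout_ms);
    toy_reset_card(dev);
    return -1;
}
```

The point of the sketch is only the ordering: the reset is a consequence of
the timeout, so seeing the message in the log tells you which path was taken.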
