[ofw] crash in mlx4 driver - or maybe it's an ipoib issue?
Leonid Keller
leonid at mellanox.co.il
Mon Mar 16 02:29:07 PDT 2009
See inline
> -----Original Message-----
> From: Sean Hefty [mailto:sean.hefty at intel.com]
> Sent: Monday, March 16, 2009 8:56 AM
> To: Leonid Keller; ofw at lists.openfabrics.org
> Cc: Tzachi Dar
> Subject: RE: [ofw] crash in mlx4 driver - or maybe it's an ipoib issue?
>
> >Seems like it is another problem and maybe it is a consequence of a new
> >driver behavior upon HCA fatal error.
> >It now resets the card to bring it to a known state.
> >Seems like IPoIB is not ready for that.
> >Reference counter = 0x203 brings the idea, that IPoIB takes a reference
> >every time, when it posts a send or a receive WQE.
> >It intends to make a dereference on completion, but the reset card
> >doesn't produce completions.
> >So it gets stuck in destory_obj loop, asserting once in 10 seconds,
> >that the ref_cnt is still high.
> >Tzachi, could you check my "theory" ?
>
> This is what my thoughts were as well. It looks like there
> are multiple problems being exposed here, all starting with
> whatever is causing the HCA to hit a fatal error.
>
> Does a fatal error *only* occur based on reading back some
> value from the hardware? Or, is it possible for a software
> bug to trigger this? (I can't but assume that something in
> the winverbs driver is starting this reaction.)
>
We also reset the card in mlx4_cmd_wait() when a command times out
(a timeout usually means that the card is partly or fully stuck).
In that case you should see a message like "mlx4_cmd_wait: Command %02x completed with
timeout after %d msecs \n".
But in this case, too, the cause comes from the HW.
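
To illustrate the timeout path, here is a minimal sketch in C. It is not the
actual mlx4 code: struct fake_dev, hw_poll_done(), reset_card() and cmd_wait()
are hypothetical names made up for this example, and the "hardware" is
simulated by a counter. It only shows the idea: wait for the command up to a
timeout, and if it never completes, print the message and reset the card.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins; none of these names come from the mlx4 sources. */
enum cmd_status { CMD_OK = 0, CMD_TIMEOUT = -1 };

struct fake_dev {
    int polls_until_done; /* simulated HW: failed polls left; < 0 = stuck forever */
    int reset_count;      /* how many times the card was reset */
};

/* Simulated completion check: returns 1 once the command has finished. */
static int hw_poll_done(struct fake_dev *dev)
{
    if (dev->polls_until_done < 0)
        return 0;               /* stuck card never completes */
    if (dev->polls_until_done == 0)
        return 1;
    dev->polls_until_done--;
    return 0;
}

/* Simulated reset: a real driver would bring the HCA to a known state here. */
static void reset_card(struct fake_dev *dev)
{
    dev->reset_count++;
}

/* Wait for a command; each loop iteration stands for one millisecond. */
int cmd_wait(struct fake_dev *dev, unsigned op, int timeout_msecs)
{
    int elapsed;

    assert(dev != NULL);
    for (elapsed = 0; elapsed < timeout_msecs; elapsed++) {
        if (hw_poll_done(dev))
            return CMD_OK;
    }
    /* Timeout: assume the card is partly or fully stuck, so reset it. */
    printf("cmd_wait: Command %02x completed with timeout after %d msecs\n",
           op, timeout_msecs);
    reset_card(dev);
    return CMD_TIMEOUT;
}
```

With polls_until_done = 2 and a timeout of 10, cmd_wait() returns CMD_OK and
no reset happens. With polls_until_done = -1 it prints the message, resets the
simulated card once, and returns CMD_TIMEOUT.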
>