[ofw] crash in mlx4 driver - or maybe it's an ipoib issue?

Mon Mar 16 01:04:22 PDT 2009

As for the ipoib code: I haven't looked at it thoroughly, but the driver
is not ready to handle a fatal error of the hw. For example, if sends have
been posted on a QP, then there is no way for it to free this memory
unless it reads the completions from the CQ. Changing this is not only
complicated, but it also has a performance impact.
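
To illustrate the problem, here is a minimal toy model in C. It is only a
sketch and is not taken from the ipoib or mlx4 sources: every name in it
(toy_port, toy_post_send, toy_poll_cq and so on) is hypothetical. It assumes
one reference is taken per posted WQE and released only when the completion
is read from the CQ, so a card that was reset never lets the count drop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names exist in ipoib or mlx4. */
struct toy_port {
    int ref_cnt;   /* one base reference plus one per outstanding WQE */
    int pending;   /* WQEs posted but not yet completed */
    int hw_alive;  /* cleared when the card is reset after a fatal error */
};

static void toy_port_init(struct toy_port *p)
{
    assert(p != NULL);
    p->ref_cnt = 1;   /* base reference held by the owner */
    p->pending = 0;
    p->hw_alive = 1;
}

/* Posting a send takes a reference that only a completion releases. */
static void toy_post_send(struct toy_port *p)
{
    p->ref_cnt++;
    p->pending++;
}

/* Reap completions; returns how many were read from the (toy) CQ. */
static int toy_poll_cq(struct toy_port *p)
{
    int reaped;

    if (!p->hw_alive)
        return 0;     /* a reset card produces no completions */

    reaped = p->pending;
    p->ref_cnt -= reaped;
    p->pending = 0;
    return reaped;
}

/* Models the driver resetting the card on a fatal error. */
static void toy_fatal_reset(struct toy_port *p)
{
    p->hw_alive = 0;
}

/* Destroy can finish only when just the base reference is left. */
static int toy_can_destroy(const struct toy_port *p)
{
    return p->ref_cnt == 1;
}
```

In this model, posting three sends and then calling toy_fatal_reset() leaves
ref_cnt at 4 forever, because toy_poll_cq() returns 0 from then on. Fixing
that would need some path that releases the references without completions,
which is the complicated part.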

As for detecting internal errors:
Internal errors can be detected by two means:
1) Reading a value from the hw.
2) A command that is not completed on time.

Both can be caused by sw that issues a wrong command.
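
The two detection means can be sketched roughly as follows. This is an
illustration only, not the mlx4 code: the struct, the field names, the
"nonzero means error" convention and the timeout value are all assumptions
made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names and values; not taken from the mlx4 driver. */
enum toy_err_source {
    TOY_ERR_NONE,
    TOY_ERR_HW_VALUE,     /* 1) a value read from the hw signals an error */
    TOY_ERR_CMD_TIMEOUT   /* 2) a command was not completed on time */
};

struct toy_hca {
    uint32_t status_word;    /* stands in for the value read from the hw;
                                nonzero is assumed to mean an error */
    int      cmd_outstanding; /* nonzero while a command is in flight */
    uint64_t cmd_issued_ms;   /* time at which that command was issued */
};

#define TOY_CMD_TIMEOUT_MS 10000u  /* arbitrary value for the sketch */

/* Returns which of the two means, if any, reports an internal error. */
static enum toy_err_source
toy_check_internal_error(const struct toy_hca *h, uint64_t now_ms)
{
    assert(h != 0);

    if (h->status_word != 0)
        return TOY_ERR_HW_VALUE;

    if (h->cmd_outstanding &&
        now_ms - h->cmd_issued_ms > TOY_CMD_TIMEOUT_MS)
        return TOY_ERR_CMD_TIMEOUT;

    return TOY_ERR_NONE;
}
```

Neither check can tell whether the root cause was the hw itself or a bad
command sent by sw; both only report that something went wrong.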

Here are a few quick things to check:
1) What device are you using?
2) Are you using the latest fw?
3) Can you try this with a different card?
4) I have made a checkin that changes the behavior on internal error
(checkin 2001).
Can you try running before this checkin and after it? I want to see if
it has an influence on the problem.

Thanks
Tzachi

> -----Original Message-----
> From: Sean Hefty [mailto:sean.hefty at intel.com] 
> Sent: Monday, March 16, 2009 8:56 AM
> To: Leonid Keller; ofw at lists.openfabrics.org
> Cc: Tzachi Dar
> Subject: RE: [ofw] crash in mlx4 driver - or maybe it's an 
> ipoib issue?
> 
> >Seems like it is another problem and maybe it is a consequence of a new
> >driver behavior upon HCA fatal error.
> >It now resets the card to bring it to a known state.
> >Seems like IPoIB is not ready for that.
> >Reference counter = 0x203 brings the idea, that IPoIB takes a reference
> >every time, when it posts a send or a receive WQE.
> >It intends to make a dereference on completion, but the reset card
> >doesn't produce completions.
> >So it gets stuck in destory_obj loop, asserting once in 10 seconds,
> >that the ref_cnt is still high.
> >Tzachi, could you check my "theory" ?
> 
> This is what my thoughts were as well.  It looks like there 
> are multiple problems being exposed here, all starting with 
> whatever is causing the HCA to hit a fatal error.
> 
> Does a fatal error *only* occur based on reading back some 
> value from the hardware?  Or, is it possible for a software 
> bug to trigger this?  (I can't but assume that something in 
> the winverbs driver is starting this reaction.)
> 
>