[ofw] crash in mlx4 driver - or maybe it's an ipoib issue?

Sean Hefty sean.hefty at intel.com
Sun Mar 15 23:56:04 PDT 2009


>This looks like a different problem, possibly a consequence of the new
>driver behavior on an HCA fatal error.
>The driver now resets the card to bring it to a known state.
>It seems that IPoIB is not ready for that.
>The reference counter of 0x203 suggests that IPoIB takes a reference
>every time it posts a send or a receive WQE.
>It intends to release the reference on completion, but the reset card
>doesn't produce completions.
>So it gets stuck in the destroy_obj loop, asserting once every 10 seconds that
>the ref_cnt is still high.
>Tzachi, could you check my "theory"?

That was my thinking as well.  It looks like multiple problems are being
exposed here, all starting with whatever is causing the HCA to hit a fatal
error.
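
To make the suspected sequence concrete, here is a minimal sketch in C.  It is
a toy model of the theory above, not the actual IPoIB or mlx4 code: the
sim_port_t type and all sim_* functions are made-up names, and the flush step
at the end is only one assumed way a driver could release the references after
a reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for an IPoIB port object (not the real structure). */
typedef struct {
    int  ref_cnt;    /* one reference per outstanding WQE */
    int  posted;     /* WQEs posted but not yet completed */
    bool hca_reset;  /* set once the HCA was reset after a fatal error */
} sim_port_t;

/* Posting a send or receive WQE takes a reference on the object. */
static void sim_post_wqe(sim_port_t *p)
{
    p->ref_cnt++;
    p->posted++;
}

/* Reap completions and drop one reference per completion.
 * A reset HCA is modeled as producing no completions at all. */
static int sim_poll_completions(sim_port_t *p)
{
    int reaped;

    if (p->hca_reset)
        return 0;

    reaped = p->posted;
    p->ref_cnt -= reaped;
    p->posted = 0;
    return reaped;
}

/* Destroy path: poll up to max_polls times for the count to reach zero.
 * Returns true on success, false if references are still held. */
static bool sim_destroy(sim_port_t *p, int max_polls)
{
    int i;

    for (i = 0; i < max_polls; i++) {
        sim_poll_completions(p);
        if (p->ref_cnt == 0)
            return true;
        /* Models the periodic "ref_cnt is still high" assertion. */
        fprintf(stderr, "destroy: ref_cnt still 0x%x\n", (unsigned)p->ref_cnt);
    }
    return false;
}

/* Assumed remedy (not from the driver source): after a reset, treat every
 * outstanding WQE as flushed and release its reference explicitly. */
static void sim_flush_on_reset(sim_port_t *p)
{
    assert(p->hca_reset);
    p->ref_cnt -= p->posted;
    p->posted = 0;
}
```

In this model, posting a few WQEs, setting hca_reset, and then calling
sim_destroy() never succeeds no matter how many times it polls, which matches
the stuck destroy loop described above.  Calling sim_flush_on_reset() first
lets the destroy go through.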

Does a fatal error *only* occur based on reading back some value from the
hardware?  Or is it possible for a software bug to trigger this?  (I can't
help but assume that something in the winverbs driver is starting this
reaction.)



