[ofw] crash in mlx4 driver - or maybe it's an ipoib issue?
    Sean Hefty 
    sean.hefty at intel.com
       
    Sun Mar 15 23:56:04 PDT 2009
    
    
  
>This looks like a separate problem, and may be a consequence of the new
>driver behavior upon an HCA fatal error.
>The driver now resets the card to bring it to a known state.
>It seems IPoIB is not ready for that.
>A reference counter of 0x203 suggests that IPoIB takes a reference
>every time it posts a send or a receive WQE.
>It intends to release that reference on completion, but the reset card
>doesn't produce completions.
>So it gets stuck in the destroy_obj loop, asserting once every 10 seconds that
>the ref_cnt is still high.
>Tzachi, could you check my "theory"?
Those were my thoughts as well.  It looks like multiple problems are being
exposed here, all starting with whatever is causing the HCA to hit a fatal
error.
Does a fatal error *only* occur based on reading back some value from the
hardware?  Or is it possible for a software bug to trigger this?  (I can't help
but assume that something in the winverbs driver is starting this reaction.)
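
To make the theory in the quoted text concrete, here is a minimal sketch in
plain C of how a per-WQE reference could keep a destroy loop from ever
finishing once the card stops producing completions.  Everything in it
(port_obj_t, post_wqe, poll_completion, destroy_obj_sim) is a hypothetical
stand-in made up for illustration; it is not the IPoIB or mlx4 code, and the
real destroy path may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the object being destroyed; not a driver type. */
typedef struct {
    int  ref_cnt;      /* one base reference plus one per outstanding WQE */
    int  outstanding;  /* WQEs posted but not yet completed */
    bool hca_reset;    /* true once the HCA was reset after a fatal error */
} port_obj_t;

static void port_init(port_obj_t *p)
{
    p->ref_cnt = 1;    /* base reference held by the owner */
    p->outstanding = 0;
    p->hca_reset = false;
}

/* Posting a send or receive WQE takes a reference (the assumption above). */
static void post_wqe(port_obj_t *p)
{
    p->outstanding++;
    p->ref_cnt++;
}

/* A completion releases one reference.  Returns false if nothing completed,
 * which is always the case here once the card has been reset. */
static bool poll_completion(port_obj_t *p)
{
    if (p->hca_reset || p->outstanding == 0)
        return false;
    p->outstanding--;
    p->ref_cnt--;
    return true;
}

/* Simulated destroy: drop the base reference, then wait for the remaining
 * references to drain.  The wait is bounded by max_polls so the sketch
 * terminates; returns the reference count left when it gives up. */
static int destroy_obj_sim(port_obj_t *p, int max_polls)
{
    assert(p->ref_cnt > 0);
    p->ref_cnt--;
    for (int i = 0; i < max_polls && p->ref_cnt > 0; i++) {
        if (!poll_completion(p))
            fprintf(stderr, "ref_cnt still high: 0x%x\n", (unsigned)p->ref_cnt);
    }
    return p->ref_cnt;
}
```

In this sketch, an object with three posted WQEs drains to zero when
hca_reset is false, but stays at three forever when hca_reset is true,
because nothing ever releases the per-WQE references.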
    
    